VeCLIP: Enhancing Image-Text Alignment with Visual-enriched Captions

Introducing VeCLIP: Enhancing Image-Text Alignment with Visual-enriched Captions

Large-scale web-crawled datasets are crucial for the success of vision-language models such as CLIP. However, AltTexts gathered by web crawling are often noisy or only loosely related to their images, which makes accurate image-text alignment difficult. Existing methods that rewrite captions with large language models (LLMs) have shown promise, but only on small, curated datasets.

What is VeCLIP?

This study introduces a scalable pipeline for rewriting noisy captions; the CLIP model trained with it is called VeCLIP. Unlike other LLM-based rewriting techniques, the pipeline focuses on incorporating visual concepts into the captions, producing Visual-enriched Captions (VeCap). To preserve data diversity, a mixed training scheme is proposed that uses both the original AltTexts and the newly generated VeCap, as sketched below.
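To make the mixed training idea concrete, here is a minimal sketch in which each image is paired at random with either its original AltText or its rewritten VeCap caption. The dataset class, the field names ("alt_text", "vecap"), and the 50/50 mixing probability are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: the class, field names, and the 50/50 coin flip are
# assumptions for exposition, not the authors' released implementation.
import random

from PIL import Image
from torch.utils.data import Dataset


class MixedCaptionDataset(Dataset):
    """Pairs each image with either its noisy AltText or its rewritten VeCap."""

    def __init__(self, records, vecap_prob=0.5, transform=None, tokenizer=None):
        self.records = records        # list of dicts: image_path, alt_text, vecap
        self.vecap_prob = vecap_prob  # probability of sampling the VeCap caption
        self.transform = transform    # image preprocessing (e.g. CLIP transforms)
        self.tokenizer = tokenizer    # tokenizer matching the text encoder

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        # Coin flip between the original AltText and the visual-enriched caption.
        caption = rec["vecap"] if random.random() < self.vecap_prob else rec["alt_text"]
        image = Image.open(rec["image_path"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        tokens = self.tokenizer(caption) if self.tokenizer is not None else caption
        return image, tokens
```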

Features of VeCLIP

The pipeline is applied to CLIP training on large web-crawled datasets, yielding a 300-million-sample VeCap dataset. The results show significant improvements in image-text alignment and model performance, with gains of up to +25.2% on COCO and Flickr30k retrieval tasks under the 12M setting. VeCLIP is also data-efficient, achieving a +3% gain while using only 14% of the data used by vanilla CLIP and 11% of that used by ALIGN.
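For context, VeCLIP trains the same kind of dual-encoder model as CLIP, optimizing the standard symmetric contrastive (InfoNCE) objective sketched below. The function name and the 0.07 temperature are placeholders rather than values from the paper.

```python
# Minimal sketch of the symmetric CLIP contrastive loss (standard InfoNCE).
# The temperature value is a common default, not a setting from the paper.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: (batch, dim) embeddings from the two encoders."""
    # L2-normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```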

Benefits of VeCLIP

The VeCap dataset is also complementary to well-curated datasets, enhancing performance on zero-shot classification tasks. Combining VeCap with DFN (Data Filtering Networks) yields strong results in both image-text retrieval and zero-shot classification, for example 83.1% accuracy@1 on zero-shot ImageNet with an H/14 model.
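As a reminder of how such zero-shot numbers are obtained, the sketch below follows the usual CLIP recipe: encode one text prompt per class name, encode the image, and predict the class with the highest cosine similarity. The encoder and tokenizer arguments and the single prompt template are simplifying assumptions; real evaluations typically average several templates per class.

```python
# Hypothetical sketch of CLIP-style zero-shot classification; `image_encoder`,
# `text_encoder`, and `tokenizer` are placeholder callables, not a specific API.
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer, image, class_names,
                       prompt="a photo of a {}"):
    # Encode one text prompt per class, e.g. "a photo of a goldfish".
    prompts = [prompt.format(name) for name in class_names]
    text_features = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)

    # Encode the image and compare it against every class embedding.
    image_features = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)
    similarities = image_features @ text_features.t()  # shape: (1, num_classes)

    # accuracy@1 counts a prediction correct when the top class matches the label.
    return class_names[similarities.argmax(dim=-1).item()]
```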

