Introducing VeCLIP: Enhancing Image-Text Alignment with Visual-enriched Captions
Large-scale web-crawled datasets are crucial for the success of vision-language models like CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts make precise image-text alignment difficult. Existing methods that use large language models (LLMs) to rewrite captions have shown promise, but only on small, curated datasets.
What is VeCLIP?
This study introduces a scalable pipeline for rewriting noisy captions. Unlike other LLM rewriting techniques, the pipeline emphasizes incorporating visual concepts into the captions, producing what the authors call Visual-enriched Captions (VeCap). To preserve data diversity, a mixed training scheme is proposed that optimizes the use of the original AltTexts alongside the newly generated VeCap; CLIP trained with this scheme is referred to as VeCLIP.
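To make the mixed training scheme concrete, here is a minimal sketch of how caption mixing could be implemented: each time an image is sampled, the data loader randomly serves either its original AltText or its rewritten VeCap caption. The field names and the 50/50 sampling probability are illustrative assumptions, not the paper's exact recipe.

```python
import random
from torch.utils.data import Dataset

class MixedCaptionDataset(Dataset):
    """Serves each image with either its raw AltText or its LLM-rewritten
    VeCap caption, sampled anew on every access (hypothetical field names)."""

    def __init__(self, records, vecap_prob=0.5):
        # `records`: list of dicts with keys "image", "alt_text", "vecap".
        # `vecap_prob`: probability of picking the VeCap caption (assumed 0.5).
        self.records = records
        self.vecap_prob = vecap_prob

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        # Mixing the two caption sources keeps the supervision diverse:
        # AltTexts retain web-scale variety, VeCap adds visual detail.
        use_vecap = random.random() < self.vecap_prob
        caption = rec["vecap"] if use_vecap else rec["alt_text"]
        return rec["image"], caption
```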
Features of VeCLIP
The approach is applied to training CLIP on large-scale web-crawled data, scaling the resulting VeCap dataset to 300 million samples. The results show significant improvements in image-text alignment and overall model performance: VeCLIP achieves up to a +25.2% gain on COCO and Flickr30k retrieval tasks under the 12M setting. It is also data-efficient, reaching a +3% gain while using only 14% of the data used by vanilla CLIP and 11% of that used by ALIGN.
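The retrieval numbers above are recall-based: a query counts as correct when its matching item appears among the top-K results. Below is a minimal sketch of how recall@K can be computed from precomputed, L2-normalized embeddings; it illustrates the metric only and is not the paper's evaluation code.

```python
import torch

def recall_at_k(image_embeds: torch.Tensor, text_embeds: torch.Tensor, k: int = 1) -> float:
    """Text-to-image recall@K, assuming row i of each [N, D] matrix is a
    matched image-text pair and both matrices are L2-normalized."""
    sims = text_embeds @ image_embeds.T                 # [N, N] cosine similarities
    topk = sims.topk(k, dim=1).indices                  # top-K image indices per caption
    targets = torch.arange(sims.size(0)).unsqueeze(1)   # ground-truth index per row
    hits = (topk == targets).any(dim=1).float()
    return hits.mean().item()
```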
Benefits of VeCLIP
The VeCap dataset is also complementary to well-curated datasets, enhancing performance on zero-shot classification tasks. Combining VeCap with DFN yields strong performance on both image-text retrieval and zero-shot classification, for example 83.1% accuracy@1 on zero-shot ImageNet with an H/14 model.
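For readers unfamiliar with the zero-shot protocol behind the ImageNet number, the sketch below shows the standard CLIP-style procedure: embed one text prompt per class, classify each image by its most similar prompt, and report accuracy@1. The `encode_image` / `encode_text` methods and the prompt template are assumptions standing in for whatever model interface is actually used.

```python
import torch

@torch.no_grad()
def zero_shot_accuracy(model, images, labels, class_names) -> float:
    """CLIP-style zero-shot classification: the nearest text prompt wins."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_feat = model.encode_text(prompts)                  # [C, D] class embeddings
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    image_feat = model.encode_image(images)                 # [N, D] image embeddings
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    preds = (image_feat @ text_feat.T).argmax(dim=-1)       # index of best-matching prompt
    return (preds == labels).float().mean().item()
```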