Pre-training vision-language (VL) models on large image-caption and image-classification datasets can improve performance on zero-shot recognition tasks. However, naively combining these datasets can result in biased representations. In “Prefix Conditioning Unifies Language and Label Supervision”, presented at CVPR 2023, researchers from the Cloud AI and Perception teams introduce a pre-training strategy that addresses this issue.
The researchers demonstrate that simply unifying the datasets leads to sub-optimal performance due to dataset bias: caption text typically describes a whole scene in free-form language, while a classification label names only a single object category. To overcome this, they propose a method called prefix conditioning. This method prepends learnable prefix tokens to the text input to disentangle dataset-specific bias from visual concepts, allowing the language encoder to learn from both datasets while tailoring its feature extraction to each one.
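A minimal sketch of the idea: one set of learnable prefix tokens per source dataset is prepended to the token embeddings before they enter a shared text encoder. The class and parameter names below are illustrative (the paper does not prescribe this exact architecture), and the tiny Transformer stands in for the real language encoder.

```python
import torch
import torch.nn as nn

class PrefixConditionedTextEncoder(nn.Module):
    """Hypothetical sketch of prefix conditioning: learnable per-dataset
    prefix tokens (0 = image-caption data, 1 = classification data) are
    prepended to the token embeddings before a shared text encoder."""

    def __init__(self, vocab_size=1000, dim=64, n_prefix=4, n_datasets=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # One learnable prefix per source dataset.
        self.prefix = nn.Parameter(torch.randn(n_datasets, n_prefix, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, dataset_id):
        tok = self.embed(token_ids)                        # (B, T, D)
        pre = self.prefix[dataset_id].expand(tok.size(0), -1, -1)
        x = torch.cat([pre, tok], dim=1)                   # prepend prefix tokens
        h = self.encoder(x)
        return h.mean(dim=1)                               # pooled text embedding

torch.manual_seed(0)
enc = PrefixConditionedTextEncoder().eval()
ids = torch.randint(0, 1000, (2, 8))
z_caption = enc(ids, dataset_id=0)  # caption-style conditioning
z_label = enc(ids, dataset_id=1)    # classification-style conditioning
```

The same text encoder thus produces different embeddings for the same input depending on which dataset's prefix is used, which is what lets one encoder serve both supervision types.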
Prefix conditioning is a versatile technique that can be easily integrated into existing VL pre-training objectives. It improves the generalization of models for zero-shot recognition tasks by using language embeddings tailored for the caption dataset.
The researchers compared the performance of models trained with prefix conditioning to models trained on only ImageNet-21K (IN21K) or only Conceptual 12M (CC12M). The models trained with prefix conditioning showed significant improvements in zero-shot classification accuracy.
Additionally, the researchers analyzed the impact of the prefix used at test time. They found that using the prefix tailored to the image-caption dataset improved generalization to different scene types and vocabulary words.
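At test time, zero-shot classification embeds each class name (with the chosen prefix) through the language encoder and picks the class whose text embedding best matches the image embedding. A minimal sketch of that matching step, with toy vectors standing in for real encoder outputs (the names and numbers are illustrative, not from the paper):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose (prefix-conditioned) text
    embedding has the highest cosine similarity to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

# Toy 3-D embeddings: class 1's text embedding points the same way as the image.
image_emb = np.array([1.0, 0.1, 0.0])
class_text_embs = np.array([
    [0.0, 1.0, 0.0],   # e.g. "cat"
    [0.9, 0.1, 0.0],   # e.g. "dog" (best match)
    [0.0, 0.0, 1.0],   # e.g. "car"
])
pred = zero_shot_classify(image_emb, class_text_embs)  # → 1
```

Since only the prefix changes between the caption-style and classification-style embeddings, swapping the test-time prefix shifts all class embeddings at once, which is why the prefix choice affects generalization.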
They also evaluated the robustness of the models to image distribution shifts using ImageNet variants. The results showed that the optimal prefix can vary depending on how far the test domain lies from the classification dataset's domain.
In conclusion, prefix conditioning is an effective technique for unifying image-caption and classification datasets for better zero-shot classification. Future work includes automatically identifying the optimal prefix for each test dataset.
This research was conducted by Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister, with valuable feedback from Zizhao Zhang and Sergey Ioffe.