The Significance of Contrastive Language-Image Pretraining
The UniReps Workshop at NeurIPS 2023 has accepted a paper on contrastive language-image pretraining (CLIP), a method that has become the standard approach for training vision-language models. CLIP learns to align images and captions in a shared embedding space, and its visual features are widely used as global representations of images. However, while these features transfer well to many tasks, they fall short on tasks that demand finer-grained understanding, such as object localization, pixel-level prediction, and 3D perception.
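For readers unfamiliar with the mechanics, the sketch below shows the symmetric contrastive (InfoNCE) objective at the heart of CLIP training. This is a minimal PyTorch illustration under standard assumptions; the temperature value and function name are placeholders, not details from the paper:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch, dim) tensors from the two encoders.
    Matching pairs share the same row index; all other rows act as negatives.
    """
    # Normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix, scaled by temperature.
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th caption.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy losses.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Pulling the two embedding spaces together this way yields strong global image features, which is precisely why they struggle on the localized, dense tasks mentioned above.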
The Challenge of Multi-Task Training
To address these limitations, one popular solution is multi-task training. However, building a large-scale annotated multi-task dataset is costly, and training on separate task-specific datasets brings its own challenges, such as reconciling gradients and knowledge that come from different input distributions and tasks.
Improving CLIP Features with Pseudo-Labeling
To overcome these challenges, the paper explores pseudo-labeling with task-specific experts to enhance CLIP features for more demanding downstream tasks. Multiple existing pretrained models, the experts, are used to pseudo-label an uncurated web-scale image-caption dataset. CLIP is then trained with its usual contrastive loss plus task-specific losses computed on the pseudo-labels through lightweight heads attached to the vision backbone, with the goal of improving performance on these challenging tasks.
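A rough PyTorch sketch of what one step of this training recipe might look like is given below. The head architecture, expert interfaces, loss choices, and loss weights are illustrative assumptions, not the paper's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightHead(nn.Module):
    """Small task-specific head attached to the shared vision backbone."""
    def __init__(self, feat_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, out_dim)

    def forward(self, features):
        return self.proj(features)

def training_step(vision_backbone, text_encoder, heads, experts,
                  images, captions, task_weights, temperature=0.07):
    """One training step combining contrastive and pseudo-label losses.

    heads:   dict of task name -> LightweightHead on backbone features
    experts: dict of task name -> frozen pretrained expert model
    """
    image_features = vision_backbone(images)   # (B, D) shared features
    text_features = text_encoder(captions)     # (B, D)

    # Standard symmetric CLIP contrastive loss on the image-caption pairs.
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss = (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

    # Task-specific losses against pseudo-labels from the frozen experts.
    for task, head in heads.items():
        with torch.no_grad():                  # experts only label, never train
            pseudo_labels = experts[task](images)
        predictions = head(image_features)
        # MSE is a placeholder; each task would use an appropriate loss
        # (e.g. cross-entropy for segmentation, a scale-invariant loss for depth).
        loss = loss + task_weights[task] * F.mse_loss(predictions, pseudo_labels)

    return loss
```

Because the experts supply labels on the same web-scale image-caption corpus used for contrastive training, this setup sidesteps the cost of a curated multi-task dataset while still exposing the backbone to multiple supervision signals at once.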