Introducing Vision Language Model (VLM): A Powerful AI System
A Vision Language Model (VLM) is an artificial intelligence system that combines natural language understanding with image recognition. Like OpenAI’s CLIP, a VLM relates text descriptions to images, which enables applications in computer vision, content generation, and human-computer interaction. VLMs can understand and generate text grounded in visual content, which makes them an important technology in the field of AI.
The Benefits of Contrastive Pre-Training for VLMs
In a recent study, researchers from Google Research, Google DeepMind, and Google Cloud compared Vision Transformer (ViT) models pre-trained with classification objectives to those pre-trained with contrastive objectives. They found that the contrastively pre-trained models, specifically SigLIP-based PaLI, performed better on tasks related to localization and text understanding. By scaling the SigLIP image encoder to 2 billion parameters, they also achieved a new state of the art in multilingual cross-modal retrieval. Earlier work on PaLI-X, a large Vision Language Model, had demonstrated the benefits of scaling up classification-pretrained image encoders; this study instead highlights the advantages of pre-training visual encoders on web-scale image-text data rather than classification-style data.
The PaLI-3 Model: Enhanced Localization and Text Understanding
The study also explores the scalability of VLMs and argues that smaller-scale models matter for practicality and efficient research. The researchers introduce PaLI-3, a 5-billion-parameter VLM that shows competitive results. PaLI-3’s training recipe has three main components: contrastive pre-training of the image encoder on web-scale data, improved dataset mixing, and higher-resolution training. They also introduce a 2-billion-parameter multilingual contrastive vision model. Ablation studies confirm the superiority of contrastively pre-trained models, particularly in tasks related to localization and visually situated text understanding.
Their approach uses a ViT model, specifically ViT-G/14, pre-trained with the SigLIP recipe as the image encoder. ViT-G/14, with approximately 2 billion parameters, serves as the vision backbone for PaLI-3. The contrastive pre-training embeds images and texts separately and then classifies whether each image-text pair corresponds. At the VLM stage, visual tokens from the ViT’s output are projected and combined with text tokens. These inputs are then processed by a 3-billion-parameter UL2 encoder-decoder language model for text generation, typically driven by task-specific prompts such as VQA questions.
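The two steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes a SigLIP-style pairwise sigmoid loss over L2-normalized embeddings, and the function names, the temperature and bias defaults (10 and -10), and the choice to place visual tokens before text tokens are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Treat every image-text pair in the batch as a binary classification.

    Pair (i, i) is a positive (label +1); every other pair is a negative
    (label -1). The temperature and bias defaults are assumed values.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / len(images)


def combine_tokens(visual_tokens, text_tokens, projection):
    """Linearly project visual tokens to the text width, then concatenate.

    `projection` is a (d_visual x d_model) matrix given as a list of rows.
    Putting the visual tokens first is an assumption of this sketch.
    """
    d_model = len(projection[0])
    projected = [
        [sum(tok[k] * projection[k][m] for k in range(len(tok))) for m in range(d_model)]
        for tok in visual_tokens
    ]
    return projected + text_tokens
```

In a real system the embeddings come from the image and text towers, the temperature and bias are learned, and the combined sequence is what the encoder-decoder language model consumes. Here, matched pairs yield a lower loss than mismatched ones, which is the signal the image encoder is trained on.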
PaLI-3 surpasses its larger counterparts, particularly in localization and visually situated text understanding, which indicates that contrastive pre-training is the more effective choice for localization tasks. The SigLIP-based PaLI model, with its contrastively pre-trained image encoder, achieves a new state of the art in multilingual cross-modal retrieval. The full PaLI-3 model also outperforms the previous state of the art in referring expression segmentation and maintains low error rates across subgroups in detection tasks. On its own, the ViT-G image encoder of PaLI-3 excels in multiple classification and cross-modal retrieval tasks.
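To make the cross-modal retrieval claim concrete, the sketch below shows one common way such a task is scored: each image embedding is used as a query, candidate text embeddings are ranked by cosine similarity, and recall@k counts how often the matching text lands in the top k. The article does not state the exact metric or benchmark, so the metric choice, the function names, and the convention that query i matches candidate i are assumptions of this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(query_embs, candidate_embs, k=1):
    """Fraction of queries whose matching candidate ranks in the top k.

    Assumes query i corresponds to candidate i (a hypothetical convention).
    """
    hits = 0
    for i, query in enumerate(query_embs):
        scores = [cosine(query, cand) for cand in candidate_embs]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)
```

Swapping the roles of the two arguments gives text-to-image retrieval, and running the same scoring over captions in several languages would correspond to the multilingual setting the study reports on.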
Conclusion: The Power of Contrastive Pre-Training in VLMs
In conclusion, this research emphasizes the benefits of contrastive pre-training, specifically the SigLIP approach, for building stronger and more compact VLMs. The 5-billion-parameter SigLIP-based PaLI-3 model excels in localization and text understanding, outperforming larger counterparts on diverse multimodal benchmarks. The contrastive pre-training of the image encoder in PaLI-3 also establishes a new state of the art in multilingual cross-modal retrieval. Further investigation is needed into the components of VLM training beyond image encoder pre-training to improve model performance.
For more information, you can read the full research paper.