
Advances in Multimodal Transformers: Enhancing Language-Vision Grounding for AI Systems


The Significance of Multimodal Transformers in AI Systems

Grounding language in vision is crucial for AI systems operating in the real world. It underpins tasks such as visual question answering and applications such as generating image descriptions for visually impaired users. To tackle this challenge, researchers have developed multimodal models pre-trained on image-language pairs. Among these, multimodal transformers have emerged as a family of models that achieve state-of-the-art performance on numerous multimodal benchmarks, suggesting that the joint-encoder transformer architecture captures the alignment between image-language pairs better than previous approaches such as dual encoders.

The Advantage of Multimodal Transformers over Dual Encoders

Unlike dual encoders, where the two modalities are encoded with no interaction between them, multimodal transformers (also known as joint encoders) process image and text tokens together, which makes them markedly more sample-efficient. Zero-shot image retrieval illustrates this: the multimodal transformer UNITER performs on par with the large-scale dual encoder CLIP despite being trained on significantly less data.
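The architectural contrast above can be made concrete with a minimal numpy sketch. This is an illustrative toy, not the actual UNITER or CLIP implementation: the encoders are reduced to mean-pooling over random token embeddings, and the names (`dual_encoder_score`, `joint_encoder_score`, `w_out`) are hypothetical. The point is structural: the dual encoder never lets image and text tokens touch before a final dot product, while the joint encoder fuses the two token sequences so that (in a real model) self-attention layers could mix them.

```python
import numpy as np

rng = np.random.default_rng(0)

def pool(tokens):
    """Mean-pool a sequence of token embeddings into one vector."""
    return tokens.mean(axis=0)

def dual_encoder_score(image_tokens, text_tokens):
    """Dual encoder: each modality is encoded independently, and
    alignment is a single dot product between pooled vectors.
    There is no token-level interaction between image and text."""
    return float(pool(image_tokens) @ pool(text_tokens))

def joint_encoder_score(image_tokens, text_tokens, w_out):
    """Joint encoder (sketch): concatenate both token sequences so that
    self-attention layers (omitted here) could mix modalities, then
    score the fused representation with a learned head w_out."""
    fused = np.concatenate([image_tokens, text_tokens], axis=0)
    return float(pool(fused) @ w_out)

d = 8
image_tokens = rng.normal(size=(4, d))   # e.g. 4 image-region features
text_tokens = rng.normal(size=(6, d))    # e.g. 6 word embeddings
w_out = rng.normal(size=d)               # toy scoring head

print(dual_encoder_score(image_tokens, text_tokens))
print(joint_encoder_score(image_tokens, text_tokens, w_out))
```

A practical consequence of the dual-encoder design is that image and text vectors can be precomputed and retrieved with nearest-neighbour search, which is why it scales to web-sized corpora; the joint encoder trades that efficiency for cross-modal interaction at every layer.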

Understanding the Key Aspects of Multimodal Transformers

In our study, we investigate which elements are crucial to the success of multimodal transformers in multimodal pretraining. We find that multimodal attention, which lets the language and image streams attend to each other, is essential: models with other attention patterns fail to match shallower and smaller models that use multimodal attention, even when given more depth or parameters. We also find that comparable results can be achieved without the image loss originally proposed for multimodal transformers. This suggests that current models do not fully exploit the valuable signal in the image modality, possibly because of shortcomings in the image loss formulation.
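The multimodal attention described above can be sketched as plain scaled dot-product cross-attention, where each modality uses the other as keys and values. This is a minimal numpy illustration under simplifying assumptions: the learned query/key/value projections and multi-head structure of a real transformer layer are omitted, and the function names are ours, not from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Scaled dot-product attention: one modality's tokens (queries)
    attend over the other modality's tokens (keys and values)."""
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (n_q, n_kv)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ keys_values                      # (n_q, d)

rng = np.random.default_rng(1)
d = 8
text = rng.normal(size=(6, d))    # 6 text-token embeddings
image = rng.normal(size=(4, d))   # 4 image-region embeddings

# The cross-talk of multimodal attention: text attends to image
# regions, and image regions attend back to text tokens.
text_attended = cross_attention(text, image, d)    # shape (6, 8)
image_attended = cross_attention(image, text, d)   # shape (4, 8)
```

In a dual encoder no such step exists in either direction, which is one way to see why joint encoders can capture finer-grained image-language alignment from the same data.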

Exploring Dataset Properties and their Impact

We also examine properties of multimodal datasets, such as their size and how faithfully the language describes the corresponding image (their noisiness). We find that dataset size alone does not predict the performance of multimodal transformers; the noise level and the similarity of the dataset's language to that of the evaluation task both matter more. These findings argue for curating less noisy image-text datasets, despite the prevalent practice of scraping noisy data from the web.

The Strengths and Challenges of Multimodal Transformers

In conclusion, our analysis shows that multimodal transformers outperform dual-encoder architectures given the same amount of pretraining data, a superiority driven mainly by the cross-modal interaction that multimodal attention enables. Several open problems remain in the design of multimodal models, including better losses for the image modality and greater robustness to dataset noise. These challenges chart the path for future advances in multimodal AI systems.
