Advancements and Failures of Multimodal Large Language Models

AI Researchers Discover Flaws in Multimodal Large Language Models

Multimodal large language models (MLLMs) pair visual inputs with large language models (LLMs) to perform tasks such as visual question answering and image understanding. However, recent work shows that these models still exhibit systematic visual shortcomings.

Researchers at UC Berkeley and New York University found that these flaws may stem from the visual representations produced by the pretrained vision encoders that MLLMs rely on. Most MLLMs process images with a pretrained Contrastive Language-Image Pre-training (CLIP) model, but CLIP struggles to encode certain visual patterns and details, which leads to mistakes in downstream tasks.
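To make this pipeline concrete, here is a minimal sketch (not the authors' code) of how an MLLM typically extracts visual features from a frozen CLIP vision encoder using the Hugging Face transformers library; the checkpoint name and image path are illustrative choices, not details from the paper.

```python
# A minimal sketch of obtaining visual features from a frozen, pretrained
# CLIP vision encoder, the kind of backbone most MLLMs build on.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

model_name = "openai/clip-vit-large-patch14"  # a commonly used CLIP backbone
processor = CLIPImageProcessor.from_pretrained(model_name)
vision_encoder = CLIPVisionModel.from_pretrained(model_name).eval()

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_encoder(**inputs)

# Patch-level features that an MLLM would project into the LLM's token space.
patch_features = outputs.last_hidden_state  # shape: (1, num_patches + 1, hidden_dim)
print(patch_features.shape)
```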

To probe these issues, the researchers introduced a new benchmark called MultiModal Visual Patterns (MMVP) to evaluate the visual abilities of MLLMs. Existing MLLMs, including GPT-4V, performed poorly on MMVP, leaving a significant gap to human performance.
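MMVP scores models on pairs of visually similar images. The sketch below is a rough illustration of pair-based scoring, assuming a pair counts as correct only when the model answers the question correctly for both images; the data fields and the answer_fn callback are hypothetical, not the benchmark's actual schema.

```python
# A hedged sketch of pair-based scoring in the spirit of MMVP.
from typing import Callable, List, Dict

def mmvp_pair_accuracy(pairs: List[Dict], answer_fn: Callable[[str, str], str]) -> float:
    """pairs: [{"question": str, "image_a": str, "image_b": str,
                "answer_a": str, "answer_b": str}, ...]
       answer_fn(image_path, question) -> the model's answer string."""
    correct_pairs = 0
    for pair in pairs:
        ok_a = answer_fn(pair["image_a"], pair["question"]).strip() == pair["answer_a"]
        ok_b = answer_fn(pair["image_b"], pair["question"]).strip() == pair["answer_b"]
        # Credit the pair only if both images are answered correctly.
        if ok_a and ok_b:
            correct_pairs += 1
    return correct_pairs / max(len(pairs), 1)
```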

The researchers also identified nine visual patterns that CLIP models struggle with, such as orientation and viewpoint. Scaling up the model or its training data did not resolve these failures, and they concluded that CLIP's visual shortcomings directly limit the performance of MLLMs built on top of it.

In response to these findings, the researchers developed a new approach called Mixture-of-Features (MoF) to enhance the visual grounding of MLLMs. By combining features from CLIP with those of DINOv2, a vision-only self-supervised model, they improved visual grounding without sacrificing the model's ability to follow instructions.
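The sketch below illustrates the general idea of mixing the two feature streams: tokens from CLIP and from a self-supervised encoder such as DINOv2 are each projected into the LLM's embedding space and then interleaved. The module names, dimensions, and simple interleaving scheme are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of the Mixture-of-Features idea in PyTorch.
import torch
import torch.nn as nn

class MixtureOfFeatures(nn.Module):
    def __init__(self, clip_dim: int, dino_dim: int, llm_dim: int):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, llm_dim)  # adapter for CLIP tokens
        self.dino_proj = nn.Linear(dino_dim, llm_dim)  # adapter for DINOv2 tokens

    def forward(self, clip_tokens: torch.Tensor, dino_tokens: torch.Tensor) -> torch.Tensor:
        # clip_tokens: (batch, n, clip_dim), dino_tokens: (batch, n, dino_dim)
        c = self.clip_proj(clip_tokens)
        d = self.dino_proj(dino_tokens)
        # Interleave the two token streams along the sequence dimension.
        mixed = torch.stack([c, d], dim=2).flatten(1, 2)  # (batch, 2n, llm_dim)
        return mixed

# Example: 256 patch tokens from each encoder, projected to a 4096-dim LLM space.
mof = MixtureOfFeatures(clip_dim=1024, dino_dim=1024, llm_dim=4096)
mixed_tokens = mof(torch.randn(1, 256, 1024), torch.randn(1, 256, 1024))
print(mixed_tokens.shape)  # torch.Size([1, 512, 4096])
```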

Overall, the research highlights the weaknesses of existing visual representation models and the need for new evaluation benchmarks to drive progress in visual representation learning. The team hopes their work will inspire better vision models and, in turn, better-performing MLLMs.

For more information about the research, see the paper and the GitHub repository.
