UnIVAL: A Multimodal Model for Generalist Tasks with Unified Architecture


**Introduction**

In artificial intelligence (AI), Large Language Models (LLMs) have driven major advances in text understanding and generation. However, LLMs cannot directly process information outside of text. To overcome this limitation, researchers at Sorbonne University developed UnIVAL, a multimodal model that addresses a variety of tasks across different modalities. This article explores the features and significance of UnIVAL in the field of AI.

**UnIVAL: A Multimodal Model**

UnIVAL stands out as the first model to tackle language tasks involving images, video, and audio with a single unified architecture, vocabulary, input/output format, and training objective. Unlike many other models, UnIVAL does not require massive training data or a very large model. Despite its smaller size, UnIVAL performs on par with prior models tailored to specific modalities, and the researchers report new state-of-the-art (SoTA) results on several tasks among similarly sized models.

Furthermore, the researchers demonstrate the value of multitask pretraining over traditional single-task pretraining: pretraining on additional modalities improves the model's ability to generalize to modalities it was never trained on. For example, when fine-tuned on audio-text tasks, UnIVAL achieves competitive performance without any prior audio pretraining.

The researchers also present a new study of merging multimodal models by weight interpolation. Interpolating between the weights of models fine-tuned on different tasks successfully combines their skills, yielding more robust multitask models with no additional inference overhead. This is the first work to demonstrate the effectiveness of weight interpolation for multimodal models.
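The core idea of weight interpolation can be sketched as parameter-wise linear averaging of two fine-tuned checkpoints that share the same pretrained initialization. This is a minimal illustration, not the paper's code; the function name and the toy scalar "checkpoints" are hypothetical stand-ins for real tensor-valued state dicts.

```python
def interpolate(theta_a, theta_b, alpha=0.5):
    """Return theta = (1 - alpha) * theta_a + alpha * theta_b, parameter-wise."""
    # Both checkpoints must come from fine-tunes of the same pretrained model,
    # so their parameter names (and shapes) match exactly.
    assert theta_a.keys() == theta_b.keys(), "checkpoints must share an architecture"
    return {k: (1 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}

# Toy checkpoints: parameter name -> value (real models hold tensors here).
captioning_ckpt = {"w1": 0.2, "w2": 1.0}
vqa_ckpt       = {"w1": 0.6, "w2": 0.0}

merged = interpolate(captioning_ckpt, vqa_ckpt, alpha=0.5)
# merged == {"w1": 0.4, "w2": 0.5}
```

Because the merge happens once, offline, the resulting model is a single set of weights: inference cost is identical to either parent, which is why the paper reports no added inference overhead.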

**Limitations of UnIVAL**

Despite its remarkable capabilities, UnIVAL has two significant drawbacks. First, it is prone to hallucination, inventing objects in visual descriptions that are not actually present; this object bias favors consistency over accuracy. Second, UnIVAL struggles with complex instructions, such as singling out one object from a group of similar ones, locating objects at very small or large scales, or recognizing numbers.

**Conclusion**

The introduction of UnIVAL marks an important milestone in AI research, as it addresses the limitations of previous models by offering a unified approach to multimodal tasks. By demonstrating the value of multitask pretraining and weight interpolation, UnIVAL paves the way for the development of new modality-agnostic generalist assistant agents. This research has the potential to accelerate progress in the field of AI and inspire other scientists to explore the possibilities of multimodal models.

**References:**
– Project: [UnIVAL](https://unival-model.github.io/)
– Paper: [UnIVAL](https://arxiv.org/abs/2307.16184)
– GitHub: [UnIVAL](https://github.com/mshukor/UnIVAL)
