
Multimodal Large Language Models: Bridging the Gap in Visual Communication


Multimodal Large Language Models (MLLMs) have made significant progress in recent months. These models build on Large Language Models (LLMs) and can understand visual content, but they struggle to communicate the exact locations of specific objects in an image. In contrast, humans can easily point to specific regions of a scene while discussing it, sharing information efficiently.

This type of communication is known as referential dialogue (RD), and it has many practical applications. For example, a user wearing an Extended Reality (XR) headset could ask an AI assistant a question and have the assistant highlight the relevant region in their field of view. Referential dialogue is also useful for tasks such as online shopping, or for helping visual robots understand exactly which object or region a user is referring to.

In this study, researchers from SenseTime Research, SKLSDE, Beihang University, and Shanghai Jiao Tong University have developed Shikra, a unified model that can handle spatial coordinate inputs and outputs expressed in natural language. Shikra is designed to be simple and straightforward, with no need for additional vocabularies or position encoders. It consists of a vision encoder, an alignment layer, and an LLM, making it an all-in-one solution.
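To make the "coordinates as plain text" idea concrete, the sketch below shows one way such a scheme can work: bounding boxes are normalized to [0, 1], rendered as ordinary numbers inside the prompt, and recovered from the model's text output with a regular expression. The bracket syntax, three-decimal formatting, and helper names are illustrative assumptions, not Shikra's exact specification.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]

def box_to_text(box: Box) -> str:
    """Render a normalized bounding box as plain text, e.g. "[0.120,0.310,0.450,0.780]".
    No special tokens are needed, so a standard LLM vocabulary can express it."""
    return "[" + ",".join(f"{v:.3f}" for v in box) + "]"

def boxes_from_text(text: str) -> List[Box]:
    """Recover every bracketed 4-tuple of numbers from model output text."""
    number = r"([01](?:\.\d+)?)"
    pattern = r"\[\s*" + r"\s*,\s*".join([number] * 4) + r"\s*\]"
    return [tuple(float(g) for g in m.groups()) for m in re.finditer(pattern, text)]

# A referential-dialogue-style exchange (contents made up purely for illustration):
question = f"What is the animal {box_to_text((0.12, 0.31, 0.45, 0.78))} doing?"
answer = "It is a dog [0.120,0.310,0.450,0.780] chasing a ball [0.610,0.550,0.700,0.640]."
print(boxes_from_text(answer))  # [(0.12, 0.31, 0.45, 0.78), (0.61, 0.55, 0.7, 0.64)]
```

Because both the question and the answer are ordinary strings, the same LLM interface used for text-only dialogue can carry the spatial information without any architectural additions.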

Shikra can answer questions and provide justifications both verbally and spatially, grounding its responses in image coordinates. It performs tasks like Visual Question Answering (VQA), image captioning, and Referring Expression Comprehension (REC) with promising results. The researchers hope that their experiments will inspire further research in the field of MLLMs.
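As a point of reference, REC benchmarks conventionally count a prediction as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5. The helper below is a generic illustration of that metric, not code from the Shikra repository.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box is scored as a correct REC answer if iou(pred, gt) > 0.5.
print(iou((0.10, 0.30, 0.45, 0.80), (0.12, 0.31, 0.45, 0.78)) > 0.5)  # True
```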

The key contributions of this study are as follows:

– Introducing the concept of referential dialogue (RD) and highlighting its importance in human communication.
– Presenting Shikra as a unified and practical MLLM solution for RD, without the need for additional modules or models.
– Demonstrating that Shikra performs well on common vision-language tasks such as REC, PointQA, VQA, and image captioning. The code for Shikra is available on GitHub.

For more information, check out the research paper and the GitHub link provided.
