Improving Multimodal Language Models through Frozen Retrieval and Contrastive Learning

Large language models (LLMs) can generate human-like discourse and answer complex questions. However, most current LLMs are trained only on text from the internet, which limits their grounding in the real world and leaves them unable to process visual input. This article explores how a frozen LLM can be adapted for multimodal input and output that combines text and images.

To enable this capability, the model learns a new token called [RET], which stands in for an image during image-text retrieval. Contrastive learning pulls the [RET] embedding of a caption toward the visual embedding of its corresponding image and pushes it away from the embeddings of other images. During training, only the weights of the linear layers and the [RET] token embedding are updated; the rest of the model stays frozen, which keeps the approach efficient in memory and computation.
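The contrastive objective described above can be sketched in a few lines. This is a minimal illustration, assuming a symmetric InfoNCE-style loss over a batch of paired caption and image embeddings; the function names, the temperature value of 0.07, and the toy vectors are assumptions made for the example and are not taken from the FROMAGe code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(ret_embs, image_embs, temperature=0.07):
    """Symmetric contrastive loss: caption-to-image plus image-to-caption.

    ret_embs[i] is the (hypothetical) [RET] embedding of caption i and
    image_embs[i] is the visual embedding of its paired image.
    """
    text = [l2_normalize(v) for v in ret_embs]
    images = [l2_normalize(v) for v in image_embs]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(t, im)) / temperature for im in images]
        for t in text
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits)
        + cross_entropy_diagonal(logits_transposed)
    )

# Toy batch: two captions and two images in a 2-dimensional embedding space.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each caption's embedding is closest to its own image and high when the pairs are mismatched. In a real training setup, this gradient signal would only reach the trainable linear layers and the [RET] embedding.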

Once trained, the model demonstrates several new skills. It can engage in multimodal conversation and reasoning, in addition to generating text like a traditional text-only LLM. This approach is model-independent and can be applied to future versions of LLMs to enhance their capabilities.

The Frozen Retrieval Over Multimodal Data for Autoregressive Generation (FROMAGe) model shows how an autoregressive LLM can perform text-to-image retrieval. FROMAGe achieves strong results after training only on image-caption pairs, without needing a large-scale dataset of interleaved image-text data. It outperforms previous models on lengthy and complex free-form text.

FROMAGe demonstrates three main skills. First, it can retrieve images conditioned on sequences of interleaved images and text, which is known as contextual image retrieval. Second, it performs well on visual dialogue tasks zero-shot, that is, without being trained on them. Lastly, it is more sensitive to discourse context when retrieving images.
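At inference time, retrieval of this kind can be pictured as a nearest-neighbor lookup. The sketch below assumes the model has already produced a [RET] embedding for the conversation so far and that candidate images have precomputed embeddings in the same space; the function names, file names, and vectors are hypothetical and only illustrate ranking by cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_images(ret_embedding, image_index, top_k=1):
    """Return the ids of the top_k images most similar to the [RET] embedding.

    image_index maps an image id to its (hypothetical) precomputed embedding.
    """
    scored = [
        (cosine_similarity(ret_embedding, emb), image_id)
        for image_id, emb in image_index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:top_k]]

# Toy index of three candidate images in a 3-dimensional embedding space.
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Stand-in for the [RET] embedding produced from the dialogue context.
query = [0.8, 0.2, 0.1]
print(retrieve_images(query, image_index, top_k=2))
```

Because the query embedding is computed from the whole preceding sequence, the same final sentence can retrieve different images depending on the context that came before it.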

This research opens up opportunities for models that can learn from and generate coherent sequences of image-text pairs. It also showcases the potential of pretrained text-only LLMs for visually based tasks. The code and pretrained models will be made available to the public to encourage further research and development.

For more information, you can read the paper and explore the project and GitHub repository.

(Source: MarkTechPost)

