OpenFlamingo v2: Enhanced Models for Multimodal Text and Image Processing

The University of Washington, Stanford, AI2, UCSB, and Google researchers have developed the OpenFlamingo project. This project aims to create models similar to DeepMind’s Flamingo team. These OpenFlamingo models can handle mixed text and image sequences to produce text as an output. Activities like captioning, visual question answering, and image classification can benefit from this model’s capabilities in taking samples in context.

The team has now released v2 of the OpenFlamingo project, which includes five trained models at the 3B, 4B, and 9B levels. These models are derived from open-source models with less restrictive licenses compared to LLaMA, such as Mosaic’s MPT-1B and 7B and Together.XYZ’s RedPajama-3B.

To create these models, the researchers followed the Flamingo modeling paradigm. They added visual characteristics to the existing layers of a pretrained static language model. While the vision encoder and language model remain static, the connecting modules are trained using web-scraped image-text sequences, similar to Flamingo.

The team tested their captioning, visual question answering, and classification models on vision-language datasets. Their findings indicate significant progress between the v1 release and the OpenFlamingo-9B v2 model.

To evaluate the efficacy of the models, the team combined results from seven datasets and tested them in five different contexts: no shots, four shots, eight shots, sixteen shots, and thirty-two shots. Comparing OpenFlamingo (OF) models at the OF-3B and OF-4B levels to Flamingo-3B and Flamingo-9B models, they found that, on average, OpenFlamingo achieves over 80% of matching Flamingo performance. The researchers also compared their results to the optimized state-of-the-art models published on PapersWithCode. The OpenFlamingo-3B and OpenFlamingo-9B models, pretrained only on online data, achieved over 55% of fine-tuned performance with 32 in-context instances. However, OpenFlamingo’s models lag behind DeepMind’s models by an average of 10% in the 0-shot context and 15% in the 32-shot context.

The team is continuously making progress in training and delivering state-of-the-art multimodal models. Their next goal is to enhance the quality of the data used for pre-training.

For more information about the OpenFlamingo project and its v2 release, you can check out the Github repository and the blog post.

Don’t forget to join the ML SubReddit, Discord Channel, and the Email Newsletter to stay updated with the latest AI research news and cool AI projects. If you have any questions or if we missed anything, feel free to email us at

Source link

Stay in the Loop

Get the daily email from AI Headliner that makes reading the news actually enjoyable. Join our mailing list to stay in the loop to stay informed, for free.

Latest stories

You might also like...