Large Language Models (LLMs) have revolutionized natural language understanding in recent years, demonstrating impressive abilities in semantic comprehension, question answering, and text generation, especially in zero-shot and few-shot settings. For vision tasks, several strategies have been proposed, as shown in Figure 1(a). One trains a vision encoder to represent images as continuous embeddings that the LLM can consume. A second pairs a contrastively trained frozen vision encoder with new layers inserted into a frozen LLM and trained from scratch. A third trains a lightweight transformer to align a frozen visual encoder with a frozen LLM. Despite this progress, the computational cost of the additional pretraining stage(s) remains hard to justify, and aligning the visual and language modalities with an existing LLM requires large databases of text, images, and videos.
Flamingo, for example, inserts new cross-attention layers into a pretrained LLM to incorporate visual features. As shown in Figure 1, there are two options for multimodal pretraining. The first relies on a paired or web dataset; the second is LENS, a pretraining-free technique that can be applied to any off-the-shelf LLM without extra multimodal datasets. Previous approaches required joint alignment pretraining on extensive multimodal datasets to tackle visual tasks. This pretraining stage demands a massive number of image-text pairs and interleaved web documents and can take up to 15 days, even starting from a pretrained image encoder and a pretrained frozen LLM. In contrast, by utilizing “vision modules,” researchers can extract information from visual inputs and generate detailed textual representations such as tags, attributes, actions, and relationships. These representations can then be fed directly to the LLM, eliminating the need for additional multimodal pretraining.
Contextual AI and Stanford University researchers introduce LENS (Large Language Models Enhanced to See), a modular strategy that combines an LLM serving as the “reasoning module” with separate “vision modules.” The LENS technique first extracts rich textual information using pretrained vision modules, such as contrastive models and image-captioning models. This text is then processed by the LLM, enabling it to perform object recognition and vision-and-language tasks. LENS bridges the gap between the modalities without any additional multimodal pretraining stages or data. The result is a model that can operate across domains without extra cross-domain pretraining, and that can immediately take advantage of the latest advances in computer vision and natural language processing, maximizing the benefits of both disciplines.
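The pipeline described above can be sketched in a few lines: the vision modules emit plain text (tags, attributes, captions), which is assembled into a single prompt for a frozen, off-the-shelf LLM. The sketch below is illustrative, not the authors' implementation; the hard-coded lists stand in for the outputs of real vision modules (e.g., a CLIP-style contrastive tagger and a BLIP-style captioner), and the prompt layout is an assumption.

```python
# Minimal sketch of a LENS-style pipeline. The vision-module outputs
# below are hard-coded placeholders; in practice they would come from
# pretrained models such as a contrastive tagger (tags/attributes) and
# an image-captioning model (captions). The prompt template is a
# plausible assumption, not the paper's exact format.

def lens_prompt(tags, attributes, captions, question):
    """Assemble textual visual descriptions plus the task question
    into one prompt for a frozen, off-the-shelf LLM."""
    return (
        "Tags: " + ", ".join(tags) + "\n"
        + "Attributes: " + ", ".join(attributes) + "\n"
        + "Captions: " + " ".join(captions) + "\n"
        + f"Question: {question}\nShort Answer:"
    )

# Placeholder vision-module outputs for a single image.
tags = ["dog", "frisbee", "park"]
attributes = ["brown dog", "green grass"]
captions = ["A brown dog leaps to catch a frisbee in a park."]

prompt = lens_prompt(tags, attributes, captions, "What is the dog catching?")
print(prompt)  # This string would be sent to any off-the-shelf LLM.
```

Because everything crossing the modality boundary is ordinary text, no alignment pretraining is needed: swapping in a better tagger, captioner, or LLM requires no retraining of the pipeline.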
The contributions of this research are as follows:
– Presentation of LENS, a modular approach that tackles computer vision challenges using the few-shot, in-context learning capabilities of language models through natural language descriptions of visual inputs.
– LENS enables any off-the-shelf LLM to possess visual understanding without the need for further training or data.
– Frozen LLMs are utilized to handle object recognition and visual reasoning tasks without additional vision-and-language alignment or multimodal data. Experimental results demonstrate that this approach achieves competitive or superior zero-shot performance compared to end-to-end pre-trained models like Kosmos and Flamingo.
A partial implementation of this research paper is available on GitHub. You can also check out the full paper, demo, and blog via the provided links.
– Aneesh Tickoo: A consulting intern at MarktechPost, actively working on machine learning projects and research in the field of image processing. Connect with him for collaborations and interesting projects.