Introducing the ImageBind-LLM: A Multimodality Instruction-Following Model
Instruction tuning of large language models (LLMs) has advanced significantly. ChatGPT and GPT-4 are general-purpose conversational systems that can understand and follow human instructions in both language and vision, but they cannot be replicated because they are closed-source. In response, Alpaca, LLaMA-Adapter, and related projects adapt the publicly accessible LLaMA into language instruction models using self-generated data.
LLaVA, LLaMA-Adapter, and others have also integrated visual understanding capabilities into LLMs for image-conditioned generation, so they can follow instructions grounded in images. Although these instruction tuning techniques have been successful, there is still a need for an LLM that can handle instructions across multiple modalities, such as text, images, audio, 3D point clouds, and video.
In a recent study, researchers from the Shanghai Artificial Intelligence Laboratory, CUHK MMLab, and vivo AI Lab introduce ImageBind-LLM, a multimodality instruction-following model. The model effectively fine-tunes LLaMA using the joint embedding space of the pre-trained ImageBind. Unlike previous visual instruction models, which respond only to images, ImageBind-LLM can respond to instructions in various modalities, and it demonstrates promising extensibility and generalization capacity.
Thanks to ImageBind’s image-aligned multimodality embedding space, the researchers propose using vision-language data alone for multimodality instruction tuning. For an image-caption pair, they first extract the global image feature with ImageBind’s frozen image encoder, then transform that feature with a learnable bind network. The transformed image feature is applied to all word tokens in LLaMA, providing the visual context needed to generate the corresponding caption.
Unlike the zero-initialized attention used in the LLaMA-Adapter series, the visual injection mechanism is simple: the injected feature is weighted by a trainable gating factor that is initialized to zero. Because the gate starts at zero, ImageBind’s multimodality embeddings are introduced into LLaMA gradually during training without disturbing its original language understanding.
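The injection step described above can be illustrated with a toy sketch. It is not the authors' implementation: the single linear layer standing in for the bind network, the additive form of the injection, the vector sizes, and every function name here are assumptions made for illustration only.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector x (weight is a list of rows)."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def make_bind_network(in_dim, out_dim, seed=0):
    """Hypothetical stand-in for the learnable bind network: one linear layer."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def inject(word_tokens, image_feature, bind_params, gate):
    """Add the gated, transformed image feature to every word token.

    Assumption: the injection is additive and the gate is a single scalar.
    """
    weight, bias = bind_params
    visual = linear(image_feature, weight, bias)
    return [[t + gate * v for t, v in zip(token, visual)]
            for token in word_tokens]

# Toy sizes: a 4-d global image feature and three 3-d word tokens.
image_feature = [0.5, -1.0, 0.25, 2.0]
word_tokens = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5], [2.0, 2.0, 2.0]]
bind_params = make_bind_network(in_dim=4, out_dim=3)

# With the gate at its initial value of zero, the tokens are unchanged.
at_init = inject(word_tokens, image_feature, bind_params, gate=0.0)
# Once training has moved the gate away from zero, visual context flows in.
after_training = inject(word_tokens, image_feature, bind_params, gate=0.3)
```

With the gate at zero the language model sees exactly its original tokens, which is why the visual signal can be introduced gradually rather than all at once.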
Because ImageBind provides modality-specific encoders for text, images, audio, and video, ImageBind-LLM can follow instructions in these modalities after only basic vision-language training. The researchers also use the pre-trained 3D encoder in Point-Bind to handle instructions in 3D domains.
To bridge the modality gap between image-based training and text-, audio-, 3D-, or video-conditioned generation, the researchers propose a training-free visual cache approach for embedding augmentation during inference. It improves the quality of language responses to multimodal instructions by retrieving similar visual features from a cache model comprising millions of image features extracted by ImageBind.
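A minimal sketch of cache-based embedding augmentation follows. The article does not give the exact retrieval formula, so the cosine similarity, the similarity-weighted average, and the `top_k` and `blend` parameters are all assumptions chosen for illustration, and the tiny 3-d cache stands in for the millions of real image features.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def augment_with_cache(query, cache, top_k=2, blend=0.5):
    """Blend a query embedding with its top-k most similar cached image features.

    Assumption: retrieved features are averaged with similarity weights and
    mixed with the query using a fixed ratio; no parameters are trained.
    """
    scored = sorted(((cosine(query, feat), feat) for feat in cache),
                    key=lambda pair: pair[0], reverse=True)[:top_k]
    weights = [max(score, 0.0) for score, _ in scored]
    total = sum(weights)
    if total == 0.0:
        return list(query)  # nothing similar enough to retrieve
    retrieved = [sum(w * feat[i] for w, (_, feat) in zip(weights, scored)) / total
                 for i in range(len(query))]
    return [(1.0 - blend) * q + blend * r for q, r in zip(query, retrieved)]

# Toy cache of image features and a hypothetical audio embedding.
cache = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
audio_embedding = [0.8, 0.2, 0.0]
augmented = augment_with_cache(audio_embedding, cache, top_k=2, blend=0.5)
```

The augmented embedding is pulled toward the image features the model saw during training, which is the intuition behind using the cache to narrow the gap between training and inference.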
The researchers report that ImageBind-LLM outperformed earlier models in a variety of scenarios, demonstrating its multimodality instruction-following capabilities. They summarize its main characteristics as follows:
1. Multi-modality instructions: ImageBind-LLM can respond to inputs of various modalities, such as images, text, audio, 3D point clouds, and video.
2. Tuning efficiency: The researchers freeze ImageBind’s image encoder during training and use parameter-efficient approaches such as LoRA and bias-norm tuning to adjust a subset of the weights in LLaMA. They also train the zero-initialized gating factors and the additional bind network.
3. Attention-free zero-initialized injection: The researchers use a learnable gating method for progressive knowledge injection, adding the multimodality conditions directly to all word tokens in LLaMA rather than introducing additional instruction signals through attention layers.
4. Retrieval from a cross-modal cache: They introduce a visual cache model that performs cross-modality retrieval for embedding augmentation. This addresses the modality discrepancy between training (a single image) and inference (multiple modalities).
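The tuning-efficiency point can be sketched as a rule that decides which parameters are trained. The parameter names below are hypothetical and do not come from the released code; only the grouping (frozen ImageBind encoder, trainable LoRA, bias, norm, gate, and bind-network parameters) follows the description above.

```python
def is_trainable(name):
    """Decide whether a parameter (identified by a hypothetical name) is trained."""
    if name.startswith("imagebind."):
        return False  # ImageBind's image encoder stays frozen
    if name.startswith("bind_network.") or name.endswith(".gate"):
        return True   # extra bind network and zero-initialized gating factors
    if "lora_" in name:
        return True   # LoRA low-rank matrices
    if name.endswith(".bias") or "norm" in name:
        return True   # bias-norm tuning
    return False      # remaining LLaMA weights stay frozen

# Hypothetical parameter names, for illustration only.
parameter_names = [
    "imagebind.vision.blocks.0.attn.qkv.weight",
    "imagebind.vision.blocks.0.attn.qkv.bias",
    "bind_network.proj.weight",
    "llama.layers.0.gate",
    "llama.layers.0.attention.wq.weight",
    "llama.layers.0.attention.wq.lora_a",
    "llama.layers.0.attention.wq.bias",
    "llama.layers.0.attention_norm.weight",
]

trainable = [n for n in parameter_names if is_trainable(n)]
frozen = [n for n in parameter_names if not is_trainable(n)]
```

In this toy list only five of the eight parameters are updated, and the large dense weight of the attention projection is left untouched, which is the general idea behind parameter-efficient tuning.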
Interested in learning more about ImageBind-LLM? Check out the paper and GitHub. Credit for this research goes to the researchers working on this project.
About the Author:
Aneesh Tickoo, a consulting intern at MarktechPost, is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He dedicates most of his time to working on projects that harness the power of machine learning, with a specific interest in image processing. He loves collaborating with people on exciting projects.