Large language models (LLMs) have recently made impressive advances in natural language processing. Models such as ChatGPT, Claude, Bard, text-only GPT-4, and community open-source projects like LLaMA, Alpaca, Vicuna, ChatGLM, and MOSS show potential as general-purpose artificial intelligence systems. This has spurred the development of multimodal models that use an LLM as a universal interface, aligning the feature space of a specific task with that of the pre-trained language model.
One area of focus is vision-and-language models such as MiniGPT-4, LLaVA, LLaMA-Adapter, and InstructBLIP, which align a vision encoder to the LLM by tuning on image-text pairs. Because this alignment happens only at the image level, however, these models struggle with finer-grained tasks such as region captioning and region-level reasoning. Some systems, such as MM-REACT, InternGPT, and DetGPT, bolt on external vision models to improve region-level comprehension, but such pipeline designs fall short of an all-purpose multimodal model.
To address these limitations, the authors aim to build a vision-language model that offers fine-grained comprehension of regions of interest. The model combines spatial instructions, which refer to regions, with language instructions to guide the LLM in understanding visual elements. Spatial instruction is implemented with two flexible methods: RoIAlign and deformable attention. Datasets including COCO object detection, RefCOCO, RefCOCO+, RefCOCOg, Flickr30K Entities, Visual Genome (VG), and Visual Commonsense Reasoning (VCR) are reformatted for instruction tuning. In addition, an off-the-shelf object detector can supply the boxes used as spatial instructions.
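To make the RoIAlign option concrete, here is a minimal sketch of the operation: given a box on a feature map, it bilinearly samples a fixed-size grid of points inside the box, producing a region feature of constant shape regardless of box size. This is a simplified illustration (one sample per bin, NumPy only), not the paper's actual implementation; the function name and shapes are assumptions for the example.

```python
import numpy as np

def roi_align(feature_map, box, output_size):
    """Minimal RoIAlign sketch: bilinearly sample an output_size grid of
    points from feature_map (C, H, W) inside box = (x1, y1, x2, y2),
    given in feature-map coordinates. One sample per bin for simplicity."""
    c, h, w = feature_map.shape
    x1, y1, x2, y2 = box
    oh, ow = output_size
    out = np.zeros((c, oh, ow), dtype=feature_map.dtype)
    for i in range(oh):
        for j in range(ow):
            # Sample at the centre of each output bin.
            y = y1 + (i + 0.5) * (y2 - y1) / oh
            x = x1 + (j + 0.5) * (x2 - x1) / ow
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            y1i, x1i = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            # Bilinear interpolation over the four nearest grid cells.
            out[:, i, j] = (
                feature_map[:, y0, x0] * (1 - dy) * (1 - dx)
                + feature_map[:, y0, x1i] * (1 - dy) * dx
                + feature_map[:, y1i, x0] * dy * (1 - dx)
                + feature_map[:, y1i, x1i] * dy * dx
            )
    return out

# A 4-channel feature map and one box yield a fixed-size (4, 7, 7) region feature.
fmap = np.random.rand(4, 16, 16)
region_feature = roi_align(fmap, (2.0, 3.0, 10.0, 12.0), (7, 7))
```

The resulting fixed-size region feature can then be projected into the LLM's embedding space and spliced into the instruction sequence in place of a region placeholder token.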
Training proceeds in two stages: the region feature extractor is first pre-trained on short-text data conveying basic region information, then fine-tuned together with the LLM on longer texts that require reasoning. This end-to-end fine-tuning of the region feature extractor and the LLM enables a unique interactive experience for users.
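The two-stage recipe above can be sketched as a simple schedule of which modules receive gradient updates in each stage. The module names and the choice to keep the vision encoder frozen are assumptions for illustration; in a real implementation these flags would toggle `requires_grad` on the corresponding parameter groups.

```python
def stage_trainable(stage):
    """Hypothetical two-stage schedule: which modules are updated per stage
    (an assumption based on the pre-train/fine-tune recipe described above)."""
    if stage == 1:
        # Stage 1: align region features with short caption-style text;
        # only the region feature extractor is updated.
        return {"vision_encoder": False, "region_extractor": True, "llm": False}
    if stage == 2:
        # Stage 2: end-to-end fine-tuning on longer reasoning-style text;
        # the region feature extractor and the LLM are updated jointly.
        return {"vision_encoder": False, "region_extractor": True, "llm": True}
    raise ValueError(f"unknown stage: {stage}")
```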
In conclusion, this work advances region-level vision-language models by training an LLM on region-text datasets, adding capabilities such as region captioning and region-level reasoning. Through spatial instructions, the model can refer to specific regions of interest and produce more accurate responses. The code, instruction-tuning datasets, and an online demo are available on GitHub.
To learn more about the vision-language model, called GPT4RoI, check out the paper and GitHub link.
Aneesh Tickoo, a consulting intern at MarktechPost, is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He specializes in image processing and enjoys collaborating on interesting projects.