Large Vision Language Models (LVLMs) have been a major focus of artificial intelligence research over the past year. Given suitable prompts, these models perform well across a wide range of tasks. However, their ability to perceive images still has room for improvement.
Better image perception is essential for advancing the development and deployment of LVLMs. Two primary challenges hinder progress: the limitations of current vision vocabulary networks and the high computational cost of optimizing a large number of parameters.
Popular LVLMs excel at tasks that combine Computer Vision (CV) and Natural Language Processing (NLP), such as image captioning, Visual Question Answering (VQA), meme understanding, and scene OCR. They achieve this performance by relying on a powerful vision vocabulary network such as CLIP. However, this design ties the model’s potential to how efficiently that vision vocabulary network encodes visual signals.
To address this, researchers have proposed scaling up the vision vocabulary of LVLMs: a new vision vocabulary network is trained with a small auto-regressive model, such as OPT-125M, and then merged with the existing vocabulary. However, this approach has drawbacks, including wasted network capacity and high iteration costs.
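The merge step described above can be pictured with a small sketch. The article does not say how the two vocabularies are combined, so the code below assumes the simplest option: both encoders emit the same number of image tokens, and their per-token features are concatenated along the channel axis. The function name `merge_vocabularies` and the toy feature sizes are hypothetical and chosen only for illustration.

```python
from typing import List

Token = List[float]  # one image token, represented as a feature vector


def merge_vocabularies(old_tokens: List[Token], new_tokens: List[Token]) -> List[Token]:
    """Concatenate per-token features from two vision encoders.

    Assumption: "merging" means channel-wise concatenation of aligned tokens.
    """
    if len(old_tokens) != len(new_tokens):
        raise ValueError("both encoders must emit the same number of tokens")
    # Token i of the result carries the old features followed by the new ones.
    return [old + new for old, new in zip(old_tokens, new_tokens)]


# Toy stand-ins: 4 tokens, with 3 channels from the existing encoder
# and 2 channels from the newly trained vocabulary network.
clip_like = [[0.1 * i] * 3 for i in range(4)]
new_vocab = [[1.0 * i] * 2 for i in range(4)]

merged = merge_vocabularies(clip_like, new_vocab)
print(len(merged), len(merged[0]))  # token count, channels per token
```

Under this assumption the token count stays fixed while the channel width grows, so the language model's input projection would need to accept the wider features.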
In response to these challenges, researchers at MEGVII Technology introduced Vary-toy, a smaller LVLM with an improved vocabulary creation process. Vary-toy integrates object detection tasks into the training of the vocabulary network, combining dense textual data (PDF) with natural object location data to make the vocabulary more universal. The researchers report strong results for the model on a range of challenging benchmarks despite its reduced size.
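One way an auto-regressive vocabulary network can learn from object location data is to serialize each bounding box as text, so that detection samples look like the dense-text samples it already predicts. The sketch below illustrates that idea only; the output format, the 1000-bin coordinate grid, and the helper name `box_to_text` are assumptions made for this example and are not taken from the Vary-toy paper.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def box_to_text(label: str, box: Box, width: int, height: int, bins: int = 1000) -> str:
    """Serialize a labeled pixel box as a text target with normalized coordinates."""

    def quantize(value: int, size: int) -> int:
        # Map a pixel coordinate to an integer bin in [0, bins - 1].
        return min(bins - 1, max(0, int(value * bins // size)))

    x1, y1, x2, y2 = box
    return (
        f"{label}: ({quantize(x1, width)},{quantize(y1, height)}),"
        f"({quantize(x2, width)},{quantize(y2, height)})"
    )


# A hypothetical detection sample on a 640x480 image.
print(box_to_text("dog", (64, 48, 320, 240), width=640, height=480))
# prints "dog: (100,100),(500,500)"
```

Normalizing to a fixed grid keeps the text targets independent of image resolution, which is a common design choice when mixing detection data with text-heavy data.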
Vary-toy’s impressive results and compact size make it accessible for researchers with limited resources, serving as a practical baseline for LVLM research. The researchers plan to publicly release the code for further exploration and adoption within the research community.