Instruction tuning improves the ability of AI models to generalize to unseen tasks, a capability that has contributed to the popularity of chatbots such as ChatGPT. Recently, visual encoders such as CLIP-ViT have been added to conversational agents, enabling human-agent interaction grounded in images. However, these models struggle to comprehend text within images, likely because their training data consists mostly of natural imagery. Reading text is a crucial part of everyday human visual perception. Fortunately, OCR techniques can be used to recognize the words that appear in images.
To address this issue, the researchers propose feeding OCR-recognized text to visual instruction-tuned models as additional input. They collected 422K noisy instruction-following examples by combining manually written instructions with OCR results from text-rich images. Doing so significantly improved the alignment between the visual features and the language decoder, which is essential for effective instruction following.
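The pairing described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual data pipeline: the field names (`image`, `conversations`, `from`, `value`) and the helper `build_example` are assumptions chosen for clarity.

```python
# Hypothetical sketch: build one noisy instruction-following example by
# pairing a human-written instruction with OCR output from a text-rich
# image. The schema below is illustrative, not the paper's actual format.

def build_example(image_id: str, instruction: str, ocr_tokens: list) -> dict:
    """Pair an instruction with OCR-recognized text from an image.

    The OCR tokens serve as the target response: the model learns to
    read out the text appearing in the image. Recognition errors are
    what make this supervision "noisy".
    """
    return {
        "image": image_id,
        "conversations": [
            {"from": "human", "value": instruction},
            {"from": "assistant", "value": " ".join(ocr_tokens)},
        ],
    }

example = build_example(
    "poster_00042.jpg",  # hypothetical image filename
    "Identify any text visible in the image provided.",
    ["GRAND", "OPENING", "JUNE", "5"],
)
print(example["conversations"][1]["value"])  # GRAND OPENING JUNE 5
```

Scaling this construction over hundreds of thousands of text-rich images yields the kind of noisy corpus the summary describes.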
In their experiments, the researchers developed LLaVAR, which stands for Large Language and Vision Assistant that Can Read. Compared with the original LLaVA, they scaled the input resolution from 224×224 to 336×336 to better capture fine textual details. Evaluation results on various text-based VQA datasets, along with ScienceQA finetuning outcomes, demonstrated the effectiveness of their approach. They also conducted qualitative analysis on different types of images, such as posters, website screenshots, and tweets, to evaluate the model's ability to follow instructions in more complex scenarios.
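A quick back-of-the-envelope calculation shows why the resolution increase matters for reading small text: a ViT splits the image into fixed-size patches, so a larger input yields more patches, and each patch covers a smaller region of the image. The 14-pixel patch size below is an assumption based on the commonly used CLIP-ViT-L/14 encoder, not a figure stated in this summary.

```python
# Illustrative calculation (not from the paper's code): how raising the
# input resolution from 224x224 to 336x336 changes the number of ViT
# patches, assuming a CLIP-ViT-L/14-style 14-pixel patch size.

PATCH_SIZE = 14  # assumption: CLIP-ViT-L/14 patch size

def num_patches(resolution: int, patch: int = PATCH_SIZE) -> int:
    """Number of non-overlapping patches for a square input image."""
    side = resolution // patch
    return side * side

print(num_patches(224))  # 256 patches (16 x 16 grid)
print(num_patches(336))  # 576 patches (24 x 24 grid)
```

At 336×336 the encoder produces 2.25× as many patch tokens, each covering a smaller image region, which helps resolve minute textual features that a coarser grid would blur together.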
The contributions of this research include gathering 16K high-quality and 422K noisy instruction-following examples, both of which improve visual instruction tuning. The researchers also made the training and evaluation data, as well as the model checkpoints, publicly available.
In summary, combining OCR techniques with visual instruction tuning enables models like LLaVAR to read and comprehend text within images. This advancement improves performance on diverse online material that mixes text and images, while also strengthening the model's instruction-following ability. The researchers have provided valuable datasets and resources to the AI community, supporting further advances in this field.