Large Vision-Language Models (LVLMs): Enhancing Scientific Understanding

Large Vision-Language Models (LVLMs) pair a Large Language Model (LLM) with a powerful vision encoder. Models like GPT-4 have shown exceptional skill on tasks involving real-world images, a major advance in AI. These hybrid models combine perceptive and cognitive abilities in a way that resembles human cognition.

Although LVLMs excel in many areas, they struggle with abstract concepts, especially in fields such as physics and mathematics that demand abstract reasoning. This limitation is attributed to limited exposure to specialized training data, particularly the abstract figures found in scientific literature. The gap hampers the models’ ability to comprehend and reason about scientific content.

To address this, researchers have introduced Multimodal ArXiv, an effort to improve LVLMs’ understanding of scientific material. It draws on the arXiv repository, known for its scholarly preprints across many scientific disciplines. Its first component, ArXivCap, is a dataset of real academic figures and their captions sourced from more than 572,000 publications. With 6.4 million images and 3.9 million captions, it aims to strengthen LVLMs’ scientific reasoning abilities.

In addition, a collection of 100,000 multiple-choice question-answer pairs has been created specifically for figures in ArXivCap. This component, ArXivQA, is designed to improve LVLMs’ ability to reason scientifically. Evaluations show significant performance gains, which supports the value of domain-specific training for LVLMs.
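To make the multiple-choice format concrete, the following is a minimal Python sketch of how one question about a figure might be represented and how accuracy over a set of such questions could be scored. The `FigureQA` record, its field names, and the `accuracy` helper are hypothetical illustrations, not the actual ArXivQA schema or the evaluation code used by the researchers.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class FigureQA:
    """Hypothetical record layout; NOT the actual ArXivQA schema."""
    figure_path: str           # path to the figure image (assumed field)
    question: str              # question asked about the figure
    options: Tuple[str, ...]   # candidate answers shown to the model
    answer_index: int          # position of the correct option in `options`


def accuracy(items: Sequence[FigureQA], predictions: Sequence[int]) -> float:
    """Return the fraction of items whose predicted option index is correct."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


# Illustrative, made-up records (not taken from the dataset).
sample = [
    FigureQA("fig_001.png", "Which curve grows fastest?", ("A", "B", "C", "D"), 2),
    FigureQA("fig_002.png", "What does the x-axis show?", ("Time", "Energy"), 0),
]
print(accuracy(sample, [2, 1]))  # one of two predictions matches
```

Scoring by option index rather than by answer text keeps the comparison exact, which is one reason a multiple-choice format is convenient for evaluating models automatically.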

Despite these improvements, LVLMs still struggle to interpret and describe scientific figures accurately. Manual evaluations reveal errors in both visual understanding and caption generation. These findings point to directions for future work on LVLMs’ understanding of scientific content.

