Visual Language and AI: Enhancing Communication and Understanding Visual Information
Visual language, which relies on pictorial symbols to convey information, is a fundamental part of our digital and real-world communication. It includes icons, infographics, charts, and more. While AI has made significant progress in understanding natural images, visual language has received less attention due to the lack of large-scale training sets. However, new academic datasets have been created to evaluate question-answering systems on visual language images.
The Challenges in Answering Questions on Visual Language
Answering questions on visual language poses unique challenges that existing models are ill-prepared to handle. These challenges include reading the height of bars or the angle of slices in charts, understanding axis scales, mapping pictograms to their legend values, and performing numerical operations with the extracted numbers.
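To make the last challenge concrete, here is a minimal sketch of the numerical step in isolation. The bar heights below are hypothetical values that a model would first have to read off a chart image before any arithmetic is possible:

```python
# Hypothetical values a model must first read off a bar chart, then
# combine numerically to answer a question such as
# "How much higher is the 2021 bar than the 2020 bar?"
bar_heights = {"2020": 41.5, "2021": 53.8}  # assumed extracted values

difference = bar_heights["2021"] - bar_heights["2020"]
ratio = bar_heights["2021"] / bar_heights["2020"]

print(f"difference: {difference:.1f}")
print(f"ratio: {ratio:.2f}")
```

The arithmetic itself is trivial; the difficulty lies in recovering `bar_heights` from pixels accurately enough for the downstream computation to be correct.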
Introducing MatCha: Enhancing Visual Language Pretraining
To address these challenges, we propose “MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering.” MatCha is a pixels-to-text foundation model trained on two tasks: chart derendering and math reasoning. In chart derendering, the model generates the underlying data table or rendering code from a chart image. In math reasoning, textual numerical reasoning datasets are rendered as images, and the model learns to decode the answers.
Introducing DePlot: One-shot Visual Language Reasoning
We also introduce “DePlot: One-shot visual language reasoning by plot-to-table translation,” which is built on top of MatCha. DePlot enables one-shot reasoning on charts by translating them into tables.
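A plot-to-table translation has to produce the table as plain text that a language model can consume. The helper below is a sketch of one plausible linearization (the function name and the pipe-separated format are assumptions, not DePlot's actual output format):

```python
def linearize_table(header, rows):
    """Flatten a parsed chart table into a pipe-separated string
    that a text-only language model can reason over."""
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)

table_text = linearize_table(
    ["Year", "Sales"],
    [[2020, 41.5], [2021, 53.8]],
)
print(table_text)
```

Once the chart is in this textual form, any downstream reasoning reduces to question answering over a small table.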
Chart Derendering: Understanding Charts from Data Tables and Code
Charts are generated from underlying data tables and code. To understand a chart, one needs to parse and group visual patterns to extract key information. MatCha’s pre-training task involves recovering the source table or code from a chart image. We collect pre-training data by gathering [chart, code] and [chart, table] pairs from GitHub IPython notebooks and web-crawled sources.
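A single pre-training pair might look like the sketch below (the exact pair format is an assumption for illustration): the chart image would be rendered from the plotting code, and the model is trained to emit either the code or the flattened table when given only the pixels:

```python
# A minimal sketch of one derendering pre-training pair. The chart image
# (not shown) would be produced by running `plotting_code`; the model's
# decoding targets are the code string or the flattened table.
table = [("Year", "Sales"), ("2020", "41.5"), ("2021", "53.8")]

# Plotting code that would render the chart (one possible target).
plotting_code = (
    "import matplotlib.pyplot as plt\n"
    "plt.bar(['2020', '2021'], [41.5, 53.8])\n"
    "plt.ylabel('Sales')\n"
)

# Flattened table (the alternative target).
table_target = "\n".join(" | ".join(row) for row in table)
print(table_target)
```

Because code and tables fully determine the rendered chart, recovering either one forces the model to parse axes, bars, and labels rather than memorize surface patterns.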
Math Reasoning: Incorporating Numerical Reasoning into MatCha
We incorporate numerical reasoning skills into MatCha by pre-training it on two existing textual math reasoning datasets: MATH and DROP. MATH contains synthetically created questions, while DROP is a reading-comprehension QA dataset. By rendering the input text as images, MatCha learns to perform numerical computation and extract information.
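The shape of one such rendered example might look like the following sketch (the passage, question, and field names are invented for illustration; the actual rasterization of text into pixels is elided):

```python
# Assumed shape of one math-reasoning pre-training example: the passage
# and question are rendered into an image, and the model must decode the
# answer as text. The text -> pixels rasterization step is elided here.
passage = "The team scored 14, 21, and 7 points."
question = "How many points did the team score in total?"

# The answer string the model is trained to emit after seeing only the
# rendered image of the passage and question.
answer = str(14 + 21 + 7)

example = {
    "input_text": f"Passage: {passage} Question: {question}",
    "answer": answer,
}
print(example["answer"])
```

The key point is that the model never receives `input_text` as tokens; it sees only the rendered image, so reading and reasoning must both happen from pixels.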
End-to-End Results: Surpassing Previous State of the Art
We fine-tune the pre-trained MatCha model on visual language tasks such as question answering and summarization. MatCha outperforms previous models and achieves state-of-the-art performance on chart QA benchmarks and chart summarization tasks. It surpasses models with far more parameters on QA tasks and performs comparably on summarization.
Addressing Complex Reasoning with Derendering and Large Language Model Chains
For complex reasoning tasks, fine-tuned MatCha models may still struggle. To overcome this, we propose a two-step method: 1) a model reads the chart image and generates the corresponding table or code, and 2) a large language model performs complex reasoning on the generated table or code. This plot-to-table-plus-LLM chain is the approach behind DePlot.
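The two steps above can be sketched as the pipeline below. Both functions are stubs standing in for the real models: `derender_chart` returns a fixed table instead of calling an image-to-text model, and `llm_reason` parses the table directly instead of calling a language model:

```python
# A minimal sketch of the two-step pipeline; both functions are
# placeholders for the real models.
def derender_chart(chart_image):
    """Step 1: an image-to-text model emits the chart's underlying
    table as text. Stubbed here with a fixed table."""
    return "Year | Sales\n2020 | 41.5\n2021 | 53.8"

def llm_reason(table_text, question):
    """Step 2: a large language model answers the question from the
    table. Stubbed with a direct lookup for illustration."""
    prompt = f"{table_text}\n\nQ: {question}\nA:"
    # a real system would send `prompt` to an LLM; here we parse the
    # table ourselves and compute the answer directly
    rows = [line.split(" | ") for line in table_text.splitlines()[1:]]
    values = {year: float(v) for year, v in rows}
    return f"{values['2021'] - values['2020']:.1f}"

table = derender_chart(chart_image=None)
print(llm_reason(table, "How much did sales grow from 2020 to 2021?"))
```

Splitting the task this way lets the vision model do only perception (pixels to table), while all multi-step reasoning is delegated to a model that is already strong at it.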
In conclusion, the development of MatCha and DePlot has greatly advanced the understanding and utilization of visual language. These models enhance scientific communication, accessibility, and data transparency. By improving AI’s ability to comprehend visual information, we can unlock new possibilities for various applications.