CodeVQA: A Framework for Visual Question Answering
Visual question answering (VQA) is a machine learning task in which a model must answer natural-language questions about images. Traditional VQA methods require large amounts of labeled training data, but recent advances in pre-training have led to few-shot and zero-shot VQA methods. However, these methods still struggle with certain types of questions and have largely been limited to reasoning over a single image.
To address these challenges, we introduce CodeVQA, a framework that uses program synthesis to answer visual questions. CodeVQA generates a Python program that processes one or more images, then executes that program to determine the answer. This approach improves accuracy on complex reasoning tasks.
How CodeVQA Works
CodeVQA uses a code-writing large language model (LLM), such as PaLM, to generate Python programs. We guide the LLM with a prompt that describes the available visual functions and includes in-context examples of visual questions paired with their Python code. The LLM then generates a program that answers the input question.
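The prompting step described above can be sketched as follows. This is a hypothetical illustration, not the actual CodeVQA prompt: the function names query and get_pos come from the framework, but the prompt wording, API description, and in-context examples here are invented for clarity.

```python
# Hypothetical sketch of assembling a CodeVQA-style prompt.
# The visual-function names (query, get_pos) appear in the article;
# the exact prompt text and examples below are illustrative only.

API_DESCRIPTION = """\
You can call these visual functions in your program:
  query(image, question) -> str          # answer a question about one image
  get_pos(image, object_name) -> (x, y)  # locate an object via GradCAM
"""

IN_CONTEXT_EXAMPLES = [
    (
        "Is there a dog in the photo?",
        'answer = query(image, "Is there a dog?")',
    ),
    (
        "Is the mug left of the laptop?",
        'mug_x, _ = get_pos(image, "mug")\n'
        'laptop_x, _ = get_pos(image, "laptop")\n'
        'answer = "yes" if mug_x < laptop_x else "no"',
    ),
]

def build_prompt(question: str) -> str:
    """Concatenate the API description, worked examples, and the new question."""
    parts = [API_DESCRIPTION]
    for q, code in IN_CONTEXT_EXAMPLES:
        parts.append(f"# Question: {q}\n{code}")
    parts.append(f"# Question: {question}\n")  # the LLM completes the code
    return "\n\n".join(parts)

prompt = build_prompt("Is the carrot to the left of the pot?")
```

The LLM's completion of this prompt is the Python program that CodeVQA then executes.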
The CodeVQA framework incorporates three visual functions: query, get_pos, and find_matching_image. The query function answers questions about a single image using a few-shot VQA method; get_pos locates objects in an image using GradCAM; and find_matching_image uses image and text embeddings to select the image that best matches a given phrase.
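To make the execution step concrete, here is a minimal sketch of running a generated program against stubbed visual functions. In the real framework, query is backed by a few-shot VQA model and get_pos by GradCAM; the stubs below return canned values purely so the control flow is runnable.

```python
# Stubbed visual functions (the real ones call a VQA model and GradCAM).

def query(image, question):
    # Stand-in for the few-shot VQA call.
    return {"What color is the mug?": "blue"}.get(question, "unknown")

def get_pos(image, object_name):
    # Stand-in for GradCAM localization: returns an (x, y) center.
    return {"mug": (40, 120), "laptop": (200, 110)}[object_name]

# The kind of program the LLM might generate for
# "Is the mug to the left of the laptop?":
generated_program = """
mug_x, _ = get_pos(image, "mug")
laptop_x, _ = get_pos(image, "laptop")
answer = "yes" if mug_x < laptop_x else "no"
"""

# Execute the generated code in a scope exposing the visual functions.
scope = {"query": query, "get_pos": get_pos, "image": None}
exec(generated_program, scope)
print(scope["answer"])  # -> yes
```

Executing generated code in a controlled scope like this keeps the visual functions as the only bridge between the program and the underlying vision models.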
Evaluation of CodeVQA on visual reasoning datasets shows consistent improvement over the baseline few-shot VQA method. CodeVQA’s accuracy is approximately 30% higher on spatial reasoning questions, 4% higher on “and” questions, and 3% higher on “or” questions compared to the baseline.
In multi-image questions, CodeVQA performs better as the number of input images increases. Breaking down the problem into single-image questions is beneficial in such cases.
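The multi-image case relies on find_matching_image to route each sub-question to the right image. A minimal sketch of such a function, assuming cosine similarity over precomputed embeddings (the embedding model itself is out of scope here, so the vectors below are hand-made toys):

```python
# Hypothetical sketch of find_matching_image: pick the image whose
# embedding is most similar to the phrase embedding. CodeVQA uses
# learned image/text embeddings; the vectors here are toy examples.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find_matching_image(image_embeddings, phrase_embedding):
    """Return the index of the image that best matches the phrase."""
    scores = [cosine(e, phrase_embedding) for e in image_embeddings]
    return scores.index(max(scores))

# Toy embeddings: the second image is closest to the phrase vector.
images = [[1.0, 0.0, 0.0], [0.1, 0.9, 0.2], [0.0, 0.0, 1.0]]
phrase = [0.0, 1.0, 0.0]
best = find_matching_image(images, phrase)  # -> 1
```

Once the matching image is selected, the remaining sub-questions reduce to single-image calls to query, which is why decomposition pays off as the number of input images grows.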
In summary, CodeVQA is a framework for few-shot visual question answering that uses code generation for multi-step visual reasoning. Future work includes expanding the set of modules and applying the framework to other visual tasks. While systems like CodeVQA should be deployed with caution given potential social biases, their interpretability and controllability may make them useful in production settings.
This research was a collaboration between UC Berkeley's Artificial Intelligence Research lab (BAIR) and Google Research. The team includes Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, and Dan Klein.