CodeVQA: A Code Generation Approach for Few-Shot Visual Question Answering

Jimmy W.

July 7, 2023

Visual question answering (VQA) is a machine learning task that requires an AI model to answer questions about images. Traditionally, VQA methods have relied on large amounts of labeled training data in the form of human-annotated question-answer pairs. However, recent advances in pre-training have led to VQA methods that perform well with only a small number of training examples (few-shot), or even without any human-annotated VQA data.

Despite these advancements, there is still a performance gap between these methods and fully supervised VQA methods. Few-shot methods, in particular, struggle with complex reasoning tasks such as spatial reasoning, counting, and multi-hop reasoning. Additionally, few-shot methods have primarily focused on answering questions about single images.

To address these limitations and improve accuracy on complex VQA tasks, we introduce CodeVQA in our paper “Modular Visual Question Answering via Code Generation.” CodeVQA is a framework that uses program synthesis to answer visual questions. It generates Python programs (code) with simple visual functions that allow it to process images and determine the answer.

In the few-shot setting, CodeVQA outperforms previous methods by approximately 3% on the COVR dataset and 2% on the GQA dataset. CodeVQA uses a code-writing large language model (LLM), such as PaLM, to generate Python programs. We guide the LLM with a prompt that consists of descriptions of the visual functions and a small number of in-context examples, each pairing a visual question with its associated Python code.
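To make the prompt structure concrete, here is a minimal sketch of how such a prompt could be assembled. It is an illustration under stated assumptions, not the prompt used in the paper: the function signatures, the wording of the descriptions, the "# Question:" format, and the helper name build_prompt are all hypothetical.

```python
# Hypothetical sketch: the description text, signatures, and prompt layout
# are assumptions made for illustration, not the paper's actual prompt.
FUNCTION_DESCRIPTIONS = (
    "query(image, question): answer a question about a single image.\n"
    "get_pos(image, text): return the (x, y) position of the described object.\n"
    "find_matching_image(images, text): return the image that best matches the text.\n"
)

# Each in-context example pairs a visual question with a hand-written program.
IN_CONTEXT_EXAMPLES = [
    (
        "Is the cup to the left of the laptop?",
        'cup_x, _ = get_pos(img, "cup")\n'
        'laptop_x, _ = get_pos(img, "laptop")\n'
        'answer = "yes" if cup_x < laptop_x else "no"',
    ),
]


def build_prompt(question, examples=IN_CONTEXT_EXAMPLES):
    """Concatenate function descriptions, worked examples, and the new question."""
    parts = [FUNCTION_DESCRIPTIONS]
    for example_question, example_code in examples:
        parts.append(f"# Question: {example_question}\n{example_code}\n")
    # The prompt ends with the new question so the LLM continues with code.
    parts.append(f"# Question: {question}\n")
    return "\n".join(parts)


prompt = build_prompt("How many images contain a dog?")
```

The resulting string would be sent to the code-writing LLM, which is expected to continue it with a program for the final question.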

The CodeVQA framework can be instantiated using three visual functions: query, get_pos, and find_matching_image. The query function answers a question about a single image and is implemented with the few-shot Plug-and-Play VQA (PnP-VQA) method. The get_pos function is an object localizer that uses GradCAM to locate an object in the image from a text description. The find_matching_image function is used for multi-image questions and selects the image that best matches a given input phrase.
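The sketch below illustrates what programs built from these three functions might look like. The function bodies are toy stand-ins, not the real models: an "image" here is just a dictionary mapping object names to (x, y) positions, whereas the actual system would call PnP-VQA and GradCAM. The signatures and both example programs are assumptions made for illustration.

```python
# Toy stand-ins for the visual functions. In this sketch an "image" is a dict
# mapping object names to (x, y) centers; the real functions operate on pixels.

def query(image, question):
    """Stand-in for PnP-VQA: handles only 'Is there a <object>?' questions."""
    obj = question.rstrip("?").split()[-1]
    return "yes" if obj in image else "no"


def get_pos(image, text):
    """Stand-in for the GradCAM localizer: look up the object's position."""
    return image[text]


def find_matching_image(images, text):
    """Stand-in matcher: first image containing the phrase, else the first image."""
    for image in images:
        if text in image:
            return image
    return images[0]


# Program an LLM might write for "Is the cup to the left of the laptop?"
img = {"cup": (40, 120), "laptop": (200, 110)}
cup_x, _ = get_pos(img, "cup")
laptop_x, _ = get_pos(img, "laptop")
spatial_answer = "yes" if cup_x < laptop_x else "no"

# Program an LLM might write for "Is there a sofa in the image with the cat?"
images = [{"dog": (50, 60)}, {"cat": (10, 20), "sofa": (100, 90)}]
cat_image = find_matching_image(images, "cat")
multi_answer = query(cat_image, "Is there a sofa?")
```

The point of the design is that the spatial comparison and the image selection happen in ordinary Python, so the visual models only need to answer simple, local questions.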

These functions can be implemented with minimal annotation, making CodeVQA versatile and easily adaptable to different visual tasks. It can be extended to include other functions such as object detection, image segmentation, or knowledge base retrieval.

We evaluate CodeVQA on three visual reasoning datasets: GQA, COVR, and NLVR2. The results consistently show that CodeVQA improves over the baseline few-shot VQA method on all three datasets.

In conclusion, CodeVQA is a framework for few-shot visual question answering that uses code generation to perform multi-step visual reasoning. Future work includes expanding the set of modules used and extending the framework to other visual tasks. A system like CodeVQA should be deployed with caution because of potential biases, but it offers greater interpretability and controllability than monolithic models.
