Article Title: AVIS: Empowering Language Models to Seek Visual Information
In recent years, researchers have made significant progress in adapting large language models (LLMs) to tasks with multimodal inputs, such as image captioning, visual question answering (VQA), and open-vocabulary recognition. However, even state-of-the-art visual language models (VLMs) struggle to answer visual information-seeking queries that require external knowledge. To address this gap, Google Research introduces AVIS (Autonomous Visual Information Seeking with Large Language Models), a method that achieves state-of-the-art results on visual information-seeking tasks.
Integration of LLMs with Tools:
AVIS utilizes LLMs in combination with three types of tools:
1. Computer vision tools: These tools extract visual information from images.
2. Web search tool: This tool retrieves open-world knowledge and facts.
3. Image search tool: This tool extracts relevant information from metadata associated with visually similar images.
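To make the three tool categories concrete, here is a minimal sketch of how they might be wrapped behind a common interface. The function names, signatures, and placeholder return values are hypothetical, not the actual AVIS API.

```python
# Hypothetical wrappers for the three tool categories AVIS combines.
# The bodies are placeholders; in a real system each would call a
# vision model (e.g., a VLM such as PaLI), a search API, and a
# reverse-image-search service, respectively.

def vision_tool(image_path: str) -> str:
    """Extract visual information (captions, detected objects) from an image."""
    return f"caption for {image_path}"  # placeholder for a computer vision call

def web_search_tool(query: str) -> str:
    """Retrieve open-world knowledge and facts for a text query."""
    return f"search results for '{query}'"  # placeholder for a web search call

def image_search_tool(image_path: str) -> str:
    """Return metadata associated with visually similar images."""
    return f"metadata for images similar to {image_path}"  # placeholder

# A registry lets the planner select tools by name.
TOOLS = {
    "vision": vision_tool,
    "web_search": web_search_tool,
    "image_search": image_search_tool,
}
```

A registry like `TOOLS` keeps the planner's output (a tool name plus a query) decoupled from the concrete tool implementations.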
AVIS pairs an LLM-powered planner, which chooses a tool and a query at each step, with an LLM-powered reasoner, which analyzes tool outputs and extracts key information. A working memory component retains information gathered throughout the process.
Comparison to Previous Work:
Previous studies have explored integrating tools into LLMs to handle multimodal inputs, but these approaches often struggle in complex scenarios. Separately, there has been growing interest in using LLMs as autonomous agents that interact with an environment and adapt based on real-time feedback. However, such methods place no restrictions on tool invocation, leading to a very large search space. AVIS addresses both challenges by incorporating human decision-making data from a user study to guide the LLM's decisions, resulting in more effective and efficient performance.
Informing LLM Decision Making with a User Study:
To understand how humans make decisions when using external tools, the researchers conducted a user study. Users were given the same set of tools used in AVIS, including PaLI, PaLM, and web search, and their actions and outputs were recorded to guide the AVIS system. The study yielded a transition graph that defines distinct states and restricts the set of actions available at each state, helping AVIS make informed decisions.
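The transition graph can be thought of as a mapping from the current state to the actions the planner is allowed to take next. The states and edges below are invented for illustration; the actual graph in AVIS was derived from the user study.

```python
# Illustrative transition graph: each state (typically the last tool used)
# restricts which actions the planner may choose next. These states and
# edges are hypothetical, not the graph learned from the AVIS user study.

TRANSITIONS = {
    "START": ["image_search", "vision"],
    "image_search": ["web_search", "vision", "ANSWER"],
    "vision": ["web_search", "image_search", "ANSWER"],
    "web_search": ["web_search", "ANSWER"],
}

def allowed_actions(state: str, already_taken: set) -> list:
    """Prune the search space: keep only graph-permitted actions
    that have not already been executed in this episode."""
    return [a for a in TRANSITIONS.get(state, []) if a not in already_taken]
```

Restricting the planner to `allowed_actions` at each step is what keeps the otherwise open-ended tool-invocation search space tractable.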
AVIS operates through a dynamic decision-making strategy that responds to visual information-seeking queries. It consists of three primary components:
1. Planner: Determines the next action, including the appropriate API call and the query.
2. Working memory: Retains information from API executions.
3. Reasoner: Analyzes tool outputs and determines the sufficiency of information for a final response.
The planner consults the transition graph to eliminate irrelevant actions and excludes actions whose results are already stored in the working memory. Combining in-context examples from the user study with the working memory contents, the planner formulates a prompt and sends it to the LLM, which returns a structured answer. This iterative process continues until the final response is generated.
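The iterative process described above can be sketched as a simple loop. Here `plan`, `execute`, and `reason` are hypothetical stand-ins for the LLM-powered planner, the tool invocations, and the LLM-powered reasoner; this is an illustration of the control flow, not the actual AVIS implementation.

```python
# Minimal sketch of the planner/reasoner loop, assuming injected
# plan(), execute(), and reason() callables (hypothetical stand-ins
# for LLM calls and tool executions).

def answer_query(query, plan, execute, reason, max_steps=10):
    working_memory = []  # retains the output of every tool call
    for _ in range(max_steps):
        # Planner: choose the next action (tool name) and its query,
        # given everything gathered so far.
        action, tool_query = plan(query, working_memory)
        output = execute(action, tool_query)
        working_memory.append((action, tool_query, output))
        # Reasoner: decide whether the gathered information suffices.
        done, answer = reason(query, working_memory)
        if done:
            return answer
    return None  # no confident answer within the step budget
```

A `max_steps` budget is a common safeguard in such agent loops, preventing unbounded tool invocation when the reasoner never declares the information sufficient.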
AVIS was evaluated on the Infoseek and OK-VQA datasets, where it outperformed previous baselines, achieving high accuracy on visual question answering without any fine-tuning.
AVIS presents an innovative approach that empowers LLMs to handle knowledge-intensive visual questions. By leveraging human decision-making data and employing a structured framework, AVIS demonstrates the effectiveness of combining LLMs with a variety of tools. This approach enhances the performance and efficiency of visual information seeking tasks, pushing the boundaries of AI capabilities.