Introducing AVIS: Revolutionizing Visual Information Seeking with Large Language Models
Large language models (LLMs) have been adapted with considerable success to multimodal inputs for tasks such as image captioning, visual question answering, and open-vocabulary recognition. However, current visual language models (VLMs) still struggle with visual information seeking tasks, where answering a question about an image requires external knowledge. That’s where AVIS comes in.
AVIS, short for Autonomous Visual Information Seeking with Large Language Models, is a method that achieves state-of-the-art results on visual information seeking tasks. It integrates LLMs with three types of tools: computer vision tools for extracting visual information from images, a web search tool for retrieving open-world knowledge and facts, and an image search tool for gathering relevant information from the metadata of visually similar images.
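To make the three tool types concrete, here is a minimal Python sketch of how they might be grouped behind one interface. Everything in it is an assumption made for illustration: the names ToolKind, Tool, and build_registry are hypothetical, and the lambdas are stand-ins that return placeholder strings rather than calling real vision or search services.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class ToolKind(Enum):
    """The three tool categories described above."""
    VISION = "computer vision"
    WEB_SEARCH = "web search"
    IMAGE_SEARCH = "image search"


@dataclass(frozen=True)
class Tool:
    name: str
    kind: ToolKind
    run: Callable[[str], str]  # takes a query or image reference, returns text


def build_registry() -> Dict[str, Tool]:
    """Build a name -> Tool lookup. The tools are hypothetical stubs."""
    tools = [
        Tool("caption", ToolKind.VISION,
             lambda image: f"caption for {image}"),
        Tool("web_search", ToolKind.WEB_SEARCH,
             lambda query: f"web results for {query}"),
        Tool("image_search", ToolKind.IMAGE_SEARCH,
             lambda image: f"metadata of images similar to {image}"),
    ]
    return {tool.name: tool for tool in tools}


registry = build_registry()
```

Giving every tool the same text-in, text-out signature is one simple way to let a single LLM read any tool's output; the real system may structure its tools differently.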
Here’s a breakdown of how AVIS works:
1. LLM-powered planner: At each step, an LLM-powered planner weighs the available options and chooses which tool to call and what query to send it.
2. LLM-powered reasoner: AVIS employs an LLM-powered reasoner to analyze the outputs of the tools and extract key information. The reasoner determines whether the obtained information is sufficient or if more data is needed.
3. Working memory: AVIS incorporates a working memory component that retains information throughout the process. This allows AVIS to refer back to previous information and make more informed decisions.
Together, these components let AVIS break a complex visual information seeking question into a sequence of tool calls, checking after each call whether it has enough information to answer.
How Does AVIS Compare to Previous Methods?
Previous studies have explored adding tools to LLMs for multimodal inputs, but they often struggle with real-world scenarios. Other methods that apply LLMs as autonomous agents can fall into infinite loops or propagate errors because the search space is so large. AVIS addresses these limitations by guiding the LLM’s decisions with human decision-making data collected in a user study.
Informing LLM Decision Making with a User Study
To understand human decision-making when using external tools, we conducted a user study. Participants were equipped with the same set of tools as AVIS and asked to answer challenging visual questions. We recorded their actions and outputs to guide our system.
The user study helped us construct a transition graph that defines distinct states and restricts the actions available at each state. We also used examples of human decision-making as contextual examples to guide the planner and reasoner.
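A transition graph of this kind can be sketched as a simple lookup from each state to the actions permitted there. The state and action names below are hypothetical, chosen only to illustrate the idea; they are not the graph derived from the user study.

```python
from typing import Dict, Iterable, List

# Hypothetical graph: each state maps to the actions allowed from it.
TRANSITION_GRAPH: Dict[str, List[str]] = {
    "START": ["object_detection", "image_captioning", "image_search"],
    "object_detection": ["image_search", "web_search", "vqa"],
    "image_captioning": ["object_detection", "web_search"],
    "image_search": ["web_search", "vqa"],
    "web_search": ["vqa"],
    "vqa": [],
}


def allowed_actions(state: str, taken: Iterable[str]) -> List[str]:
    """Return the actions permitted from `state`, minus those already taken."""
    already = set(taken)
    return [a for a in TRANSITION_GRAPH.get(state, []) if a not in already]
```

For example, allowed_actions("object_detection", ["image_search"]) leaves only "web_search" and "vqa", so the planner chooses among two options instead of every tool. Pruning choices this way is what keeps the search space small.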
General Framework of AVIS
AVIS uses a dynamic decision-making strategy to respond to visual information seeking queries. It consists of three main components:
1. Planner: The planner determines the next action based on the current state and the available tools. It refers to the transition graph to eliminate irrelevant actions and excludes actions that have already been taken.
2. Working memory: The working memory stores data collected from past tool interactions. It helps the planner make more informed decisions by referring back to previous information.
3. Reasoner: The reasoner analyzes the outputs of the tool execution and determines whether the obtained information is informative, uninformative, or the final answer.
With this framework, AVIS works toward an answer step by step: the planner picks an action, the tool runs, and the reasoner decides whether the output settles the question or another step is needed.
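The interplay of planner, working memory, and reasoner can be sketched as a short loop. This is a toy illustration under stated assumptions, not the actual implementation: plan and reason are rule-based stand-ins for what would be LLM calls, the tools are stubs, the "ANSWER:" prefix is an invented convention, and staying in the same state after an uninformative output is an assumption about how the system recovers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class WorkingMemory:
    state: str = "START"
    taken: List[str] = field(default_factory=list)     # actions already tried
    findings: List[str] = field(default_factory=list)  # informative outputs


def plan(memory: WorkingMemory, graph: Dict[str, List[str]]) -> Optional[str]:
    """Stand-in planner: first permitted action not yet taken.
    A real planner would prompt an LLM with the memory contents."""
    options = [a for a in graph.get(memory.state, []) if a not in memory.taken]
    return options[0] if options else None


def reason(output: str) -> Tuple[str, str]:
    """Stand-in reasoner: label output as final, informative, or uninformative.
    The 'ANSWER:' prefix is a hypothetical convention for this sketch."""
    if output.startswith("ANSWER:"):
        return "final", output[len("ANSWER:"):].strip()
    if output:
        return "informative", output
    return "uninformative", ""


def run(tools: Dict[str, Callable[[WorkingMemory], str]],
        graph: Dict[str, List[str]], max_steps: int = 10) -> Optional[str]:
    memory = WorkingMemory()
    for _ in range(max_steps):  # hard cap guards against endless loops
        action = plan(memory, graph)
        if action is None:
            return None  # no permitted actions left
        memory.taken.append(action)
        verdict, info = reason(tools[action](memory))
        if verdict == "final":
            return info
        if verdict == "informative":
            memory.findings.append(info)
            memory.state = action  # advance in the transition graph
        # Uninformative: stay in the same state and try another action.
    return None


# Hypothetical graph and stub tools for a single toy question.
graph = {
    "START": ["image_search", "object_detection"],
    "object_detection": ["web_search"],
    "image_search": ["web_search"],
}
tools = {
    "image_search": lambda memory: "",  # returns nothing useful
    "object_detection": lambda memory: "building: lighthouse",
    "web_search": lambda memory: "ANSWER: 1870",
}
answer = run(tools, graph)
```

In this toy run the image search returns nothing, so the planner falls back to object detection, stores the result, and then reaches the final answer through web search.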
We evaluated AVIS on the Infoseek and OK-VQA datasets, where it outperformed previous baselines. On Infoseek, AVIS achieved 50.7% accuracy on the unseen entity split without fine-tuning. On OK-VQA, it achieved 60.2% accuracy with few-shot in-context examples.
AVIS enables LLMs to use a variety of tools to answer knowledge-intensive visual questions. Grounded in human decision-making data, it employs a structured framework in which an LLM-powered planner and reasoner decide dynamically which tool to call next and when enough information has been gathered to answer.