The Significance of Common Sense in AI Decision-Making
In our daily lives, we constantly make decisions based on common sense. But can robots and artificial intelligence do the same? This is where embodied agents with common sense come in: agents that follow human instructions successfully by aligning language and perception models to produce executable plans.
Introducing TaPA: A Task Planning Agent for Embodied Tasks
A team of researchers from the Department of Automation and the Beijing National Research Centre for Information Science and Technology has developed a TAsk Planning Agent (TaPA) for embodied tasks with physical scene constraints. The agent generates executable plans by aligning language and perception models with the objects that actually exist in a scene.
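To make the alignment between perception and language concrete, here is a minimal sketch of the perception side: object labels detected across multiple camera views are merged into one deduplicated scene object list that the language planner can be grounded on. The detect_objects helper is hypothetical and simply stands in for whatever off-the-shelf detector is used; this is an illustration of the idea, not the authors' implementation.

```python
# Sketch: merge object labels from several views into a single scene object list.
from collections import Counter

def detect_objects(image):
    """Hypothetical wrapper around an object detector; returns a list of label strings."""
    raise NotImplementedError("plug in your detector of choice")

def collect_scene_objects(images, min_views=1):
    """Aggregate detections from multiple views into one deduplicated object list."""
    counts = Counter()
    for image in images:
        counts.update(set(detect_objects(image)))  # count each label at most once per view
    # Keep labels seen in at least `min_views` views to filter out spurious detections.
    return sorted(label for label, n in counts.items() if n >= min_views)
```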
The researchers claim that TaPA can generate grounded plans without constraining task types or target objects. They created a multimodal dataset of visual scenes, instructions, and corresponding plans, which they used to fine-tune the pre-trained LLaMA network. The fine-tuned network then serves as the task planner, allowing TaPA to generate executable actions step by step from scene information and human instructions.
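The sketch below shows what this planning step could look like in practice: the fine-tuned LLaMA planner receives the list of objects present in the scene together with the human instruction and returns a step-by-step plan. The checkpoint path, prompt format, and decoding settings are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of TaPA-style plan generation with a fine-tuned causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/finetuned-llama-planner"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def generate_plan(scene_objects, instruction, max_new_tokens=256):
    """Produce a step-by-step plan grounded in the objects present in the scene."""
    prompt = (
        "Objects in the scene: " + ", ".join(scene_objects) + "\n"
        "Instruction: " + instruction + "\n"
        "Plan:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return text[len(prompt):].strip()

# Example usage with a perceived object list and a household instruction.
plan = generate_plan(
    ["mug", "coffee machine", "counter", "sink"],
    "Make me a cup of coffee.",
)
print(plan)
```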
Generating the Multimodal Dataset
To build the multimodal dataset, the researchers drew on vision-language models and large multimodal models. The main obstacle was the absence of a large-scale multimodal dataset for training the planning agent, which made embodied task planning grounded in realistic indoor scenes difficult to achieve. They addressed this by prompting GPT-3.5 with the proposed scene representation and a designed prompt to generate a large-scale multimodal dataset for tuning the planning agent.
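As an illustration of this data-generation step, the sketch below prompts GPT-3.5 with a compact scene representation (the object list) and asks it to produce an instruction plus a matching action plan. The prompt wording and the JSON output format are assumptions for the sake of the example, not the paper's exact prompt design.

```python
# Illustrative sketch: generate (instruction, plan) pairs with GPT-3.5 from a scene's object list.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_training_sample(scene_objects):
    """Return one (instruction, plan) pair grounded in the given scene objects."""
    scene_repr = ", ".join(scene_objects)
    prompt = (
        f"The following objects are present in an indoor scene: {scene_repr}.\n"
        "Invent a realistic household instruction that can be completed with these objects, "
        "then list the step-by-step action plan. "
        "Respond as JSON with keys 'instruction' and 'plan'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# Each generated sample would be paired with the scene's visual data to form one
# entry of the multimodal dataset used to fine-tune the LLaMA planner.
sample = generate_training_sample(["sofa", "television", "remote control", "lamp"])
print(sample["instruction"], sample["plan"])
```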
The Achievements of TaPA Agents
According to the researchers, TaPA agents achieve a higher success rate for generated action plans than state-of-the-art LLMs such as LLaMA and GPT-3.5, as well as large multimodal models such as LLaVA. TaPA also shows a better understanding of the objects present in a scene, with significantly fewer hallucination cases than the other models.
The researchers emphasize that the tasks in their collected multimodal dataset are considerably more complex than those in conventional instruction-following benchmarks: they involve longer sequences of implementation steps and call for new optimization methods.
Conclusion
Embodied agents with common sense, like the TaPA agents developed by the researchers, play a crucial role in AI decision-making. By aligning language and perception models, these agents can generate executable plans from human instructions and scene information. TaPA's results show clear gains over existing models in plan success rate and in understanding the objects present in a scene. This research opens up new possibilities for optimizing embodied task planning in complex scenarios.
For more information, you can read the full paper.
If you find this research interesting, make sure to follow us on Twitter for more updates.