Humans interact with the world and each other, and for artificial intelligence (AI) to be useful, it needs to interact effectively with humans and their environment. That’s why we developed the Multimodal Interactive Agent (MIA), which combines visual perception, language comprehension and production, navigation, and manipulation to engage in meaningful interactions with humans.
Our approach builds on previous work by Abramson et al. (2020), which uses imitation learning to train AI agents. After training, MIA demonstrates some basic intelligent behavior that we aim to improve through human feedback. In this work, we focus on developing this intelligent behavioral prior, leaving feedback-based learning for future research.
To facilitate interactions between humans and agents, we created the Playhouse environment, a 3D virtual space with randomized rooms and various interactable objects. In the Playhouse, humans and agents can control virtual robots that move, manipulate objects, and communicate via text. This virtual environment allows for a wide range of dialogues, from simple instructions to creative play.
To collect examples of Playhouse interactions, we used language games where one player receives a prompt to propose a task to the other player. We also included free-form prompts to encourage improvisation. In total, we collected 2.94 years of real-time human interactions in the Playhouse, ensuring a diverse range of behaviors.
Our training strategy combines supervised prediction of human actions (behavioral cloning) with self-supervised learning. When predicting human actions, we found that a hierarchical control strategy significantly improves agent performance. The self-supervised component trains the agent on a classification task over its vision and language inputs. To evaluate agents, we asked human participants to interact with them and judge whether each instruction was carried out successfully. MIA achieved a success rate of over 70% in these human-rated interactions, which is 75% of the success rate that humans themselves achieve.
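To make the behavioral cloning idea concrete, here is a minimal sketch of the objective in its simplest form: the average negative log-likelihood of the human's recorded actions under the agent's policy. It assumes a small discrete action set and hand-written toy data. The function names and the numbers are hypothetical illustrations, not MIA's actual action space, hierarchical controller, or training code.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def behavioral_cloning_loss(policy_logits, human_actions):
    """Mean negative log-likelihood of the human actions under the policy.

    policy_logits: one list of action scores per timestep (hypothetical).
    human_actions: index of the action the human took at each timestep.
    """
    losses = []
    for logits, action in zip(policy_logits, human_actions):
        probs = softmax(logits)
        losses.append(-math.log(probs[action]))
    return sum(losses) / len(losses)


# Toy example (assumed data): 3 timesteps, 4 possible actions.
human_actions = [0, 2, 1]
agreeing_policy = [[3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0], [0.0, 3.0, 0.0, 0.0]]
disagreeing_policy = [[0.0, 3.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0]]

loss_agree = behavioral_cloning_loss(agreeing_policy, human_actions)
loss_disagree = behavioral_cloning_loss(disagreeing_policy, human_actions)
```

A policy that assigns high probability to what the human actually did gets a lower loss than one that does not, so minimizing this quantity pushes the agent toward imitating the demonstrations.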
Scaling up both the dataset and the model size noticeably improved performance. The language domain is known to benefit from large datasets. Our data and our multimodal, multi-task training differ from that setting, yet we still observed performance gains as data and model size increased.
We also investigated how efficiently MIA learns to handle new objects and commands. With fewer than 12 hours of human interaction data, MIA reached ceiling performance on tasks involving new objects. Similarly, MIA reached ceiling performance on tasks involving a new command with only 1 hour of human demonstrations.
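For a sense of scale, the short calculation below compares these amounts of additional data with the full 2.94-year dataset. It is a rough sketch that assumes "2.94 years" means continuous interaction time and uses a 365-day year. It treats 12 hours as an upper bound, since the figure quoted is "less than 12 hours".

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap calendar year

total_hours = 2.94 * HOURS_PER_YEAR  # full dataset, in hours

# Additional data as a fraction of the full dataset.
new_object_fraction = 12 / total_hours   # upper bound for new objects
new_command_fraction = 1 / total_hours   # new command

print(f"Full dataset: about {total_hours:,.0f} hours")
print(f"New objects: at most {new_object_fraction:.3%} of the full dataset")
print(f"New command: about {new_command_fraction:.4%} of the full dataset")
```

Under these assumptions, both adaptation sets are well under a tenth of a percent of the original data.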
MIA exhibits diverse and unexpected behaviors, such as tidying rooms, finding specific objects, and asking clarifying questions. These behaviors continue to inspire us. However, evaluating MIA’s open-ended behavior presents challenges that we will address in future research.
For more details on our work, refer to our paper.