Introducing Gato: A Versatile AI Agent
Gato is an AI agent that goes beyond text generation: it is a multi-modal, multi-task, multi-embodiment generalist policy. With a single network and a single set of weights, Gato can play Atari games, caption images, engage in chat, control a robot arm, and more. It decides which kind of output to produce based on its context, whether that is text, joint torques, button presses, or other tokens.
Training Phase of Gato
During the training phase, data from different tasks and modalities (text, images, proprioception, actions) are serialized into a flat sequence of tokens. These tokens are batched and processed by a transformer neural network, much like a large language model. The training loss is masked so that Gato is only trained to predict action and text targets, not observations.
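The serialization step can be sketched as follows. This is a simplified illustration, not DeepMind's implementation: the mu-law companding parameters (mu=100, M=256, 1024 bins) follow the scheme described in the Gato paper, while `serialize_episode` and its `sep_token` id are hypothetical stand-ins for the real interleaving logic.

```python
import numpy as np

def mu_law_tokenize(x, bins=1024, mu=100.0, m=256.0):
    """Discretize continuous values (e.g. joint torques) into `bins` integer
    tokens via mu-law companding, per the scheme described in the Gato paper."""
    y = np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)
    y = np.clip(y, -1.0, 1.0)                     # clip to [-1, 1]
    return ((y + 1.0) / 2.0 * (bins - 1)).astype(int)  # map to [0, bins-1]

def serialize_episode(obs_seqs, act_seqs, sep_token=2000):
    """Interleave per-timestep observation and action tokens into one flat
    sequence, and build a loss mask that is 1 only on action targets.
    (`sep_token` is a hypothetical separator id, not from the paper.)"""
    tokens, mask = [], []
    for obs, act in zip(obs_seqs, act_seqs):
        tokens += list(obs) + [sep_token] + list(act)
        mask += [0] * (len(obs) + 1) + [1] * len(act)
    return tokens, mask
```

Multiplying the per-token cross-entropy by this mask is what makes Gato predict only action (and, for text data, text) targets while still conditioning on observation tokens.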
When Gato is deployed, a prompt or demonstration is tokenized to form the initial sequence. The environment provides the first observation, which is also tokenized and appended to the sequence. Gato then samples the action vector autoregressively, one token at a time, conditioning on its context, and each sampled token is appended before the next is predicted. This continues until the action vector is complete. Throughout, the model attends to previous observations and actions within a context window of 1024 tokens.
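The sampling loop described above might look like the sketch below. The `model` callable (returning next-token logits for a token window) and the fixed `action_len` are assumptions for illustration; the 1024-token context window is from the text.

```python
import numpy as np

CONTEXT = 1024  # Gato's context window, in tokens

def sample_action(model, history, action_len=4, context=CONTEXT):
    """Sample one action vector token by token. `model` is a hypothetical
    callable mapping a token window to next-token logits; each sampled
    token is appended to `history` before predicting the next one."""
    action = []
    for _ in range(action_len):
        window = history[-context:]          # trailing 1024-token context
        logits = np.asarray(model(window))
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        tok = int(np.random.choice(len(probs), p=probs))
        action.append(tok)
        history.append(tok)                  # condition on the sampled token
    return action
```

Once the action vector is complete, it is de-tokenized and sent to the environment, whose next observation is tokenized and appended, and the loop repeats.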
Gato’s Training Datasets
Gato is trained on a diverse range of datasets, including agent experience in simulated and real-world environments alongside natural language and image datasets. The pretrained model's performance relative to expert scores across these domains is summarized in the bar plot.
With the same weights, the pretrained Gato model can carry out tasks as different as image captioning, interactive dialogue, and robot arm control, demonstrating its versatility as a generalist agent.