RT-2: A Novel Vision-Language-Action Model for Robotic Control and Generalization

Introducing the Robotic Transformer 2 (RT-2): A Game-Changing Model for Robotic Control

The Robotic Transformer 2 (RT-2) is a vision-language-action (VLA) model that learns from both web and robotics data and translates that knowledge into generalized instructions for robotic control. High-capacity vision-language models (VLMs) excel at recognizing visual and language patterns, but a robot cannot reach the same level of competence from web data alone: it also needs firsthand robot data.

In our latest paper, we present the RT-2 model, which leverages both web and robotics data to translate knowledge into instructions for robotic control. This model builds upon the success of the Robotic Transformer 1 (RT-1), which was trained on multi-task demonstrations using data collected over 17 months in an office kitchen environment.

The RT-2 model exhibits improved generalization capabilities, semantic and visual understanding, and rudimentary reasoning. It can interpret new commands and respond to user instructions by reasoning about them, for example about object categories or about which type of drink is best for a tired person.

To achieve robotic control, we adapt VLMs, namely the Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E), as the foundation of the RT-2 model. A model that controls a robot must output actions, so we represent actions as tokens in the model's output, alongside ordinary language tokens, and train the model to produce them.
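To make the idea of actions as tokens concrete, here is a minimal sketch of one way a continuous robot action could be turned into a string of integer tokens and back. The bin count, the normalized value range, and the function names `encode_action` and `decode_action` are assumptions chosen for illustration, not details taken from this article.

```python
from typing import List, Sequence

# Assumed values for illustration only.
NUM_BINS = 256          # number of discrete bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized range of each action dimension


def encode_action(action: Sequence[float],
                  low: float = LOW,
                  high: float = HIGH,
                  num_bins: int = NUM_BINS) -> str:
    """Map each continuous action value to a bin index and join them as text."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)          # keep value inside the range
        fraction = (clipped - low) / (high - low)     # scale to [0, 1]
        bin_index = min(int(fraction * num_bins), num_bins - 1)
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str,
                  low: float = LOW,
                  high: float = HIGH,
                  num_bins: int = NUM_BINS) -> List[float]:
    """Map a string of bin indices back to the center value of each bin."""
    width = (high - low) / num_bins
    return [low + (int(token) + 0.5) * width for token in text.split()]


if __name__ == "__main__":
    encoded = encode_action([-1.0, 0.0, 1.0])
    print(encoded)                 # 0 128 255
    print(decode_action(encoded))  # bin centers, each within half a bin of the input
```

Because the encoded action is plain text, a standard language-model tokenizer can handle it like any other output string; the cost is a quantization error of at most half a bin width per dimension.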

We conducted extensive experiments with over 6,000 robotic trials to assess the RT-2 model’s capabilities. These experiments demonstrated that the model excels in symbol understanding, reasoning, and human recognition. It successfully performs tasks based on visual-semantic concepts and robotic control, even when faced with previously unseen objects or scenarios.

The RT-2 model also outperforms previous baselines, including the RT-1 model and models pre-trained on visual-only tasks. It achieves high success rates on in-distribution tasks and outperforms these baselines on unseen tasks.

Furthermore, we tested the RT-2 model in both simulation and the real world, showcasing its ability to generalize to novel objects and perform well in diverse environments.

One notable feature of the RT-2 model is its chain-of-thought reasoning capability. By combining robotic control with reasoning, the model can plan long-horizon skill sequences and predict robot actions. This empowers the model to handle complex user instructions that require reasoning about intermediate steps.
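As a rough illustration of combining a reasoning step with an action prediction, the sketch below parses a model output that contains a natural-language plan followed by action tokens. The `Plan: ... Action: ...` layout, the example string, and the names `PlannedStep` and `parse_plan_and_action` are hypothetical and only meant to show the idea.

```python
from typing import List, NamedTuple


class PlannedStep(NamedTuple):
    plan: str                 # intermediate reasoning step in natural language
    action_tokens: List[int]  # discrete action tokens predicted after the plan


def parse_plan_and_action(output: str) -> PlannedStep:
    """Split a hypothetical 'Plan: ... Action: ...' output into its two parts."""
    plan_part, separator, action_part = output.partition("Action:")
    if not separator:
        raise ValueError("output does not contain an 'Action:' section")
    plan = plan_part.replace("Plan:", "", 1).strip()
    action_tokens = [int(token) for token in action_part.split()]
    return PlannedStep(plan=plan, action_tokens=action_tokens)


if __name__ == "__main__":
    # Hypothetical model output for illustration.
    step = parse_plan_and_action("Plan: pick up the energy drink. Action: 1 128 91 241")
    print(step.plan)
    print(step.action_tokens)
```

Keeping the plan and the action in one output string means the same model produces both the intermediate reasoning and the low-level command, which is what lets a single instruction drive a multi-step behavior.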

Overall, the Robotic Transformer 2 (RT-2) is a groundbreaking VLA model that transforms vision-language models into powerful tools for robotic control. It offers superior generalization performance, emergent capabilities, and the potential to build general-purpose robots capable of problem-solving and performing various real-world tasks.
