Title: Introducing Robotic Transformer 2: Advancing AI-powered Robotic Control
Introduction:
Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, enabling robots to understand user instructions and act on them. This article introduces RT-2 and highlights its features and capabilities in the field of AI-powered robotic control.
Building on the Success of RT-1:
RT-2 is a vision-language-action model that builds on its predecessor, Robotic Transformer 1 (RT-1). RT-1 was trained on multi-task demonstrations collected with 13 robots over 17 months in an office kitchen environment. Building on RT-1's demonstration data, RT-2 achieves improved generalization, semantic understanding, and visual comprehension, even beyond the robotic data it was exposed to.
Adapting Vision-Language Models:
RT-2 builds upon high-capacity vision-language models (VLMs) that are trained on vast web-scale datasets. These models excel at recognizing visual and language patterns and operating across languages. By adapting the Pathways Language and Image model (PaLI-X) and Pathways Language model Embodied (PaLM-E) as the backbone of RT-2, we enable the model to comprehend and respond to both visual and linguistic inputs.
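To make this interface concrete, the sketch below poses robot control as visual question answering: an image and a text instruction go in, and a string of tokens comes out. The VLAPolicy class, the VLMBackbone type, and its generate method are hypothetical stand-ins for illustration, not the actual PaLI-X or PaLM-E APIs.

```python
from dataclasses import dataclass

# A minimal sketch of the vision-language interface a VLA model inherits
# from its VLM backbone. `VLMBackbone` and `generate` are hypothetical.

@dataclass
class VLAPolicy:
    backbone: "VLMBackbone"  # hypothetical pretrained vision-language model

    def act(self, image, instruction: str) -> str:
        # Pose control as visual question answering: the backbone conditions
        # on the camera image and decodes a token string, which a VLA model
        # then interprets as a robot action (see the next section).
        prompt = f"What action should the robot take to {instruction}?"
        return self.backbone.generate(image=image, text=prompt)
```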
Representing Actions as Tokens:
To train a robot to perform actions, we represent actions as tokens in the model’s output, just like language tokens. By converting actions into strings that standard natural language tokenizers can process, robot actions can be trained on with the same machinery as web text. This representation allows the model to control the robot arm’s position, rotation, and gripper extension based on input commands.
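As an illustration, here is a minimal sketch of what such an action-to-string conversion could look like. The bin count, value ranges, and ordering of dimensions below are assumptions chosen for clarity, not the published RT-2 configuration.

```python
# Illustrative action tokenization: discretize each continuous action
# dimension into integer bins and serialize the bins as a text string.
# NUM_BINS and the value ranges are assumed, not RT-2's actual settings.

NUM_BINS = 256  # assumed discretization resolution per action dimension

def discretize(value: float, low: float, high: float) -> int:
    """Map a continuous value in [low, high] to a bin in [0, NUM_BINS - 1]."""
    value = max(low, min(high, value))
    return int((value - low) / (high - low) * (NUM_BINS - 1))

def action_to_string(terminate: bool, delta_pos, delta_rot, gripper: float) -> str:
    """Serialize one robot action as a space-separated string of integers."""
    tokens = [int(terminate)]
    tokens += [discretize(v, -0.1, 0.1) for v in delta_pos]  # metres, assumed range
    tokens += [discretize(v, -0.5, 0.5) for v in delta_rot]  # radians, assumed range
    tokens.append(discretize(gripper, 0.0, 1.0))             # gripper extension
    return " ".join(str(t) for t in tokens)

# Example: a small translation and rotation with the gripper half open.
print(action_to_string(False, [0.01, 0.0, -0.02], [0.0, 0.0, 0.1], 0.5))
# -> "0 140 127 102 127 127 153 127"
```

Because the result is an ordinary string, a natural language tokenizer can consume it unchanged, which is what lets robot actions share the model’s training pipeline with web text.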
Emergent Skills and Generalization:
RT-2 exhibits emergent skills that surpass those of previous models such as RT-1 and Visual Cortex (VC-1). It combines knowledge drawn from the web with its robotics training data to perform tasks involving symbol understanding, reasoning, and human recognition. The model can handle previously unseen objects, environments, and backgrounds, making it highly adaptable and versatile.
Chain-of-Thought Reasoning:
Inspired by chain-of-thought prompting in large language models, we enhanced RT-2 by combining robotic control with chain-of-thought reasoning. This approach enables the model to plan long-horizon skill sequences and predict robot actions. With this capability, RT-2 can handle complex commands that require reasoning about intermediate steps, opening up possibilities for more advanced tasks.
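A hedged illustration of what such training data could look like: the model emits an intermediate natural-language plan before the action tokens. The field names and example values below are assumptions for illustration, not the exact format used to train RT-2.

```python
# Sketch of a chain-of-thought training example: an instruction, an
# intermediate natural-language plan, then the action token string.
# Field names and values are illustrative assumptions.

def build_cot_example(instruction: str, plan: str, action_tokens: str) -> str:
    """Format one training example with an intermediate reasoning step."""
    return f"Instruction: {instruction}\nPlan: {plan}\nAction: {action_tokens}"

example = build_cot_example(
    instruction="I'm hungry, bring me a snack.",
    plan="pick up the bag of chips",               # intermediate reasoning step
    action_tokens="1 132 114 128 25 156 119 250",  # illustrative token string
)
print(example)
```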
Advancing Robotic Control:
RT-2 demonstrates how vision-language models (VLMs) can be transformed into vision-language-action (VLA) models that control robots directly. By combining VLM pre-training with robotic data, RT-2 achieves markedly improved robotic policies, better generalization, and emergent capabilities. This breakthrough paves the way for general-purpose physical robots that can reason, solve problems, and perform diverse tasks in real-world scenarios.
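The core training idea here is to mix web-scale vision-language data with robot trajectories rather than fine-tune on robot data alone, so the model retains its web knowledge while learning control. The sketch below illustrates such mixing; the dataset names, mixing ratio, and batch size are assumptions, not the published recipe.

```python
import random

# Illustrative sketch of mixing web and robot data during fine-tuning.
# Both sources are already serialized as text tokens (see the action
# tokenization above), so they share one training interface.

def sample_mixed_batch(web_data, robot_data, batch_size=32, robot_fraction=0.5):
    """Draw one training batch containing both web and robot examples."""
    n_robot = int(batch_size * robot_fraction)
    batch = random.choices(robot_data, k=n_robot)
    batch += random.choices(web_data, k=batch_size - n_robot)
    random.shuffle(batch)
    return batch
```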
Conclusion:
Robotic Transformer 2 (RT-2) represents a significant advancement in AI-powered robotic control. By leveraging web and robotics data, RT-2 demonstrates strong generalization, emergent skills, and the ability to reason and solve problems. This model opens up exciting possibilities for the future of robotics and highlights the potential for building versatile robots capable of performing complex tasks in real-world environments.