
KITE Framework: Enabling Fine-Grained Semantic Manipulation for AI Robots

AI Robots: The Integration of Artificial Intelligence and Robotics

Artificial Intelligence (AI) is progressively merging with robotics, producing more capable and useful systems. AI robots, machines that operate in the physical world, can potentially take instructions from humans in natural language. However, two main challenges hinder robots from handling free-form language inputs efficiently.

The first challenge is enabling a robot to reason about what it needs to manipulate based on the provided instructions. The second relates to pick-and-place tasks that require careful discernment when handling objects. For example, a robot may need to pick up a stuffed animal by its ear rather than its leg, or a soap bottle by its dispenser rather than its side.

To address these challenges, researchers from Stanford University have introduced KITE (Keypoints + Instructions to Execution), a two-step framework for semantic manipulation. KITE accounts for both scene semantics and object semantics: scene semantics involve discriminating between different objects in a visual scene, while object semantics involve precisely localizing specific parts within an object instance.

In the first step, KITE grounds an input instruction in the visual scene using 2D image keypoints. This provides a precise, object-centric bias for subsequent action inference: by mapping the command to keypoints in the scene, the robot gains an exact understanding of the relevant objects and their features. In the second step, KITE executes a learned keypoint-conditioned skill based on the RGB-D scene observation. These parameterized skills carry out the given instruction, allowing fine-grained manipulation and generalization to variations in scenes and objects.
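The two-step flow can be sketched as follows. The function names and return values here are illustrative assumptions for exposition, not the authors' actual API:

```python
# Hypothetical sketch of KITE's two-step pipeline. The names
# ground_instruction / execute_skill and the placeholder outputs are
# illustrative assumptions, not the paper's actual interfaces.

def ground_instruction(image, instruction):
    """Step 1: ground the instruction to a 2D image keypoint (row, col)."""
    # Stand-in for the learned grounding model; a fixed pixel keeps
    # the sketch runnable without any trained weights.
    return (42, 17)

def execute_skill(rgbd_observation, keypoint, skill_name):
    """Step 2: run a keypoint-conditioned skill on the RGB-D observation."""
    # A real system would infer 6-DoF waypoints from the point cloud;
    # here we return a placeholder plan anchored at the keypoint.
    return {"skill": skill_name, "keypoint": keypoint, "waypoints": [keypoint]}

instruction = "pick up the stuffed animal by its ear"
kp = ground_instruction(image=None, instruction=instruction)
plan = execute_skill(rgbd_observation=None, keypoint=kp, skill_name="grasp")
print(plan["keypoint"])  # (42, 17)
```

The key design point is that the keypoint is the only interface between the two steps, which is what lets each skill stay simple while the grounding model absorbs the language understanding.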

To evaluate KITE’s performance, the team tested it in three real-world environments: high-precision coffee-making, semantic grasping, and long-horizon 6-DoF tabletop manipulation. KITE achieved a success rate of 71% in coffee-making, 70% in semantic grasping, and 75% in instruction-following for tabletop manipulation, outperforming frameworks that rely on keypoint-based grounding alone or on end-to-end visuomotor control.

KITE achieved these results despite training on the same number of demonstrations or fewer, showcasing its effectiveness and efficiency. The framework employs a CLIPort-style technique to map an image and a language phrase to a saliency heatmap, from which a keypoint is extracted. Skill waypoints are then output by a modified PointNet++ architecture that takes as input a multi-view point cloud annotated with the keypoint. The combination of 2D keypoints and 3D point clouds allows KITE to attend precisely to visual features and plan accordingly.
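As a rough illustration, extracting a keypoint from such a saliency heatmap can be as simple as taking the argmax pixel. The toy heatmap below is a stand-in for the network's output; the shapes and values are assumptions for the sketch:

```python
# Minimal sketch: turn a CLIPort-style saliency heatmap into a 2D keypoint
# by taking the highest-scoring pixel. The heatmap is a toy stand-in for a
# real network output.

def heatmap_to_keypoint(heatmap):
    """Return (row, col) of the highest-scoring pixel in a 2D heatmap."""
    best, best_rc = float("-inf"), (0, 0)
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > best:
                best, best_rc = score, (r, c)
    return best_rc

heatmap = [
    [0.05, 0.10, 0.05],
    [0.10, 0.90, 0.20],  # peak at (1, 1): the grounded keypoint
    [0.05, 0.15, 0.05],
]
print(heatmap_to_keypoint(heatmap))  # (1, 1)
```

In the full system this pixel would then be lifted into 3D and passed, with the multi-view point cloud, to the PointNet++-based skill network.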

In conclusion, the KITE framework presents a promising solution for enabling robots to interpret and follow natural language commands in the context of manipulation. By leveraging key points and instruction grounding, KITE achieves fine-grained semantic manipulation with high precision and generalization.

Check out the Paper and Project for more details.
