Designing and building robots for everyday tasks is an exciting and challenging area of robotics research. A team of researchers from MIT, NVIDIA, and Improbable AI Lab has programmed a Franka Panda robotic arm fitted with a Robotiq 2F-140 parallel-jaw gripper to rearrange objects in a scene and achieve a desired object placement. They built their solution around an iterative training procedure designed to cope with the varied geometry and layouts of real-world scenes.
In real-world scenes, a given object often has multiple valid placements, which makes programming, learning, and deployment difficult. To predict object rearrangements, the researchers use a pose de-noising training procedure: they perturb the final object-scene point cloud with random noise and train a neural network to recover the original, de-noised configuration. They further adopt a multi-step noising process and diffusion-style iterative prediction so the model can represent multi-modal outputs, rather than averaging over several equally valid placements.
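The training idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the linear noise schedule, and the noise scales are all assumptions made for clarity. A random perturbation is applied to a 6-DoF pose (translation plus a rotation vector), and the regression target for the network is simply the perturbation reversed, so that applying the predicted correction recovers the demonstrated placement.

```python
import numpy as np

def noise_pose(translation, rotation_vec, step, num_steps, rng):
    """Perturb a 6-DoF pose (translation + axis-angle rotation vector).

    Later steps apply larger perturbations (a simple linear schedule is
    assumed here; the actual schedule in RPDiff may differ).
    """
    scale = step / num_steps
    t_noise = rng.normal(scale=0.05 * scale, size=3)   # meters (assumed scale)
    r_noise = rng.normal(scale=0.20 * scale, size=3)   # radians (assumed scale)
    return translation + t_noise, rotation_vec + r_noise, (t_noise, r_noise)

def denoising_target(t_noise, r_noise):
    """The network's regression target: undo the applied perturbation.

    A model trained on many (noised pose -> correction) pairs learns to
    step noisy poses back toward valid final placements.
    """
    return -t_noise, -r_noise

# Usage: noise a ground-truth pose, then check the target undoes the noise.
rng = np.random.default_rng(0)
gt_t, gt_r = np.zeros(3), np.zeros(3)
noisy_t, noisy_r, (t_n, r_n) = noise_pose(gt_t, gt_r, step=5, num_steps=5, rng=rng)
corr_t, corr_r = denoising_target(t_n, r_n)
recovered_t, recovered_r = noisy_t + corr_t, noisy_r + corr_r
```

At inference time, the learned correction is applied repeatedly, starting from a heavily noised pose and refining it over several steps.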
Beyond the iterative de-noising process, the model must also generalize to novel scene layouts. To this end, the researchers encode the scene point cloud locally by cropping a region near the object. This helps the model focus on geometry in the object's immediate neighborhood and ignore distant distractors. By iteratively de-noising the object's 6-DoF pose, the model converges on the desired geometric relationship between the object and the scene point cloud.
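The local cropping step can be illustrated with a short sketch. This is an assumed, simplified version: the function name, the use of a fixed spherical radius, and the scales are illustrative choices, not details confirmed by the paper, which may use a different crop shape or an adaptive crop size.

```python
import numpy as np

def crop_scene_near_object(scene_points, object_center, radius):
    """Keep only scene points within `radius` of the object's center.

    Restricting the scene encoding to this local neighborhood lets the
    model attend to nearby geometry (e.g. the shelf slot or rack hook)
    and ignore distant distractors elsewhere in the scene.
    """
    dists = np.linalg.norm(scene_points - object_center, axis=1)
    return scene_points[dists < radius]

# Usage: two points near the object are kept, a far-away point is dropped.
scene = np.array([[0.10, 0.0, 0.0],
                  [0.00, 0.15, 0.0],
                  [2.00, 2.00, 2.0]])   # distant distractor
local = crop_scene_near_object(scene, object_center=np.zeros(3), radius=0.5)
```

As the iterative de-noising proceeds and the pose estimate sharpens, the crop can be recentered on the current object pose so the model always sees the most relevant local context.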
The researchers call their method Relational Pose Diffusion (RPDiff) and use it to perform relational rearrangement on real-world objects and scenes. The model succeeds at tasks such as placing a book on a partially filled bookshelf, stacking a can on an open shelf, and hanging a mug on a rack with many hooks. However, the model has limitations when it comes to working with pre-trained representations of the data.
Overall, the team's work on object rearrangement using RPDiff is a valuable contribution to robotics and AI, offering insight into the challenges of programming and learning in real-world scenes. It also complements other groups' work on object rearrangement from perception, such as Neural Shape Mating (NSM).