MIT researchers have developed a multimodal AI framework called Compositional Foundation Models for Hierarchical Planning (HiP), which can help robots complete complex tasks. The framework combines three separate foundation models, each trained on a massive amount of data, that work together to make decisions and plans. Unlike comparable models, HiP does not require paired vision, language, and action data to function, and it makes the reasoning process more transparent.
The team at CSAIL tested HiP's capabilities and found that it outperformed other models on various manipulation tasks. Its hierarchical planning process combines a language model, a video diffusion model, and an egocentric action model, each pre-trained separately on a different modality of data, incorporating internet-scale knowledge about the environment. The researchers demonstrated that HiP is an inexpensive and effective way to train robots, with real-world potential.
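The three-model pipeline described above can be illustrated with a minimal sketch. The function names and decomposition here are hypothetical stand-ins, not the authors' code: in HiP each stub would be a large pre-trained model (a language model proposing subgoals, a video diffusion model imagining a visual plan, and an egocentric action model extracting actions between frames).

```python
# Illustrative sketch of HiP-style hierarchical planning (stand-in stubs,
# not the authors' implementation).

def language_model(task):
    """Stub task planner: decompose a high-level task into subgoals."""
    return [f"{task} :: {step}" for step in ("locate object", "grasp object", "place object")]

def video_model(subgoal, observation):
    """Stub visual planner: imagine a short video (list of frames) achieving the subgoal."""
    return [f"{subgoal} / frame {i} (from {observation})" for i in range(3)]

def action_model(frames):
    """Stub action planner: infer a low-level action between consecutive frames."""
    return [f"action: '{a}' -> '{b}'" for a, b in zip(frames, frames[1:])]

def hierarchical_plan(task, observation):
    """Compose the three models: task -> subgoals -> imagined frames -> actions."""
    plan = []
    for subgoal in language_model(task):
        frames = video_model(subgoal, observation)
        plan.extend(action_model(frames))
    return plan

plan = hierarchical_plan("put the cup on the shelf", "camera_frame_0")
```

Each level refines the level above it, which is what lets the three models be trained independently on language, video, and action data rather than on paired triples.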
In the future, the team hopes to use HiP to solve real-world long-horizon tasks in robotics. The researchers also suggested that additional models to process touch and sound could further enhance HiP’s abilities. Their findings were presented at the 2023 Conference on Neural Information Processing Systems (NeurIPS).