Imagine asking a robot for a cup of coffee. Instead of issuing step-by-step instructions like "Go to the kitchen, find the coffee machine, and switch it on," you would simply say, "Make a cup of coffee." Current perception systems, however, cannot understand or reason about such implicit intentions. To address this limitation, researchers at Microsoft Research, the University of Hong Kong, and SmartMore have proposed a new task called reasoning segmentation: given a complex, implicit query text, the model must output a segmentation mask, and a benchmark dataset is introduced to evaluate it.
The Reasoning Segmentation Task
In the reasoning segmentation task, the goal is to produce a segmentation mask from a complex and implicit query text. The researchers have developed an assistant called Language Instructed Segmentation Assistant (LISA) which, much like Google Assistant or Siri, responds to natural-language requests. LISA produces segmentation masks by leveraging the language-generation capabilities of a multi-modal Large Language Model (LLM).
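The task's interface can be sketched as a function mapping an image and a free-form query to a binary mask. The sketch below is purely illustrative of that input/output contract; `reasoning_segment` and its placeholder behavior are hypothetical, not LISA's actual API.

```python
import numpy as np

def reasoning_segment(image: np.ndarray, query: str) -> np.ndarray:
    """Toy stand-in for a reasoning segmentation model.

    A real model such as LISA would interpret the implicit query with a
    multi-modal LLM and decode a pixel-level mask; here we return a dummy
    central region just to show the shape of the contract: an (H, W, 3)
    image and a text query in, an (H, W) boolean mask out.
    """
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = True  # placeholder region
    return mask

image = np.zeros((64, 64, 3), dtype=np.uint8)
mask = reasoning_segment(image, "the object used to brew coffee")
print(mask.shape)  # (64, 64)
```

Note that the query is implicit ("the object used to brew coffee") rather than an explicit category name ("coffee machine"), which is what distinguishes this task from ordinary referring segmentation.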
LISA’s Capabilities and Performance
LISA can handle complex reasoning, world knowledge, explanatory answers, and multi-turn conversations. The researchers found that training LISA on reasoning-free datasets and then fine-tuning it on a small set of reasoning-segmentation image-instruction pairs significantly improves its performance. Notably, LISA demonstrates impressive zero-shot ability on the benchmark dataset even without explicit reasoning-segmentation training.
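The training recipe above mixes a large pool of reasoning-free data with a small set of reasoning pairs. A minimal sketch of such mixed sampling is shown below; the dataset names, pool sizes, and the sampling ratio are all assumptions for illustration, not values from the paper.

```python
import random

# Hypothetical pools: mostly reasoning-free referring-segmentation samples,
# plus a small set of reasoning image-instruction pairs (sizes are made up).
reasoning_free = [("referring_seg", i) for i in range(900)]
reasoning_pairs = [("reason_seg", i) for i in range(100)]

def sample_batch(batch_size=8, reasoning_ratio=0.1):
    """Draw a training batch, mixing in reasoning pairs at a small ratio.

    Each element picks the small reasoning pool with probability
    `reasoning_ratio`, otherwise the large reasoning-free pool.
    """
    batch = []
    for _ in range(batch_size):
        pool = reasoning_pairs if random.random() < reasoning_ratio else reasoning_free
        batch.append(random.choice(pool))
    return batch

random.seed(0)
print(len(sample_batch()))  # 8
```

The design intuition is that the large reasoning-free pool teaches the model to ground text in pixels, while the small reasoning set steers it toward implicit queries without requiring expensive large-scale annotation.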
Comparison to Referring Segmentation
The reasoning segmentation task differs from referring segmentation in that it requires the model to possess reasoning ability and access to world knowledge; the model can only perform well if it fully understands the query. The researchers show that their method unlocks new reasoning segmentation capabilities, handling both complex and straightforward queries effectively.
According to the researchers, LISA achieves a performance boost of more than 20% in gIoU (the average of per-image Intersection-over-Union scores) compared to existing models. They also found that a stronger multi-modal Large Language Model (LLM) further improves LISA's performance. Additionally, the researchers demonstrated that the model remains competent on vanilla referring segmentation tasks.
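gIoU is typically computed as the mean of per-image IoU scores over a dataset. A minimal NumPy sketch of that metric:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0

def giou(preds, gts) -> float:
    """gIoU: the mean of per-image IoUs across the evaluation set."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

# Two toy 4x4 examples
p1 = np.zeros((4, 4), dtype=bool); p1[:2, :2] = True
g1 = np.zeros((4, 4), dtype=bool); g1[:2, :] = True   # IoU = 4/8 = 0.5
p2 = np.ones((4, 4), dtype=bool)
g2 = np.ones((4, 4), dtype=bool)                      # IoU = 1.0
print(giou([p1, p2], [g1, g2]))  # 0.75
```

Because gIoU averages per image rather than accumulating intersections and unions over the whole set, every image contributes equally, regardless of how many pixels its target object covers.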
In future work, the researchers plan to focus on the importance of self-reasoning ability for building genuinely intelligent perception systems, emphasizing the need for benchmarks that evaluate and encourage the development of new techniques.
Credit for this research goes to the researchers involved in the project.