Revolutionizing Visual Reasoning: The Potential of VLMs

Big Vision Language Models: A New Approach for Evidential Visual Reasoning

Developing Vision Language Models (VLMs) requires substantial amounts of training data and dedicated design considerations. A new study from Tsinghua University and Zhipu AI explores Chain of Manipulations (CoM), a generic mechanism that enables VLMs to perform evidential visual reasoning. The approach aims to build general multimodal reasoning skills by applying a sequence of manipulations to the visual input, so that each intermediate step produces evidence the model can reason over.
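To make the idea concrete, here is a minimal, hypothetical sketch of a chain of manipulations: a question is answered by applying steps (e.g. grounding a region, then cropping and zooming into it) to the visual input, accumulating evidence along the way. The names and data structures below are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of Chain of Manipulations (CoM) style reasoning.
# All names here are illustrative, not the CogCoM implementation.

from dataclasses import dataclass, field

@dataclass
class VisualState:
    """Tracks the current region of interest and the evidence gathered so far."""
    region: tuple                      # (x0, y0, x1, y1) bounding box in pixels
    evidence: list = field(default_factory=list)

def grounding(state: VisualState, box: tuple) -> VisualState:
    """Manipulation: localize the object referred to in the question."""
    state.region = box
    state.evidence.append(("grounding", box))
    return state

def crop_and_zoom(state: VisualState) -> VisualState:
    """Manipulation: zoom into the grounded region to inspect fine detail."""
    state.evidence.append(("crop_and_zoom", state.region))
    return state

def run_chain(state: VisualState, steps) -> VisualState:
    """Execute the manipulations in sequence, threading the state through."""
    for step in steps:
        state = step(state)
    return state

# Example chain: locate a sign in the image, then zoom in to read it.
final = run_chain(
    VisualState(region=(0, 0, 640, 480)),
    [lambda s: grounding(s, (100, 120, 220, 240)), crop_and_zoom],
)
print(final.region)         # (100, 120, 220, 240)
print(len(final.evidence))  # 2
```

Each manipulation both transforms the visual state and records evidence, mirroring how an evidential reasoning chain exposes its intermediate steps rather than jumping straight to an answer.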

The resulting 17B VLM, CogCoM, is trained with a compatible memory-based architecture on a fusion of four categories of data produced by this mechanism, and is evaluated on benchmarks covering visual grounding, hallucination validation, and reasoning examination. The results show that the approach consistently delivers competitive or superior performance, and by incorporating the produced reasoning chains, CogCoM reaches competitive performance with only a few training steps.

This visual reasoning process may accelerate VLM development for complicated visual problem-solving, and the introduced data generation system could be reused across a variety of training scenarios to advance data-driven machine learning. In summary, the approach opens significant and exciting opportunities for the future of VLMs.

Don’t forget to check out the full paper and GitHub repository for more details. And if you liked what you read, make sure to follow us on Twitter and Google News for more updates!

