The Significance of Multimodal-CoT Reasoning in AI
Large language models (LLMs) have recently shown impressive performance on complex reasoning tasks. A key technique behind this is chain-of-thought (CoT) prompting, in which the demonstrations given to the model include intermediate reasoning steps rather than just final answers. However, most existing CoT work focuses only on the language modality. To elicit CoT reasoning over inputs in multiple modalities, researchers use the Multimodal-CoT paradigm.
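The idea behind CoT prompting can be made concrete with a minimal sketch. The demonstration question and the `build_cot_prompt` helper below are illustrative assumptions, not from the paper; the point is only that the demonstration's answer spells out its intermediate steps before the final answer.

```python
def build_cot_prompt(question: str) -> str:
    """Prepend a worked demonstration whose answer shows its reasoning steps."""
    demonstration = (
        "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
        "How many balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
    )
    # The new question is appended after the demonstration; the LLM is
    # expected to imitate the step-by-step style when completing "A:".
    return demonstration + f"Q: {question}\nA:"

prompt = build_cot_prompt("A bag holds 4 apples. How many apples are in 3 bags?")
print(prompt)
```

In a real system, `prompt` would then be sent to an LLM API; the demonstration nudges the model to generate its own intermediate steps.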
Multimodal-CoT breaks multi-step problems into intermediate reasoning steps even when the inputs span different modalities, such as vision and language. One common method is to convert the non-language inputs into text (for example, by captioning an image) before prompting an LLM to perform CoT. This approach has a clear drawback, however: significant information is lost during the conversion.
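A minimal sketch of this convert-to-one-modality pipeline follows. Both helper functions are hypothetical stand-ins: a real system would call an image-captioning model and an LLM API. The short caption illustrates the information-loss problem, since most visual detail never reaches the LLM.

```python
def caption_image(image_path: str) -> str:
    # Hypothetical captioner; a real one would run an image-to-text model
    # on the file at image_path. The fixed caption stands in for its output.
    return "a bar chart comparing two quantities"

def to_unimodal_prompt(image_path: str, question: str) -> str:
    """Collapse the visual input into text, then build a CoT-style prompt."""
    caption = caption_image(image_path)
    return (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer step by step:"
    )

print(to_unimodal_prompt("chart.png", "Which bar is taller?"))
```

Anything the captioner omits (exact values, colors, spatial layout) is invisible to the downstream LLM, which is the loss the article describes.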
Another approach to CoT reasoning over multiple modalities is to fine-tune small language models that fuse vision and language features. However, such models are prone to hallucinated rationales that mislead answer inference.
To address this issue, Amazon researchers proposed a Multimodal-CoT approach that incorporates visual features into a decoupled training framework. The framework splits the reasoning process into two stages: rationale generation and answer inference. Because visual features inform both stages, the model produces more convincing rationales, which in turn lead to more accurate answer inferences.
An evaluation on the ScienceQA benchmark, a dataset of multimodal science questions, showed that the method outperforms the previous state-of-the-art model, GPT-3.5, by 16%. This work by Amazon is among the first to study CoT reasoning across modalities, and it demonstrates impressive performance.
Multimodal-CoT uses the same model architecture for both the rationale generation and answer inference stages; only the inputs and outputs differ. In the rationale generation stage, a vision-language model receives inputs from both the visual and language domains and produces a rationale. That rationale is then appended to the original language input for the answer inference stage, and the model is trained on this augmented input to generate the final answer. A Transformer-based model that performs encoding, cross-modal interaction, and decoding forms the basis of both stages.
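The two-stage data flow described above can be sketched as follows. The two `generate_*` functions are hypothetical stand-ins for the fine-tuned vision-language model (their canned return strings are placeholders); the sketch only shows how the stage-1 rationale is appended to the language input before stage 2 runs.

```python
def generate_rationale(text: str, vision_features: list) -> str:
    # Stand-in for stage 1: a real model fuses the text encoding with the
    # vision features and decodes a free-text rationale.
    return "The image shows two bars; the left bar is taller."

def generate_answer(text: str, vision_features: list) -> str:
    # Stand-in for stage 2: same architecture, but the input text now
    # already contains the generated rationale.
    return "The answer is (A)."

def multimodal_cot(text: str, vision_features: list) -> str:
    """Run both stages: rationale generation, then answer inference."""
    rationale = generate_rationale(text, vision_features)
    # The rationale is appended to the original language input.
    stage2_input = f"{text}\nRationale: {rationale}"
    return generate_answer(stage2_input, vision_features)

print(multimodal_cot("Which bar is taller?", [0.1, 0.2, 0.3]))
```

Decoupling the stages this way lets the answer model condition on an explicit rationale instead of inferring everything in one pass, which is the mechanism the article credits for reducing hallucinated reasoning.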
In summary, Amazon researchers have tackled the challenge of eliciting Multimodal-CoT reasoning by proposing a two-stage framework that involves fine-tuning language models to combine vision and language representations. This approach generates informative rationales to facilitate accurate answer inference. The GitHub repository for this model is available for further exploration.
Check out the Paper and GitHub repository for more information about this research.