TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Multimodal Chain-of-Thought Reasoning in Language Models

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola

2023-02-02HallucinationScience Question AnsweringLanguage Modelling
PaperPDFCodeCode(official)Code

Abstract

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.

Results

TaskDatasetMetricValueModel
Question AnsweringScienceQAAvg. Accuracy91.68Multimodal CoT
Question AnsweringScienceQAGrades 1-692.44Multimodal CoT
Question AnsweringScienceQAGrades 7-1290.31Multimodal CoT
Question AnsweringScienceQAImage Context88.8Multimodal CoT
Question AnsweringScienceQALanguage Science90.82Multimodal CoT
Question AnsweringScienceQANatural Science95.91Multimodal CoT
Question AnsweringScienceQANo Context92.89Multimodal CoT
Question AnsweringScienceQASocial Science82Multimodal CoT
Question AnsweringScienceQAText Context95.26Multimodal CoT

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment2025-07-21Making Language Model a Hierarchical Classifier and Generator2025-07-17VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning2025-07-17The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations2025-07-17Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities2025-07-17Mitigating Object Hallucinations via Sentence-Level Early Intervention2025-07-16Assay2Mol: large language model-based drug design using BioAssay context2025-07-16Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16