Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

Cheng Tan, Jingxuan Wei, Zhangyang Gao, Linzhuang Sun, Siyuan Li, Ruifeng Guo, Bihui Yu, Stan Z. Li

2023-11-23 · Multimodal Reasoning · Science Question Answering · Visual Question Answering (VQA)

Paper · PDF · Code (official)

Abstract

Multimodal reasoning is a challenging task that requires models to reason across multiple modalities to answer questions. Existing approaches have made progress by incorporating language and visual modalities into a two-stage reasoning framework, separating rationale generation from answer inference. However, these approaches often fall short due to the inadequate quality of the generated rationales. In this work, we delve into the importance of rationales in model reasoning. We observe that when rationales are completely accurate, the model's accuracy significantly improves, highlighting the need for high-quality rationale generation. Motivated by this, we propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers, subsequently selecting the most accurate through a voting process. This approach not only enhances the quality of generated rationales but also leads to more accurate and robust answers. Through extensive experiments, we demonstrate that our approach significantly improves model performance across various benchmarks. Remarkably, we show that even smaller base models, when equipped with our proposed approach, can achieve results comparable to those of larger models, illustrating the potential of our approach in harnessing the power of rationales for improved multimodal reasoning. The code is available at https://github.com/chengtan9907/mc-cot.
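The core of the self-consistency strategy described above can be sketched in a few lines: sample several (rationale, answer) pairs from the model, take a majority vote over the answers, and keep a rationale that supports the winning answer. This is a minimal illustrative sketch, not the authors' implementation; the `sample_fn` callable standing in for a stochastic forward pass of the multimodal model is a hypothetical placeholder.

```python
from collections import Counter

def self_consistency_vote(sample_fn, question, k=5):
    """Illustrative sketch of self-consistency voting (MC-CoT-style).

    `sample_fn(question)` is assumed to return one (rationale, answer)
    pair per stochastic forward pass of the reasoning model.
    We draw k candidates, majority-vote on the answers, and return the
    winning answer with one rationale that led to it.
    """
    candidates = [sample_fn(question) for _ in range(k)]
    counts = Counter(answer for _, answer in candidates)
    best_answer, _ = counts.most_common(1)[0]
    # Keep the first sampled rationale that produced the winning answer.
    rationale = next(r for r, a in candidates if a == best_answer)
    return rationale, best_answer

# Toy usage with a deterministic stub sampler standing in for the model:
stub = iter([
    ("Plants take in CO2, so (B).", "(B)"),
    ("Perhaps oxygen, so (C).", "(C)"),
    ("Photosynthesis uses CO2, hence (B).", "(B)"),
])
rationale, answer = self_consistency_vote(lambda q: next(stub),
                                          "Which gas do plants absorb?", k=3)
# answer -> "(B)" after the 2-vs-1 majority vote
```

At inference time this trades k forward passes for robustness: a single low-quality rationale can no longer derail the final answer, which is the effect the paper attributes to improved rationale quality.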

Results

Task                             | Dataset   | Metric           | Value | Model
Question Answering               | ScienceQA | Avg. Accuracy    | 94.88 | MC-CoT F-Large
Question Answering               | ScienceQA | Grades 1-6       | 95.3  | MC-CoT F-Large
Question Answering               | ScienceQA | Grades 7-12      | 94.13 | MC-CoT F-Large
Question Answering               | ScienceQA | Image Context    | 93.75 | MC-CoT F-Large
Question Answering               | ScienceQA | Language Science | 93.18 | MC-CoT F-Large
Question Answering               | ScienceQA | Natural Science  | 97.47 | MC-CoT F-Large
Question Answering               | ScienceQA | No Context       | 94.49 | MC-CoT F-Large
Question Answering               | ScienceQA | Social Science   | 90.44 | MC-CoT F-Large
Question Answering               | ScienceQA | Text Context     | 96.97 | MC-CoT F-Large
Visual Question Answering (VQA)  | A-OKVQA   | MC Accuracy      | 71    | MC-CoT

Related Papers

EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent (2025-07-21)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs (2025-07-10)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)