Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan

2022-09-20 · Question Answering · Multimodal Reasoning · Multimodal Deep Learning · Science Question Answering · Open-Domain Question Answering · Visual Question Answering (VQA) · Visual Commonsense Reasoning · Multiple-choice

Abstract

When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.
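
The QCM→ALE notation in the results below describes the input/output format: the model receives the Question, Context, and Multiple options (QCM) and is prompted or trained to produce the Answer followed by the Lecture and Explanation (ALE) as its chain of thought (QCM→AE drops the lecture, QCM→A predicts the answer only). The sketch below illustrates how such a few-shot prompt could be assembled; the template wording, field names, and demo records are illustrative assumptions, not the authors' exact prompt.

```python
# Minimal sketch of a few-shot QCM→ALE prompt, assuming simple dict records.
# The exact template and field names are assumptions for illustration only.

def build_qcm_ale_prompt(examples, query):
    """Assemble a few-shot QCM→ALE prompt from solved examples plus a new query."""
    def format_item(item, with_solution):
        options = " ".join(
            f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(item["choices"])
        )
        block = (
            f"Question: {item['question']}\n"
            f"Context: {item.get('context', 'N/A')}\n"
            f"Options: {options}\n"
        )
        if with_solution:
            # Answer first, then lecture + explanation as the chain of thought.
            block += (
                f"Answer: The answer is ({item['answer']}). "
                f"BECAUSE: {item['lecture']} {item['explanation']}\n"
            )
        else:
            block += "Answer:"
        return block

    shots = "\n".join(format_item(ex, with_solution=True) for ex in examples)
    return shots + "\n" + format_item(query, with_solution=False)


if __name__ == "__main__":
    # Illustrative demo records, not taken verbatim from the dataset.
    demo = [{
        "question": "Which of these states is farthest north?",
        "context": "N/A",
        "choices": ["West Virginia", "Louisiana", "Arizona", "Oklahoma"],
        "answer": "A",
        "lecture": "Maps have four cardinal directions: north, south, east, and west.",
        "explanation": "West Virginia is farthest north of these states.",
    }]
    query = {
        "question": "Which property do these objects have in common?",
        "choices": ["hard", "soft", "yellow"],
    }
    print(build_qcm_ale_prompt(demo, query))
```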

Results

Task | Dataset | Metric | Value | Model
Question Answering | ScienceQA | Avg. Accuracy | 75.17 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Grades 1-6 | 78.23 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Grades 7-12 | 69.68 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Image Context | 67.43 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Language Science | 78.09 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Natural Science | 75.44 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | No Context | 79.93 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Social Science | 70.87 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Text Context | 74.68 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Avg. Accuracy | 74.61 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Grades 1-6 | 78.49 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Grades 7-12 | 67.63 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Image Context | 66.09 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Language Science | 77.55 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Natural Science | 76.6 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | No Context | 79.58 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Social Science | 65.92 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Text Context | 75.51 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Avg. Accuracy | 74.11 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Grades 1-6 | 77.06 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Grades 7-12 | 68.82 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Image Context | 66.53 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Language Science | 78.91 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Natural Science | 71 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | No Context | 81.81 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Social Science | 76.04 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Text Context | 66.42 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Avg. Accuracy | 73.97 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Grades 1-6 | 76.8 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Grades 7-12 | 68.89 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Image Context | 67.28 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Language Science | 76 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Natural Science | 74.64 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | No Context | 77.42 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Social Science | 69.74 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Text Context | 74.44 | GPT-3 (QCM→A, 2-shot)
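
Each model above is reported with one overall accuracy (Avg. Accuracy) plus accuracies restricted to question classes: subject (Natural/Social/Language Science), context modality (Text/Image/No Context), and grade band (1-6, 7-12). A minimal sketch of that per-category aggregation follows; the record field names are assumptions for illustration and do not necessarily match the released ScienceQA metadata schema.

```python
# Minimal sketch of per-category accuracy aggregation, assuming each scored
# prediction carries a boolean 'correct' flag and a list of category labels.
from collections import defaultdict

def accuracy_by_category(records):
    """Return {category: accuracy in %} including an overall 'Avg. Accuracy'."""
    totals, hits = defaultdict(int), defaultdict(int)
    for rec in records:
        for cat in ["Avg. Accuracy"] + rec["categories"]:
            totals[cat] += 1
            hits[cat] += int(rec["correct"])
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}


if __name__ == "__main__":
    # Illustrative scored predictions, not real evaluation output.
    preds = [
        {"correct": True,  "categories": ["Natural Science", "Image Context", "Grades 1-6"]},
        {"correct": False, "categories": ["Social Science", "Text Context", "Grades 7-12"]},
        {"correct": True,  "categories": ["Language Science", "No Context", "Grades 1-6"]},
    ]
    for cat, acc in sorted(accuracy_by_category(preds).items()):
        print(f"{cat}: {acc:.2f}")
```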

Related Papers

EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)