Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan

2022-09-20 · Question Answering · Multimodal Reasoning · Multimodal Deep Learning · Science Question Answering · Open-Domain Question Answering · Visual Question Answering (VQA) · Visual Commonsense Reasoning · Multiple-choice

Abstract

When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.
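
The QCM→ALE notation in the results below describes the input/output format: the model receives the Question, Context, and Multiple options (QCM) and is prompted or trained to produce the Answer followed by the Lecture and Explanation (ALE) as its chain of thought (QCM→AE drops the lecture, QCM→A predicts the answer only). The sketch below illustrates how such a few-shot prompt could be assembled; the template wording, field names, and demo records are illustrative assumptions, not the authors' exact prompt.

```python
# Minimal sketch of a few-shot QCM→ALE prompt, assuming simple dict records.
# The exact template and field names are assumptions for illustration only.

def build_qcm_ale_prompt(examples, query):
    """Assemble a few-shot QCM→ALE prompt from solved examples plus a new query."""
    def format_item(item, with_solution):
        options = " ".join(
            f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(item["choices"])
        )
        block = (
            f"Question: {item['question']}\n"
            f"Context: {item.get('context', 'N/A')}\n"
            f"Options: {options}\n"
        )
        if with_solution:
            # Answer first, then lecture + explanation as the chain of thought.
            block += (
                f"Answer: The answer is ({item['answer']}). "
                f"BECAUSE: {item['lecture']} {item['explanation']}\n"
            )
        else:
            block += "Answer:"
        return block

    shots = "\n".join(format_item(ex, with_solution=True) for ex in examples)
    return shots + "\n" + format_item(query, with_solution=False)


if __name__ == "__main__":
    # Illustrative demo records, not taken verbatim from the dataset.
    demo = [{
        "question": "Which of these states is farthest north?",
        "context": "N/A",
        "choices": ["West Virginia", "Louisiana", "Arizona", "Oklahoma"],
        "answer": "A",
        "lecture": "Maps have four cardinal directions: north, south, east, and west.",
        "explanation": "West Virginia is farthest north of these states.",
    }]
    query = {
        "question": "Which property do these objects have in common?",
        "choices": ["hard", "soft", "yellow"],
    }
    print(build_qcm_ale_prompt(demo, query))
```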

Results

Task | Dataset | Metric | Value | Model
Question Answering | ScienceQA | Avg. Accuracy | 75.17 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Grades 1-6 | 78.23 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Grades 7-12 | 69.68 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Image Context | 67.43 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Language Science | 78.09 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Natural Science | 75.44 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | No Context | 79.93 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Social Science | 70.87 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Text Context | 74.68 | GPT-3 - CoT (QCM→ALE, 2-shot)
Question Answering | ScienceQA | Avg. Accuracy | 74.61 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Grades 1-6 | 78.49 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Grades 7-12 | 67.63 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Image Context | 66.09 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Language Science | 77.55 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Natural Science | 76.6 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | No Context | 79.58 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Social Science | 65.92 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Text Context | 75.51 | GPT-3 - CoT (QCM→AE, 2-shot)
Question Answering | ScienceQA | Avg. Accuracy | 74.11 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Grades 1-6 | 77.06 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Grades 7-12 | 68.82 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Image Context | 66.53 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Language Science | 78.91 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Natural Science | 71 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | No Context | 81.81 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Social Science | 76.04 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Text Context | 66.42 | UnifiedQA-BASE - CoT (QCM→ALE)
Question Answering | ScienceQA | Avg. Accuracy | 73.97 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Grades 1-6 | 76.8 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Grades 7-12 | 68.89 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Image Context | 67.28 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Language Science | 76 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Natural Science | 74.64 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | No Context | 77.42 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Social Science | 69.74 | GPT-3 (QCM→A, 2-shot)
Question Answering | ScienceQA | Text Context | 74.44 | GPT-3 (QCM→A, 2-shot)
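
Each model above is reported with one overall accuracy (Avg. Accuracy) plus accuracies restricted to question classes: subject (Natural/Social/Language Science), context modality (Text/Image/No Context), and grade band (1-6, 7-12). A minimal sketch of that per-category aggregation follows; the record field names are assumptions for illustration and do not necessarily match the released ScienceQA metadata schema.

```python
# Minimal sketch of per-category accuracy aggregation, assuming each scored
# prediction carries a boolean 'correct' flag and a list of category labels.
from collections import defaultdict

def accuracy_by_category(records):
    """Return {category: accuracy in %} including an overall 'Avg. Accuracy'."""
    totals, hits = defaultdict(int), defaultdict(int)
    for rec in records:
        for cat in ["Avg. Accuracy"] + rec["categories"]:
            totals[cat] += 1
            hits[cat] += int(rec["correct"])
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}


if __name__ == "__main__":
    # Illustrative scored predictions, not real evaluation output.
    preds = [
        {"correct": True,  "categories": ["Natural Science", "Image Context", "Grades 1-6"]},
        {"correct": False, "categories": ["Social Science", "Text Context", "Grades 7-12"]},
        {"correct": True,  "categories": ["Language Science", "No Context", "Grades 1-6"]},
    ]
    for cat, acc in sorted(accuracy_by_category(preds).items()):
        print(f"{cat}: {acc:.2f}")
```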

Related Papers

EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)