Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, Jiebo Luo

Published: 2018-03-20 · ECCV 2018
Tasks: Question Answering · Explanatory Visual Question Answering · Multi-Task Learning · Visual Question Answering (VQA) · Visual Question Answering
Paper · PDF

Abstract

Most existing work in visual question answering (VQA) is dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is as important as, or even more important than, the answer itself, since it makes the question-answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where computational models are required to generate an explanation along with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We have conducted a user study to validate the quality of the explanations synthesized by our method. We quantitatively show that the additional supervision from explanations not only produces insightful textual sentences to justify the answers, but also improves the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.

Results

The leaderboard lists identical values for the VQAE model under three task entries (Visual Question Answering (VQA), Visual Question Answering, and Explanatory Visual Question Answering), so they are shown once below.

Dataset | Metric    | Value | Model
GQA-REX | BLEU-4    | 42.56 | VQAE
GQA-REX | CIDEr     | 358.2 | VQAE
GQA-REX | GQA-test  | 57.24 | VQAE
GQA-REX | GQA-val   | 65.19 | VQAE
GQA-REX | Grounding | 31.29 | VQAE
GQA-REX | METEOR    | 34.51 | VQAE
GQA-REX | ROUGE-L   | 73.59 | VQAE
GQA-REX | SPICE     | 40.39 | VQAE

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)