Dynamic Memory Networks for Visual and Textual Question Answering

Caiming Xiong, Stephen Merity, Richard Socher

2016-03-04Question Answering Visual Question Answering (VQA)Visual Question Answering

Paper PDF Code Code Code Code Code Code Code Code Code Code

Abstract

Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering. One such architecture, the dynamic memory network (DMN), obtained high accuracy on a variety of language tasks. However, it was not shown whether the architecture achieves strong results for question answering when supporting facts are not marked during training or whether it could be applied to other modalities such as images. Based on an analysis of the DMN, we propose several improvements to its memory and input modules. Together with these changes we introduce a novel input module for images in order to be able to answer visual questions. Our new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the \babi-10k text question-answering dataset without supporting fact supervision.

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	VQA v1 test-std	Accuracy	60.4	DMN+
Visual Question Answering (VQA)	COCO Visual Question Answering (VQA) real images 1.0 open ended	Percentage correct	60.4	DMN+ [xiong2016dynamic]
Visual Question Answering (VQA)	VQA v1 test-dev	Accuracy	60.3	DMN+

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17 Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering2025-07-17 Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It2025-07-17 City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning2025-07-17 VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning2025-07-17 Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16 Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility2025-07-16 MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM2025-07-16