Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning to Reason: End-to-End Module Networks for Visual Question Answering

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko

Published: 2017-04-18 · ICCV 2017
Tasks: Visual Dialog · Visual Question Answering (VQA)
Links: Paper · PDF · Code

Abstract

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules that each solve one subtask. However, existing NMN implementations rely on brittle off-the-shelf parsers, and are restricted to the module configurations proposed by these parsers rather than learning them from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures (by imitating expert demonstrations) while simultaneously learning network parameters (using the downstream task loss). Experimental results on the new CLEVR dataset targeted at compositional question answering show that N2NMNs achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches, while discovering interpretable network architectures specialized for each question.
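The abstract's running example can be made concrete with a small sketch. This is not the authors' implementation: the real N2NMN modules operate on CNN feature maps and attention maps, and the layout is predicted by a learned policy. Here, a toy symbolic scene and hand-written `find`, `count`, and `compare_equal` modules (all hypothetical names) illustrate how a predicted layout assembles reusable modules into a question-specific program.

```python
# Hypothetical sketch of N2NMN-style module composition (not the paper's code).
# Modules act on a toy symbolic scene instead of image features.
from typing import Callable, List

Scene = List[str]  # a scene is just a list of object labels

def find(label: str) -> Callable[[Scene], List[str]]:
    """Attention-like module: select objects matching `label`."""
    return lambda scene: [obj for obj in scene if obj == label]

def count(selected: List[str]) -> int:
    """Counting module: reduce a selection to a number."""
    return len(selected)

def compare_equal(a: int, b: int) -> str:
    """Comparison module: answer yes/no for equality."""
    return "yes" if a == b else "no"

def run_layout(scene: Scene) -> str:
    """One predicted layout for 'is there an equal number of balls
    and boxes?': compare(count(find[ball]), count(find[box]))."""
    return compare_equal(count(find("ball")(scene)),
                         count(find("box")(scene)))

scene = ["ball", "box", "ball", "box", "cylinder"]
print(run_layout(scene))  # 2 balls vs. 2 boxes -> "yes"
```

In the actual model, the layout (which modules to use and how to wire them) is predicted per question by a sequence model trained first by imitating expert demonstrations and then refined with the downstream task loss, rather than being hard-coded as above.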

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 64.9 | N2NMN (ResNet-152, policy search) |
| Visual Dialog | Visual Dialog v1.0 test-std | NDCG (x 100) | 58.1 | NMN |
| Visual Dialog | Visual Dialog v1.0 test-std | MRR (x 100) | 58.8 | NMN |
| Visual Dialog | Visual Dialog v1.0 test-std | R@1 | 44.15 | NMN |
| Visual Dialog | Visual Dialog v1.0 test-std | R@5 | 76.88 | NMN |
| Visual Dialog | Visual Dialog v1.0 test-std | R@10 | 86.88 | NMN |
| Visual Dialog | Visual Dialog v1.0 test-std | Mean rank | 4.4 | NMN |

Related Papers

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling (2025-07-08)