
Recursive Visual Attention in Visual Dialog

Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, Ji-Rong Wen

2018-12-06 · CVPR 2019 · Tasks: Question Answering, Visual Dialog, Visual Question Answering (VQA)

Links: Paper · PDF · Code (official)

Abstract

Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image. It typically needs to address two major problems: (1) how to answer visually-grounded questions, which is the core challenge in visual question answering (VQA); (2) how to infer the co-reference between questions and the dialog history. An example of visual co-reference is: pronouns (e.g., "they") in the question (e.g., "Are they on or off?") are linked with nouns (e.g., "lamps") appearing in the dialog history (e.g., "How many lamps are there?") and the object grounded in the image. In this work, to resolve the visual co-reference for visual dialog, we propose a novel attention mechanism called Recursive Visual Attention (RvA). Specifically, our dialog agent browses the dialog history until the agent has sufficient confidence in the visual co-reference resolution, and refines the visual attention recursively. The quantitative and qualitative experimental results on the large-scale VisDial v0.9 and v1.0 datasets demonstrate that the proposed RvA not only outperforms the state-of-the-art methods, but also achieves reasonable recursion and interpretable attention maps without additional annotations. The code is available at https://github.com/yuleiniu/rva.
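The recursion the abstract describes (walk back through the dialog history until the current question can be grounded on its own, then refine the visual attention) is easy to sketch. Below is a minimal illustrative sketch, not the authors' implementation (see the official repository for that); the functions `attend`, `grounding_confidence`, and the threshold `TAU` are all hypothetical stand-ins.

```python
# Illustrative sketch of Recursive Visual Attention (RvA), based only on the
# abstract's description. NOT the official implementation
# (https://github.com/yuleiniu/rva); all names below are hypothetical.
import numpy as np

TAU = 0.5  # hypothetical confidence threshold for "question is self-contained"

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(image_regions, query):
    """Plain dot-product attention of a query vector over image region features."""
    return softmax(image_regions @ query)  # shape: (num_regions,)

def grounding_confidence(question_feat):
    """Hypothetical scorer: confidence that the question can be grounded
    directly, without resolving pronouns through the dialog history."""
    return float(1.0 / (1.0 + np.exp(-question_feat.mean())))

def recursive_visual_attention(image_regions, question_feats, t):
    """Resolve visual attention for dialog round t.

    Base case: the question looks self-contained, so attend directly.
    Recursive case: fall back on the attention inferred for the previous
    round (where the referenced noun appeared) and refine it.
    """
    direct = attend(image_regions, question_feats[t])
    if t == 0 or grounding_confidence(question_feats[t]) > TAU:
        return direct
    prev = recursive_visual_attention(image_regions, question_feats, t - 1)
    # Refine: combine the inherited and direct attentions multiplicatively.
    return softmax(np.log(prev + 1e-8) + np.log(direct + 1e-8))

# Toy usage: 4 image regions with 8-d features, 3 dialog rounds.
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))
questions = rng.normal(size=(3, 8))
print(recursive_visual_attention(regions, questions, t=2))
```

The key design point, per the abstract, is that the recursion bottoms out as soon as the agent is confident: questions without pronouns are answered with direct attention, while co-referential questions inherit and refine attention from the round they refer back to.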

Results

All results are for the RvA model. The original page lists identical numbers under both the "Dialogue" and "Visual Dialog" tasks, so they are shown once below.

| Dataset | Metric | Value |
| --- | --- | --- |
| VisDial v0.9 val | MRR | 0.6634 |
| VisDial v0.9 val | R@1 | 52.71 |
| VisDial v0.9 val | R@5 | 82.97 |
| VisDial v0.9 val | R@10 | 90.73 |
| VisDial v0.9 val | Mean Rank | 3.93 |
| VisDial v1.0 test-std | NDCG (x 100) | 55.59 |
| VisDial v1.0 test-std | MRR (x 100) | 63.03 |
| VisDial v1.0 test-std | R@1 | 49.03 |
| VisDial v1.0 test-std | R@5 | 80.4 |
| VisDial v1.0 test-std | R@10 | 89.83 |
| VisDial v1.0 test-std | Mean Rank | 4.18 |

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)