
Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Gi-Cheon Kang, Jaeseo Lim, Byoung-Tak Zhang

2019-02-25 · IJCNLP 2019
Tasks: Question Answering, Visual Grounding, AI Agent, Visual Dialog, Visual Question Answering (VQA)
Links: Paper · PDF · Code (official) · Code

Abstract

Visual dialog (VisDial) is a task that requires an AI agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), answering each question requires capturing temporal context from the dialog history as well as exploiting visually grounded information. A key challenge, called visual reference resolution, is to resolve ambiguous references in a given question and locate those references in the given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution. DAN consists of two kinds of attention networks, REFER and FIND. Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism. The FIND module takes image features and the reference-aware representations (i.e., the output of the REFER module) as input, and performs visual grounding via a bottom-up attention mechanism. We evaluate our model qualitatively and quantitatively on the VisDial v1.0 and v0.9 datasets, showing that DAN outperforms the previous state-of-the-art model by a significant margin.
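To make the two-module design concrete, the following is a minimal PyTorch sketch of attention modules in the spirit of REFER and FIND as described in the abstract. All names, dimensions, and layer choices (ReferModule, FindModule, hidden_dim, region_dim) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the REFER/FIND data flow; shapes and layers are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferModule(nn.Module):
    """Relates the current question to the dialog history via attention."""

    def __init__(self, hidden_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, question, history):
        # question: (B, 1, D) encoded question; history: (B, T, D) encoded dialog rounds
        attended, _ = self.attn(query=question, key=history, value=history)
        # Reference-aware representation: the question fused with the attended history
        return torch.tanh(self.fuse(torch.cat([question, attended], dim=-1)))


class FindModule(nn.Module):
    """Grounds the reference-aware representation in bottom-up image features."""

    def __init__(self, hidden_dim=512, region_dim=2048):
        super().__init__()
        self.proj = nn.Linear(region_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, ref_repr):
        # regions: (B, K, region_dim) bottom-up region features; ref_repr: (B, 1, D)
        r = self.proj(regions)                         # (B, K, D)
        logits = self.score(torch.tanh(r + ref_repr))  # (B, K, 1) attention scores
        weights = F.softmax(logits, dim=1)             # attention over the K regions
        return (weights * r).sum(dim=1)                # (B, D) visually grounded context
```

This sketch only mirrors the data flow the abstract describes (question attends over the dialog history, and the resulting reference-aware representation guides attention over image regions); it omits the encoders, answer decoder, and training details.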

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Dialog / Dialogue | VisDial v0.9 val | MRR | 66.38 | DAN |
| Visual Dialog / Dialogue | VisDial v0.9 val | R@1 | 53.33 | DAN |
| Visual Dialog / Dialogue | VisDial v0.9 val | R@5 | 82.42 | DAN |
| Visual Dialog / Dialogue | VisDial v0.9 val | R@10 | 90.38 | DAN |
| Visual Dialog / Dialogue | VisDial v0.9 val | Mean Rank | 4.04 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | NDCG (x 100) | 57.59 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | MRR (x 100) | 63.2 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | R@1 | 49.63 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | R@5 | 79.75 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | R@10 | 89.35 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | Mean | 4.3 | DAN |
| Facial Recognition and Modelling / Face Reconstruction / 3D / 3D Face Modelling / 3D Face Reconstruction | 300W Split 2 | NME (inter-ocular) | 4.3 | DAN |
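The retrieval metrics above (MRR, R@k, Mean Rank) are standard for VisDial, where the model ranks the 100 candidate answers provided for each question. As a rough illustration, they can be computed from the rank assigned to the ground-truth answer, as in the sketch below; visdial_ranking_metrics is a hypothetical helper, and NDCG (which additionally requires dense relevance annotations) is not shown.

```python
import numpy as np


def visdial_ranking_metrics(gt_ranks):
    """Compute MRR, R@{1,5,10}, and mean rank from 1-indexed ground-truth ranks.

    gt_ranks: for each dialog round, the rank of the correct answer among the
    candidate answers (1 = ranked first).
    """
    ranks = np.asarray(gt_ranks, dtype=np.float64)
    return {
        "MRR": float(np.mean(1.0 / ranks)),
        "R@1": float(np.mean(ranks <= 1)),
        "R@5": float(np.mean(ranks <= 5)),
        "R@10": float(np.mean(ranks <= 10)),
        "Mean Rank": float(np.mean(ranks)),
    }


# Example: three rounds where the correct answer was ranked 1st, 3rd, and 12th.
print(visdial_ranking_metrics([1, 3, 12]))
```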

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)