
Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Gi-Cheon Kang, Jaeseo Lim, Byoung-Tak Zhang

2019-02-25 · IJCNLP 2019
Tasks: Question Answering, Visual Grounding, AI Agent, Visual Dialog, Visual Question Answering (VQA)
Links: Paper · PDF · Code (official) · Code

Abstract

Visual dialog (VisDial) is a task that requires an AI agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), answering each question requires capturing temporal context from the dialog history as well as exploiting visually grounded information. A key challenge, called visual reference resolution, is to resolve ambiguous references in a given question and locate those references in the given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution. DAN consists of two kinds of attention networks, REFER and FIND. Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism. The FIND module takes image features and the reference-aware representations (i.e., the output of the REFER module) as input, and performs visual grounding via a bottom-up attention mechanism. We evaluate our model qualitatively and quantitatively on the VisDial v1.0 and v0.9 datasets, showing that DAN outperforms the previous state-of-the-art model by a significant margin.
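To make the two-module design concrete, the following is a minimal PyTorch sketch of attention modules in the spirit of REFER and FIND as described in the abstract. All names, dimensions, and layer choices (ReferModule, FindModule, hidden_dim, region_dim) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the REFER/FIND data flow; shapes and layers are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferModule(nn.Module):
    """Relates the current question to the dialog history via attention."""

    def __init__(self, hidden_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, question, history):
        # question: (B, 1, D) encoded question; history: (B, T, D) encoded dialog rounds
        attended, _ = self.attn(query=question, key=history, value=history)
        # Reference-aware representation: the question fused with the attended history
        return torch.tanh(self.fuse(torch.cat([question, attended], dim=-1)))


class FindModule(nn.Module):
    """Grounds the reference-aware representation in bottom-up image features."""

    def __init__(self, hidden_dim=512, region_dim=2048):
        super().__init__()
        self.proj = nn.Linear(region_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, ref_repr):
        # regions: (B, K, region_dim) bottom-up region features; ref_repr: (B, 1, D)
        r = self.proj(regions)                         # (B, K, D)
        logits = self.score(torch.tanh(r + ref_repr))  # (B, K, 1) attention scores
        weights = F.softmax(logits, dim=1)             # attention over the K regions
        return (weights * r).sum(dim=1)                # (B, D) visually grounded context
```

This sketch only mirrors the data flow the abstract describes (question attends over the dialog history, and the resulting reference-aware representation guides attention over image regions); it omits the encoders, answer decoder, and training details.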

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Dialog / Dialogue | VisDial v0.9 val | MRR | 66.38 | DAN |
| Visual Dialog / Dialogue | VisDial v0.9 val | R@1 | 53.33 | DAN |
| Visual Dialog / Dialogue | VisDial v0.9 val | R@5 | 82.42 | DAN |
| Visual Dialog / Dialogue | VisDial v0.9 val | R@10 | 90.38 | DAN |
| Visual Dialog / Dialogue | VisDial v0.9 val | Mean Rank | 4.04 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | NDCG (x 100) | 57.59 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | MRR (x 100) | 63.2 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | R@1 | 49.63 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | R@5 | 79.75 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | R@10 | 89.35 | DAN |
| Visual Dialog / Dialogue | Visual Dialog v1.0 test-std | Mean | 4.3 | DAN |
| Facial Recognition and Modelling / Face Reconstruction / 3D / 3D Face Modelling / 3D Face Reconstruction | 300W Split 2 | NME (inter-ocular) | 4.3 | DAN |
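The retrieval metrics above (MRR, R@k, Mean Rank) are standard for VisDial, where the model ranks the 100 candidate answers provided for each question. As a rough illustration, they can be computed from the rank assigned to the ground-truth answer, as in the sketch below; visdial_ranking_metrics is a hypothetical helper, and NDCG (which additionally requires dense relevance annotations) is not shown.

```python
import numpy as np


def visdial_ranking_metrics(gt_ranks):
    """Compute MRR, R@{1,5,10}, and mean rank from 1-indexed ground-truth ranks.

    gt_ranks: for each dialog round, the rank of the correct answer among the
    candidate answers (1 = ranked first).
    """
    ranks = np.asarray(gt_ranks, dtype=np.float64)
    return {
        "MRR": float(np.mean(1.0 / ranks)),
        "R@1": float(np.mean(ranks <= 1)),
        "R@5": float(np.mean(ranks <= 5)),
        "R@10": float(np.mean(ranks <= 10)),
        "Mean Rank": float(np.mean(ranks)),
    }


# Example: three rounds where the correct answer was ranked 1st, 3rd, and 12th.
print(visdial_ranking_metrics([1, 3, 12]))
```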

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)