Recursive Visual Attention in Visual Dialog

Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, Ji-Rong Wen

2018-12-06 | CVPR 2019 | Tasks: Question Answering, Visual Dialog, Visual Question Answering (VQA)

Abstract

Visual dialog is a challenging vision-language task that requires the agent to answer multi-round questions about an image. It typically needs to address two major problems: (1) how to answer visually-grounded questions, which is the core challenge in visual question answering (VQA); (2) how to infer the co-reference between questions and the dialog history. An example of visual co-reference: pronouns (e.g., "they") in the question (e.g., "Are they on or off?") are linked with nouns (e.g., "lamps") appearing in the dialog history (e.g., "How many lamps are there?") and with the object grounded in the image. In this work, to resolve the visual co-reference for visual dialog, we propose a novel attention mechanism called Recursive Visual Attention (RvA). Specifically, our dialog agent browses the dialog history until it has sufficient confidence in the visual co-reference resolution, and refines the visual attention recursively. The quantitative and qualitative experimental results on the large-scale VisDial v0.9 and v1.0 datasets demonstrate that the proposed RvA not only outperforms the state-of-the-art methods, but also achieves reasonable recursion and interpretable attention maps without additional annotations. The code is available at https://github.com/yuleiniu/rva.
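The recursion described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (see the repository linked above for that); it only assumes the control flow the abstract states: answer from the current question when it is unambiguous, otherwise look back in the dialog history and blend in the attention obtained there. The helpers `attend`, `is_confident`, and `pair`, the fixed blending weight `mix`, and the hand-written attention values are all hypothetical stand-ins for learned components.

```python
from typing import Callable, List, Sequence

Attention = List[float]  # one weight per image region, summing to 1


def recursive_attention(
    t: int,
    questions: Sequence[str],
    attend: Callable[[str], Attention],
    is_confident: Callable[[str], bool],
    pair: Callable[[int], int],
    mix: float = 0.5,
) -> Attention:
    """Toy recursion: refine the attention for round t using earlier rounds.

    attend, is_confident, pair, and mix are hypothetical placeholders,
    not names or values taken from the paper.
    """
    own = attend(questions[t])
    # Stop when the question is unambiguous or there is no history left.
    if t == 0 or is_confident(questions[t]):
        return own
    prev = pair(t)  # earlier round this question most likely refers to
    if not 0 <= prev < t:
        raise ValueError("pair(t) must return an earlier round")
    earlier = recursive_attention(prev, questions, attend, is_confident, pair, mix)
    # Blend the current attention with the one inherited from history.
    blended = [mix * a + (1.0 - mix) * b for a, b in zip(own, earlier)]
    total = sum(blended)
    return [w / total for w in blended]


# Hand-made example over three image regions; region 1 holds the lamps.
questions = ["How many lamps are there?", "Are they on or off?"]
toy_attention = {
    questions[0]: [0.1, 0.8, 0.1],  # "lamps" grounds clearly
    questions[1]: [0.4, 0.2, 0.4],  # "they" alone is ambiguous
}

result = recursive_attention(
    t=1,
    questions=questions,
    attend=lambda q: toy_attention[q],
    is_confident=lambda q: "they" not in q.lower().split(),
    pair=lambda t: t - 1,
)
print(result)
```

In this toy run the second question contains a pronoun, so the sketch recurses to the first round and the blended weights concentrate on the lamp region. A learned model would replace the keyword test and the fixed weight with predicted quantities.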

Results

Task	Dataset	Metric	Value	Model
Dialogue	VisDial v0.9 val	MRR	0.6634	RVA
Dialogue	VisDial v0.9 val	R@1	52.71	RVA
Dialogue	VisDial v0.9 val	R@5	82.97	RVA
Dialogue	VisDial v0.9 val	R@10	90.73	RVA
Dialogue	VisDial v0.9 val	Mean Rank	3.93	RVA
Dialogue	VisDial v1.0 test-std	NDCG (x 100)	55.59	RVA
Dialogue	VisDial v1.0 test-std	MRR (x 100)	63.03	RVA
Dialogue	VisDial v1.0 test-std	R@1	49.03	RVA
Dialogue	VisDial v1.0 test-std	R@5	80.4	RVA
Dialogue	VisDial v1.0 test-std	R@10	89.83	RVA
Dialogue	VisDial v1.0 test-std	Mean Rank	4.18	RVA
Visual Dialog	VisDial v0.9 val	MRR	0.6634	RVA
Visual Dialog	VisDial v0.9 val	R@1	52.71	RVA
Visual Dialog	VisDial v0.9 val	R@5	82.97	RVA
Visual Dialog	VisDial v0.9 val	R@10	90.73	RVA
Visual Dialog	VisDial v0.9 val	Mean Rank	3.93	RVA
Visual Dialog	VisDial v1.0 test-std	NDCG (x 100)	55.59	RVA
Visual Dialog	VisDial v1.0 test-std	MRR (x 100)	63.03	RVA
Visual Dialog	VisDial v1.0 test-std	R@1	49.03	RVA
Visual Dialog	VisDial v1.0 test-std	R@5	80.4	RVA
Visual Dialog	VisDial v1.0 test-std	R@10	89.83	RVA
Visual Dialog	VisDial v1.0 test-std	Mean Rank	4.18	RVA
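For readers unfamiliar with the retrieval metrics in the table, the sketch below shows how MRR, R@k, and mean rank are conventionally computed from the 1-based rank that a model assigns to the ground-truth answer in each round. It uses the standard definitions only; the function name `retrieval_metrics` and the example ranks are made up for illustration, and NDCG is omitted because it needs per-candidate relevance scores. Note that the table reports MRR as a fraction on v0.9 and multiplied by 100 on v1.0.

```python
from typing import Dict, Sequence


def retrieval_metrics(ranks: Sequence[int], ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Compute MRR, mean rank, and R@k from 1-based ground-truth ranks."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    metrics = {
        "MRR": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank, in [0, 1]
        "Mean Rank": sum(ranks) / n,             # lower is better
    }
    for k in ks:
        # Recall@k: percentage of rounds whose true answer is in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(1 for r in ranks if r <= k) / n
    return metrics


# Made-up ranks for four dialog rounds (not data from the paper).
example = retrieval_metrics([1, 2, 4, 10])
print(example)
```

With these four made-up ranks the true answer is ranked first in one round out of four, within the top five in three, and within the top ten in all four.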
