Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue

Xiaoze Jiang, Jing Yu, Zengchang Qin, Yingying Zhuang, Xingxing Zhang, Yue Hu, Qi Wu

2019-11-17 · Question Answering · Visual Dialog · Feature Selection · Visual Question Answering (VQA)
Paper · PDF · Code (official)

Abstract

Unlike the Visual Question Answering task, which requires answering only one question about an image, Visual Dialogue involves multiple questions that cover a broad range of visual content and may relate to any objects, relationships, or semantics. The key challenge in Visual Dialogue is thus to learn a more comprehensive, semantically rich image representation that can adaptively attend to the image for different questions. In this research, we propose a novel model that depicts an image from both visual and semantic perspectives. Specifically, the visual view captures appearance-level information, including objects and their relationships, while the semantic view enables the agent to understand high-level visual semantics ranging from the whole image to local regions. Furthermore, on top of these multi-view image features, we propose a feature selection framework that adaptively captures question-relevant information in a hierarchical, fine-grained manner. The proposed method achieves state-of-the-art results on benchmark Visual Dialogue datasets. More importantly, by visualizing the gate values we can tell which modality (visual or semantic) contributes more to answering the current question, which gives us insight into human cognition in Visual Dialogue.
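The abstract describes question-guided attention over two image views (visual and semantic) combined by a learned gate. The following is a minimal NumPy sketch of that general idea, not the authors' released implementation: all weights are random stand-ins for learned parameters, and the dimensions, `attend` helper, and scalar gate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes (not taken from the paper)
d = 64          # feature dimension
n_regions = 36  # visual regions
n_caps = 5      # semantic caption features

# Toy inputs: question embedding plus the two views of the image
q = rng.normal(size=d)
visual = rng.normal(size=(n_regions, d))   # appearance-level view
semantic = rng.normal(size=(n_caps, d))    # high-level semantic view

# Stand-ins for learned projection and gate weights
W_v = rng.normal(size=(d, d)) / np.sqrt(d)
W_s = rng.normal(size=(d, d)) / np.sqrt(d)
w_g = rng.normal(size=2 * d) / np.sqrt(2 * d)

def attend(q, feats, W):
    """Question-guided attention: pool features by relevance to q."""
    scores = feats @ W @ q            # (n,) relevance of each item
    alpha = softmax(scores)           # attention weights
    return alpha @ feats              # (d,) attended summary

v_ctx = attend(q, visual, W_v)
s_ctx = attend(q, semantic, W_s)

# A sigmoid gate blends the two views; inspecting its value shows
# which modality dominates for the current question.
gate = 1.0 / (1.0 + np.exp(-w_g @ np.concatenate([v_ctx, s_ctx])))
fused = gate * v_ctx + (1.0 - gate) * s_ctx
```

In a trained model the gate value itself is what the paper visualizes to attribute an answer to the visual or semantic view.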

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Dialogue / Visual Dialog | VisDial v0.9 val | MRR | 62.94 | DualVD |
| Dialogue / Visual Dialog | VisDial v0.9 val | R@1 | 48.64 | DualVD |
| Dialogue / Visual Dialog | VisDial v0.9 val | R@5 | 80.89 | DualVD |
| Dialogue / Visual Dialog | VisDial v0.9 val | R@10 | 89.94 | DualVD |
| Dialogue / Visual Dialog | VisDial v0.9 val | Mean Rank | 4.17 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | NDCG (x 100) | 56.32 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | MRR (x 100) | 63.23 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | R@1 | 49.25 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | R@5 | 80.23 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | R@10 | 89.70 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | Mean Rank | 4.11 | DualVD |

(Identical results were listed under both the Dialogue and Visual Dialog task pages; they are merged here.)
