Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-View Attention Network for Visual Dialog

Sungjin Park, Taesun Whang, Yeochan Yoon, Heuiseok Lim

2020-04-29 · Visual Dialog
Paper · PDF · Code (official)

Abstract

Visual dialog is a challenging vision-language task in which an agent answers a series of questions grounded in a given image. Resolving the task requires a high-level understanding of various multimodal inputs (e.g., the question, the dialog history, and the image). Specifically, an agent must 1) determine the semantic intent of the question and 2) align question-relevant textual and visual content across the heterogeneous modality inputs. In this paper, we propose the Multi-View Attention Network (MVAN), which leverages multiple views of the heterogeneous inputs through attention mechanisms. MVAN effectively captures question-relevant information from the dialog history with two complementary modules (i.e., Topic Aggregation and Context Matching) and builds multimodal representations through sequential alignment processes (i.e., Modality Alignment). Experimental results on the VisDial v1.0 dataset show the effectiveness of our proposed model, which outperforms previous state-of-the-art methods on all evaluation metrics.
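
The abstract names three attention-based components: Topic Aggregation and Context Matching over the dialog history, followed by Modality Alignment against the image. Below is a minimal PyTorch sketch of that flow, assuming standard scaled dot-product attention; the class `MultiViewAttentionSketch`, the `attend` helper, and the projection layers are illustrative placeholders, not the authors' implementation (see the official code linked above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(query, keys, values):
    """Scaled dot-product attention: query (B, Lq, D), keys/values (B, Lk, D)."""
    scores = query @ keys.transpose(-2, -1) / keys.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ values

class MultiViewAttentionSketch(nn.Module):
    """Hypothetical sketch of MVAN's three views; not the official model."""
    def __init__(self, dim=512):
        super().__init__()
        self.topic_proj = nn.Linear(dim, dim)    # word-level history view
        self.context_proj = nn.Linear(dim, dim)  # sentence-level history view
        self.visual_proj = nn.Linear(dim, dim)   # image-region view
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, question, history_words, history_sents, image_regions):
        # 1) Topic Aggregation (assumed): question tokens attend over
        #    word-level history features to gather topic cues.
        topical = attend(question, self.topic_proj(history_words), history_words)
        # 2) Context Matching (assumed): a pooled question vector attends over
        #    sentence-level history features for utterance-level context.
        q_vec = question.mean(dim=1, keepdim=True)
        context = attend(q_vec, self.context_proj(history_sents), history_sents)
        text = self.fuse(torch.cat([topical, context.expand_as(topical)], -1))
        # 3) Modality Alignment (assumed): history-aware question features
        #    attend over image-region features to ground the question visually.
        grounded = attend(text, self.visual_proj(image_regions), image_regions)
        return text + grounded  # (B, Lq, D) multimodal representation

if __name__ == "__main__":
    B, Lq, Lw, Ls, R, D = 2, 8, 40, 5, 36, 512
    model = MultiViewAttentionSketch(D)
    out = model(torch.randn(B, Lq, D), torch.randn(B, Lw, D),
                torch.randn(B, Ls, D), torch.randn(B, R, D))
    print(out.shape)  # torch.Size([2, 8, 512])
```

The sequential ordering here (history views first, then image) mirrors the "sequential alignment processes" the abstract describes.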

Results

Task           Dataset                      Metric        Value    Model
Visual Dialog  VisDial v0.9 val             MRR           0.6765   MVAN
Visual Dialog  VisDial v0.9 val             Mean Rank     3.73     MVAN
Visual Dialog  VisDial v0.9 val             R@1           54.65    MVAN
Visual Dialog  VisDial v0.9 val             R@5           83.85    MVAN
Visual Dialog  VisDial v0.9 val             R@10          91.47    MVAN
Visual Dialog  Visual Dialog v1.0 test-std  MRR (x 100)   64.84    MVAN
Visual Dialog  Visual Dialog v1.0 test-std  Mean Rank     3.97     MVAN
Visual Dialog  Visual Dialog v1.0 test-std  NDCG (x 100)  59.37    MVAN
Visual Dialog  Visual Dialog v1.0 test-std  R@1           51.45    MVAN
Visual Dialog  Visual Dialog v1.0 test-std  R@5           81.12    MVAN
Visual Dialog  Visual Dialog v1.0 test-std  R@10          90.65    MVAN
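
The metrics above follow the standard VisDial retrieval protocol: the model ranks 100 candidate answers per question; MRR, R@k, and Mean Rank are derived from the 1-based rank of the ground-truth answer, and NDCG (v1.0 only) is computed from dense relevance annotations over the candidates. A minimal NumPy sketch, assuming 1-based ranks and a relevance vector in the model's predicted order:

```python
import numpy as np

def retrieval_metrics(gt_ranks):
    """gt_ranks: 1-based rank of the ground-truth answer for each question."""
    r = np.asarray(gt_ranks, dtype=float)
    return {
        "MRR": float(np.mean(1.0 / r)),       # fractional scale; x100 for v1.0 style
        "R@1": float(np.mean(r <= 1) * 100),
        "R@5": float(np.mean(r <= 5) * 100),
        "R@10": float(np.mean(r <= 10) * 100),
        "Mean Rank": float(np.mean(r)),
    }

def ndcg(relevance_in_predicted_order):
    """NDCG over candidates as ranked by the model; assumes the VisDial
    convention of evaluating at k = number of relevant candidates."""
    rel = np.asarray(relevance_in_predicted_order, dtype=float)
    k = int((rel > 0).sum())
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1/log2(rank+1), rank >= 1
    dcg = float((rel[:k] * discounts).sum())
    ideal = float((np.sort(rel)[::-1][:k] * discounts).sum())
    return dcg / ideal if ideal > 0 else 0.0

print(retrieval_metrics([1, 3, 12, 2]))  # e.g. MRR = (1 + 1/3 + 1/12 + 1/2) / 4
```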

Related Papers

V$^2$Dial: Unification of Video and Visual Dialog via Multimodal Experts (2025-03-03)
Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations (2024-08-13)
ICCV23 Visual-Dialog Emotion Explanation Challenge: SEU_309 Team Technical Report (2024-07-13)
Hawk: Learning to Understand Open-World Video Anomalies (2024-05-27)
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2024-03-27)
FlexCap: Describe Anything in Images in Controllable Detail (2024-03-18)
$\mathbb{VD}$-$\mathbb{GR}$: Boosting $\mathbb{V}$isual $\mathbb{D}$ialog with Cascaded Spatial-Temporal Multi-Modal $\mathbb{GR}$aphs (2023-10-25)