Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue

Xiaoze Jiang, Jing Yu, Zengchang Qin, Yingying Zhuang, Xingxing Zhang, Yue Hu, Qi Wu

2019-11-17 · Question Answering · Visual Dialog · Feature Selection · Visual Question Answering (VQA)
Paper · PDF · Code (official)

Abstract

Unlike the Visual Question Answering task, which requires answering only one question about an image, Visual Dialogue involves multiple questions that cover a broad range of visual content and may relate to any objects, relationships, or semantics. The key challenge in Visual Dialogue is thus to learn a more comprehensive, semantically rich image representation that can adaptively attend to the image for different questions. In this research, we propose a novel model that depicts an image from both visual and semantic perspectives. Specifically, the visual view captures appearance-level information, including objects and their relationships, while the semantic view enables the agent to understand high-level visual semantics ranging from the whole image to local regions. Furthermore, on top of these multi-view image features, we propose a feature selection framework that adaptively captures question-relevant information in a hierarchical, fine-grained manner. The proposed method achieves state-of-the-art results on benchmark Visual Dialogue datasets. More importantly, by visualizing the gate values we can tell which modality (visual or semantic) contributes more to answering the current question, which gives us insight into human cognition in Visual Dialogue.
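The abstract describes question-guided attention over two image views (visual and semantic) combined by a learned gate. The following is a minimal NumPy sketch of that general idea, not the authors' released implementation: all weights are random stand-ins for learned parameters, and the dimensions, `attend` helper, and scalar gate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes (not taken from the paper)
d = 64          # feature dimension
n_regions = 36  # visual regions
n_caps = 5      # semantic caption features

# Toy inputs: question embedding plus the two views of the image
q = rng.normal(size=d)
visual = rng.normal(size=(n_regions, d))   # appearance-level view
semantic = rng.normal(size=(n_caps, d))    # high-level semantic view

# Stand-ins for learned projection and gate weights
W_v = rng.normal(size=(d, d)) / np.sqrt(d)
W_s = rng.normal(size=(d, d)) / np.sqrt(d)
w_g = rng.normal(size=2 * d) / np.sqrt(2 * d)

def attend(q, feats, W):
    """Question-guided attention: pool features by relevance to q."""
    scores = feats @ W @ q            # (n,) relevance of each item
    alpha = softmax(scores)           # attention weights
    return alpha @ feats              # (d,) attended summary

v_ctx = attend(q, visual, W_v)
s_ctx = attend(q, semantic, W_s)

# A sigmoid gate blends the two views; inspecting its value shows
# which modality dominates for the current question.
gate = 1.0 / (1.0 + np.exp(-w_g @ np.concatenate([v_ctx, s_ctx])))
fused = gate * v_ctx + (1.0 - gate) * s_ctx
```

In a trained model the gate value itself is what the paper visualizes to attribute an answer to the visual or semantic view.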

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Dialogue / Visual Dialog | VisDial v0.9 val | MRR | 62.94 | DualVD |
| Dialogue / Visual Dialog | VisDial v0.9 val | R@1 | 48.64 | DualVD |
| Dialogue / Visual Dialog | VisDial v0.9 val | R@5 | 80.89 | DualVD |
| Dialogue / Visual Dialog | VisDial v0.9 val | R@10 | 89.94 | DualVD |
| Dialogue / Visual Dialog | VisDial v0.9 val | Mean Rank | 4.17 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | NDCG (x 100) | 56.32 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | MRR (x 100) | 63.23 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | R@1 | 49.25 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | R@5 | 80.23 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | R@10 | 89.70 | DualVD |
| Dialogue / Visual Dialog | Visual Dialog v1.0 test-std | Mean Rank | 4.11 | DualVD |

(Identical results were listed under both the Dialogue and Visual Dialog task pages; they are merged here.)
