Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Dialog

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra

Published 2016-11-26 · CVPR 2017
Tasks: AI Agent · Visual Dialog · Chatbot · Retrieval
Links: Paper · PDF · Code (official implementation and community implementations)

Abstract

We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from any specific downstream task to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and to benchmark progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and contains one dialog with 10 question-answer pairs on each of ~120k images from COCO, for a total of ~1.2M dialog question-answer pairs. We introduce a family of neural encoder-decoder models for Visual Dialog with three encoders -- Late Fusion, Hierarchical Recurrent Encoder, and Memory Network -- and two decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Putting it all together, we demonstrate the first 'visual chatbot'! Our dataset, code, trained models, and visual chatbot are available at https://visualdialog.org
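The retrieval-based evaluation described above can be sketched as follows. This is an illustrative implementation, not the official evaluation code: it assumes each question comes with a fixed set of candidate answers and that the model has assigned the human response a 1-based rank within that set.

```python
# Sketch of the retrieval-based evaluation protocol (assumed, not the official
# VisDial evaluation code): given the 1-based rank of the human response among
# the candidate answers for each question, compute MRR, R@k, and mean rank.

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute mean reciprocal rank, recall@k (as a percentage), and mean rank.

    ranks: iterable of 1-based ranks of the human response per question.
    """
    ranks = list(ranks)
    n = len(ranks)
    metrics = {
        "MRR": sum(1.0 / r for r in ranks) / n,
        "Mean Rank": sum(ranks) / n,
    }
    for k in ks:
        # Fraction of questions where the human response ranks in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in ranks) / n
    return metrics

# Hypothetical example: ranks of the human answer for four questions.
print(retrieval_metrics([1, 3, 12, 2]))
```

Note that higher is better for MRR and R@k, while lower is better for mean rank, which is why the results tables report both directions.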

Results

Results are listed identically under both the Dialogue and Visual Dialog tasks; the duplicate listing is collapsed below. The source lists two result sets for HRE-QIH-D on v0.9 and two for MN-QIH-D on v1.0; both are kept as given.

VisDial v0.9 val

Model     | MRR    | Mean Rank | R@1   | R@5   | R@10
MN-QIH-D  | 0.5965 | 5.46      | 45.55 | 76.22 | 85.37
HRE-QIH-D | 0.5846 | 5.72      | 44.67 | 74.5  | 84.22
HRE-QIH-D | 0.5807 | 5.78      | 43.82 | 74.68 | 84.07

Visual Dialog v1.0 test-std

Model     | NDCG (x100) | MRR (x100) | R@1   | R@5   | R@10  | Mean Rank
MN-QIH-D  | 47.5        | 55.5       | 40.98 | 72.3  | 83.3  | 5.92
HRE-QIH-D | 45.5        | 54.2       | 39.93 | 70.45 | 81.5  | 6.41
MN-QIH-D  | 45.3        | 55.4       | 40.95 | 72.45 | 82.83 | 5.95

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)