Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-Modal Open-Domain Dialogue

Kurt Shuster, Eric Michael Smith, Da Ju, Jason Weston

2020-10-02 · EMNLP 2021 · Visual Dialog

Paper · PDF

Abstract

Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling in both pre-training data and model size (Adiwardana et al., 2020; Roller et al., 2020). However, if we want to build agents with human-like abilities, we must expand beyond handling just text. A particularly important topic is the ability to see images and communicate about what is perceived. With the goal of engaging humans in multi-modal dialogue, we investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models. We study incorporating different image fusion schemes and domain-adaptive pre-training and fine-tuning strategies, and show that our best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously performing as well as its predecessor (text-only) BlenderBot (Roller et al., 2020) in text-based conversation. We additionally investigate and incorporate safety components in our final model, and show that such efforts do not diminish model performance with respect to engagingness metrics.
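
The abstract describes combining a pre-trained vision model with a Transformer dialogue model via "image fusion schemes". As a rough illustration of one such scheme (late fusion), here is a minimal PyTorch sketch in which pooled image features are linearly projected into the dialogue model's embedding space and prepended to the token sequence as extra tokens. All class names, dimensions, and layer counts below are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal late-fusion sketch: image features become extra "tokens" that the
# text encoder attends over. Dimensions are placeholders, not the paper's.
import torch
import torch.nn as nn

class LateFusionEncoder(nn.Module):
    def __init__(self, vocab_size=8008, d_model=512,
                 img_feat_dim=2048, n_img_tokens=1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Project a pooled vision-encoder vector (e.g. from a ResNeXt-style
        # backbone) into n_img_tokens embeddings in the text model's space.
        self.img_proj = nn.Linear(img_feat_dim, n_img_tokens * d_model)
        self.n_img_tokens = n_img_tokens
        self.d_model = d_model
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, img_feats):
        # token_ids: (batch, seq_len); img_feats: (batch, img_feat_dim)
        txt = self.tok_emb(token_ids)
        img = self.img_proj(img_feats).view(-1, self.n_img_tokens, self.d_model)
        fused = torch.cat([img, txt], dim=1)  # image tokens prepended
        return self.encoder(fused)

enc = LateFusionEncoder()
out = enc(torch.randint(0, 8008, (2, 16)), torch.randn(2, 2048))
print(out.shape)  # torch.Size([2, 17, 512]): 1 image token + 16 text tokens
```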

Results

Task          | Dataset             | Metric  | Value | Model
Dialogue      | BlendedSkillTalk    | BLEU-4  | 1     | Multi-Modal BlenderBot
Dialogue      | BlendedSkillTalk    | F1      | 17.8  | Multi-Modal BlenderBot
Dialogue      | BlendedSkillTalk    | ROUGE-L | 19.3  | Multi-Modal BlenderBot
Dialogue      | EmpatheticDialogues | BLEU-4  | 1.5   | Multi-Modal BlenderBot
Dialogue      | EmpatheticDialogues | F1      | 19.2  | Multi-Modal BlenderBot
Dialogue      | EmpatheticDialogues | ROUGE-L | 24.5  | Multi-Modal BlenderBot
Dialogue      | Image-Chat          | BLEU-4  | 40    | Multi-Modal BlenderBot
Dialogue      | Image-Chat          | F1      | 13.1  | Multi-Modal BlenderBot
Dialogue      | Image-Chat          | ROUGE-L | 18    | Multi-Modal BlenderBot
Dialogue      | ConvAI2             | BLEU-4  | 1.1   | Multi-Modal BlenderBot
Dialogue      | ConvAI2             | F1      | 18.4  | Multi-Modal BlenderBot
Dialogue      | ConvAI2             | ROUGE-L | 22.6  | Multi-Modal BlenderBot
Dialogue      | Wizard of Wikipedia | BLEU-4  | 2.2   | Multi-Modal BlenderBot
Dialogue      | Wizard of Wikipedia | F1      | 18.6  | Multi-Modal BlenderBot
Dialogue      | Wizard of Wikipedia | ROUGE-L | 17.4  | Multi-Modal BlenderBot
Visual Dialog | BlendedSkillTalk    | BLEU-4  | 1     | Multi-Modal BlenderBot
Visual Dialog | BlendedSkillTalk    | F1      | 17.8  | Multi-Modal BlenderBot
Visual Dialog | BlendedSkillTalk    | ROUGE-L | 19.3  | Multi-Modal BlenderBot
Visual Dialog | EmpatheticDialogues | BLEU-4  | 1.5   | Multi-Modal BlenderBot
Visual Dialog | EmpatheticDialogues | F1      | 19.2  | Multi-Modal BlenderBot
Visual Dialog | EmpatheticDialogues | ROUGE-L | 24.5  | Multi-Modal BlenderBot
Visual Dialog | Image-Chat          | BLEU-4  | 40    | Multi-Modal BlenderBot
Visual Dialog | Image-Chat          | F1      | 13.1  | Multi-Modal BlenderBot
Visual Dialog | Image-Chat          | ROUGE-L | 18    | Multi-Modal BlenderBot
Visual Dialog | ConvAI2             | BLEU-4  | 1.1   | Multi-Modal BlenderBot
Visual Dialog | ConvAI2             | F1      | 18.4  | Multi-Modal BlenderBot
Visual Dialog | ConvAI2             | ROUGE-L | 22.6  | Multi-Modal BlenderBot
Visual Dialog | Wizard of Wikipedia | BLEU-4  | 2.2   | Multi-Modal BlenderBot
Visual Dialog | Wizard of Wikipedia | F1      | 18.6  | Multi-Modal BlenderBot
Visual Dialog | Wizard of Wikipedia | ROUGE-L | 17.4  | Multi-Modal BlenderBot
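
For reference, the F1 values above are word-overlap scores between a generated response and the gold response, as is standard in ParlAI-style dialogue evaluation. Below is a minimal sketch of that unigram F1, assuming simple lowercased whitespace tokenization; the exact normalization behind the leaderboard numbers may differ.

```python
# Simplified unigram F1 between a model response and a reference response.
# Illustrative re-implementation, not the exact scoring code behind the table.
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count tokens appearing in both, respecting multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("i love hiking with my dog", "i love walking my dog"))
```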

Related Papers

V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts (2025-03-03)
Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations (2024-08-13)
ICCV23 Visual-Dialog Emotion Explanation Challenge: SEU_309 Team Technical Report (2024-07-13)
Hawk: Learning to Understand Open-World Video Anomalies (2024-05-27)
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2024-03-27)
FlexCap: Describe Anything in Images in Controllable Detail (2024-03-18)
VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs (2023-10-25)