
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts

Yunshui Li, Binyuan Hui, Zhichao Yin, Min Yang, Fei Huang, Yongbin Li

2023-05-24 · Visual Dialog · Dialogue State Tracking · Text Retrieval · Multimodal Intent Recognition · Response Generation · Image Retrieval
Paper · PDF · Code (official)

Abstract

Perceiving multi-modal information and fulfilling dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, due to the limited availability of multi-modal dialogue data, research on multi-modal dialogue pre-training remains scarce. Yet another intriguing challenge emerges from the encompassing nature of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of task may arise at unpredictable points in the future. Hence, a multi-modal dialogue model must be flexible enough to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It combines several fundamental experts to accommodate multiple dialogue-related tasks, and can be pre-trained using limited dialogue data together with extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which previously trained experts assist newly added ones, facilitating the expansion of model capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialogue benchmarks.
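The abstract's two key ideas, a pool of composable experts and a progressive schedule in which earlier experts assist later ones, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch rendering of that idea, not the authors' implementation: the class names, the token-wise gating scheme, and the freeze-old-experts strategy are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class Expert(nn.Module):
    """One expert: a small feed-forward block over token representations."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ff(x)


class ProgressiveExperts(nn.Module):
    """A growing pool of experts combined by a token-wise softmax gate.

    Illustrative sketch only; PaCE's actual expert design and routing
    may differ.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim
        self.experts = nn.ModuleList()
        self.gates = nn.ParameterList()  # one gate vector per expert

    def add_expert(self, freeze_old: bool = True) -> None:
        # "Progressive" step: freeze previously trained experts so they act
        # as fixed assistants while the new expert (and gates) keep training.
        if freeze_old:
            for p in self.experts.parameters():
                p.requires_grad_(False)
        self.experts.append(Expert(self.dim))
        self.gates.append(nn.Parameter(torch.zeros(self.dim)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); one gating logit per expert and token.
        logits = torch.stack([x @ g for g in self.gates], dim=-1)    # (B, S, E)
        weights = logits.softmax(dim=-1)
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, S, D, E)
        return (outputs * weights.unsqueeze(-2)).sum(dim=-1)         # (B, S, D)


model = ProgressiveExperts(dim=256)
model.add_expert(freeze_old=False)  # stage 1: e.g. a grounding expert
# ... pre-train on abundant non-dialogue multi-modal data ...
model.add_expert()                  # stage 2: new expert; old one frozen
out = model(torch.randn(2, 16, 256))  # -> torch.Size([2, 16, 256])
```

Freezing the earlier experts is one simple way to realize "old experts assisting new experts": their learned transformations remain available through the gate while only the newly added capacity adapts to the new task.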

Results

Task                  | Dataset    | Metric                    | Value | Model
Dialogue              | MMConv     | Categorical Accuracy      | 92.2  | PaCE
Dialogue              | MMConv     | Non-Categorical Accuracy  | 43.4  | PaCE
Dialogue              | MMConv     | Overall                   | 39.2  | PaCE
Dialogue              | SIMMC2.0   | Act F1                    | 97.1  | PaCE
Dialogue              | SIMMC2.0   | Slot F1                   | 87    | PaCE
Reading Comprehension | PhotoChat  | F1                        | 63.8  | PaCE
Reading Comprehension | PhotoChat  | Precision                 | 63.3  | PaCE
Reading Comprehension | PhotoChat  | Recall                    | 68    | PaCE
Reading Comprehension | MMDialog   | F1                        | 77.6  | PaCE
Image Retrieval       | PhotoChat  | R@1                       | 15.2  | PaCE
Image Retrieval       | PhotoChat  | R@5                       | 36.7  | PaCE
Image Retrieval       | PhotoChat  | R@10                      | 49.6  | PaCE
Image Retrieval       | PhotoChat  | Sum(R@1,5,10)             | 101.5 | PaCE
Response Generation   | MMConv     | BLEU                      | 22    | PaCE
Response Generation   | MMConv     | Comb.                     | 44.7  | PaCE
Response Generation   | MMConv     | Inform                    | 34.5  | PaCE
Response Generation   | MMConv     | Success                   | 13.9  | PaCE
Response Generation   | SIMMC2.0   | BLEU                      | 34.1  | PaCE
Retrieval             | Image-Chat | R@1                       | 51.9  | PaCE
Retrieval             | Image-Chat | R@5                       | 76.8  | PaCE
Retrieval             | Image-Chat | Sum(R@1,5)                | 128.7 | PaCE
Intent Recognition    | PhotoChat  | F1                        | 63.8  | PaCE
Intent Recognition    | PhotoChat  | Precision                 | 63.3  | PaCE
Intent Recognition    | PhotoChat  | Recall                    | 68    | PaCE
Intent Recognition    | MMDialog   | F1                        | 77.6  | PaCE
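For the retrieval rows above, R@k is the percentage of queries whose ground-truth item ranks in the top k candidates, and Sum(R@1,5,10) is simply the arithmetic sum of the individual recalls (15.2 + 36.7 + 49.6 = 101.5 on PhotoChat; 51.9 + 76.8 = 128.7 on Image-Chat). Below is a minimal sketch of that metric, assuming a square similarity matrix with the ground truth on the diagonal; this is an illustrative convention, not the PaCE evaluation code.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """similarity[i, j]: score of candidate j for query i; the correct
    candidate for query i is assumed to be j == i (the diagonal)."""
    order = np.argsort(-similarity, axis=1)             # candidates, best first
    hits = order == np.arange(len(similarity))[:, None]
    ranks = np.argwhere(hits)[:, 1]                     # rank of ground truth
    return float((ranks < k).mean() * 100)

sim = np.random.randn(100, 100)                         # toy scores
r1, r5, r10 = (recall_at_k(sim, k) for k in (1, 5, 10))
print(r1, r5, r10, r1 + r5 + r10)                       # last value = Sum(R@1,5,10)
```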

Related Papers

FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features (2025-07-11)
MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval (2025-07-09)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval (2025-07-08)
An analysis of vision-language models for fabric retrieval (2025-07-07)
Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model (2025-07-07)