PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts

Yunshui Li, Binyuan Hui, Zhichao Yin, Min Yang, Fei Huang, Yongbin Li

2023-05-24

Tasks: Visual Dialog, Dialogue State Tracking, Text Retrieval, Multimodal Intent Recognition, Response Generation, Image Retrieval

Abstract

Perceiving multi-modal information and conducting dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, because multi-modal dialogue data is limited, research on multi-modal dialogue pre-training remains scarce. A further challenge arises from the broad scope of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of tasks may arise at unpredictable points in the future. Hence, multi-modal dialogue models must be flexible enough to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It combines several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue data and extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which previously trained experts assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialogue benchmarks.

Results

Task	Dataset	Metric	Value	Model
Dialogue	MMConv	Categorical Accuracy	92.2	PaCE
Dialogue	MMConv	Non-Categorical Accuracy	43.4	PaCE
Dialogue	MMConv	Overall	39.2	PaCE
Dialogue	SIMMC2.0	Act F1	97.1	PaCE
Dialogue	SIMMC2.0	Slot F1	87	PaCE
Reading Comprehension	PhotoChat	F1	63.8	PaCE
Reading Comprehension	PhotoChat	Precision	63.3	PaCE
Reading Comprehension	PhotoChat	Recall	68	PaCE
Reading Comprehension	MMDialog	F1	77.6	PaCE
Image Retrieval	PhotoChat	R@1	15.2	PaCE
Image Retrieval	PhotoChat	R@5	36.7	PaCE
Image Retrieval	PhotoChat	R@10	49.6	PaCE
Image Retrieval	PhotoChat	Sum(R@1,5,10)	101.5	PaCE
Response Generation	MMConv	BLEU	22	PaCE
Response Generation	MMConv	Comb.	44.7	PaCE
Response Generation	MMConv	Inform	34.5	PaCE
Response Generation	MMConv	Success	13.9	PaCE
Response Generation	SIMMC2.0	BLEU	34.1	PaCE
Retrieval	Image-Chat	R@1	51.9	PaCE
Retrieval	Image-Chat	R@5	76.8	PaCE
Retrieval	Image-Chat	Sum(R@1,5)	128.7	PaCE
Intent Recognition	PhotoChat	F1	63.8	PaCE
Intent Recognition	PhotoChat	Precision	63.3	PaCE
Intent Recognition	PhotoChat	Recall	68	PaCE
Intent Recognition	MMDialog	F1	77.6	PaCE
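
The Sum(R@1,5,10) and Sum(R@1,5) rows appear to be plain sums of the individual recall rows for the same dataset. The short sketch below checks that reading against the values in the table; the dictionary names and the `recall_sum` helper are illustrative assumptions and are not taken from the paper or its code.

```python
# Recall values copied from the results table above (percentages).
# The variable and function names are hypothetical, used only for this check.
photochat_image_retrieval = {"R@1": 15.2, "R@5": 36.7, "R@10": 49.6}
image_chat_retrieval = {"R@1": 51.9, "R@5": 76.8}


def recall_sum(recalls):
    """Add up the recall@k values and round to one decimal, as the table does."""
    return round(sum(recalls.values()), 1)


print(recall_sum(photochat_image_retrieval))  # compare with the Sum(R@1,5,10) row
print(recall_sum(image_chat_retrieval))       # compare with the Sum(R@1,5) row
```

Under this assumption both computed sums match the reported rows (101.5 for PhotoChat and 128.7 for Image-Chat).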
