Yunshui Li, Binyuan Hui, Zhichao Yin, Min Yang, Fei Huang, Yongbin Li
Perceiving multi-modal information and conducting dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, because multi-modal dialogue data is scarce, there is still little research on multi-modal dialogue pre-training. A further challenge arises from the broad scope of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of tasks may arise at unpredictable points in the future. Hence, multi-modal dialogue models must be flexible enough to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It combines several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue data and extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which previously trained experts assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialogue benchmarks.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Dialogue | MMConv | Categorical Accuracy | 92.2 | PaCE |
| Dialogue | MMConv | Non-Categorical Accuracy | 43.4 | PaCE |
| Dialogue | MMConv | Overall | 39.2 | PaCE |
| Dialogue | SIMMC2.0 | Act F1 | 97.1 | PaCE |
| Dialogue | SIMMC2.0 | Slot F1 | 87 | PaCE |
| Reading Comprehension | PhotoChat | F1 | 63.8 | PaCE |
| Reading Comprehension | PhotoChat | Precision | 63.3 | PaCE |
| Reading Comprehension | PhotoChat | Recall | 68 | PaCE |
| Reading Comprehension | MMDialog | F1 | 77.6 | PaCE |
| Image Retrieval | PhotoChat | R@1 | 15.2 | PaCE |
| Image Retrieval | PhotoChat | R@5 | 36.7 | PaCE |
| Image Retrieval | PhotoChat | R@10 | 49.6 | PaCE |
| Image Retrieval | PhotoChat | Sum(R@1,5,10) | 101.5 | PaCE |
| Response Generation | MMConv | BLEU | 22 | PaCE |
| Response Generation | MMConv | Comb. | 44.7 | PaCE |
| Response Generation | MMConv | Inform | 34.5 | PaCE |
| Response Generation | MMConv | Success | 13.9 | PaCE |
| Response Generation | SIMMC2.0 | BLEU | 34.1 | PaCE |
| Retrieval | Image-Chat | R@1 | 51.9 | PaCE |
| Retrieval | Image-Chat | R@5 | 76.8 | PaCE |
| Retrieval | Image-Chat | Sum(R@1,5) | 128.7 | PaCE |
| Intent Recognition | PhotoChat | F1 | 63.8 | PaCE |
| Intent Recognition | PhotoChat | Precision | 63.3 | PaCE |
| Intent Recognition | PhotoChat | Recall | 68 | PaCE |
| Intent Recognition | MMDialog | F1 | 77.6 | PaCE |
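
The aggregate retrieval rows in the table (the Sum entries for PhotoChat and Image-Chat) appear to be plain sums of the individual recall values. The short Python sketch below checks that arithmetic against the numbers listed above. The dictionary names and the `recall_sum` helper are illustrative assumptions and are not taken from the PaCE codebase.

```python
# Recall values copied from the PaCE rows of the results table above.
# Variable and function names are hypothetical, chosen for illustration only.
photochat_recall = {"R@1": 15.2, "R@5": 36.7, "R@10": 49.6}
image_chat_recall = {"R@1": 51.9, "R@5": 76.8}


def recall_sum(recalls):
    """Add up recall@k values, rounded to one decimal place like the table."""
    return round(sum(recalls.values()), 1)


print(recall_sum(photochat_recall))   # 101.5, matches Sum(R@1,5,10)
print(recall_sum(image_chat_recall))  # 128.7, matches Sum(R@1,5)
```

Rounding to one decimal place avoids spurious floating-point digits when comparing against the reported values.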