UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang

2020-12-31 | ACL 2021 | Cross-Modal Retrieval, Image Captioning, Contrastive Learning

Abstract

Existing pre-training methods focus on either single-modal tasks or multi-modal tasks, and cannot effectively adapt to the other. They can only utilize single-modal data (i.e., text or images) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs. As non-paired single-modal data is abundant, our model can utilize a much larger scale of data to learn more generalizable representations. Moreover, textual knowledge and visual knowledge can enhance each other in the unified semantic space. The experimental results show that UNIMO significantly improves the performance of several single-modal and multi-modal downstream tasks. Our code and pre-trained models are publicly available at the UNIMO project page: https://unimo-ptm.github.io/
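
To make the idea of aligning text and images in one semantic space more concrete, the following is a minimal sketch of a generic image-to-text contrastive (InfoNCE-style) loss. It is not UNIMO's actual CMCL objective, which the abstract does not specify in detail; the function names, the temperature value, and the assumption that row i of each input is a matched image-text pair are all illustrative choices.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Generic image-to-text InfoNCE loss (illustrative, not the paper's CMCL).

    Assumes image_embs[i] and text_embs[i] form a matched pair; every other
    text in the batch is treated as a negative for image i.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity of image i to every text, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the matched text.
        total += log_denom - logits[i]
    return total / len(imgs)
```

With this sketch, a batch whose paired embeddings point in the same direction yields a loss near zero, while a batch whose pairs are swapped yields a large loss, which is the pressure that pulls matched text and image representations together.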

Results

Task	Dataset	Metric	Value	Model
Image Captioning	COCO (Common Objects in Context)	BLEU-4	39.6	UNIMO-large
Image Captioning	COCO (Common Objects in Context)	CIDEr	127.7	UNIMO-large
