Bin Shan, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and their high computational efficiency. These models learn cross-modal representations via contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality. In reality, an image or a text admits many potential views, just as humans can capture a real-world scene through diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a multi-view contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously, aiming to learn a more robust cross-modal representation. Specifically, we construct multiple views within each modality and learn intra-modal correlations to enhance the single-modal representations. Besides the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained on 29M publicly available image-text pairs, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. To generalize our method to Chinese cross-modal tasks, we further train ERNIE-ViL 2.0 after scaling the pre-training data up to 1.5B Chinese image-text pairs, yielding significant improvements over previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models at https://github.com/PaddlePaddle/ERNIE.
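The multi-view objective described above can be sketched as a symmetric InfoNCE (contrastive) loss averaged over every pair of views, covering both inter-modal pairs (image view vs. text view) and intra-modal pairs (two views of the same modality). This is a minimal NumPy sketch under stated assumptions: the function names, the shared temperature, and the unweighted average over view pairs are illustrative choices, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x):
    """Row-wise L2 normalization of a (batch, dim) embedding matrix."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of normalized embeddings
    a, b of shape (batch, dim); row i of a matches row i of b."""
    logits = (a @ b.T) / temperature  # pairwise cosine similarities, scaled
    idx = np.arange(len(a))

    def xent(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the two retrieval directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))

def multi_view_loss(image_views, text_views, temperature=0.07):
    """Average InfoNCE over all distinct view pairs, so inter-modal
    (image vs. text) and intra-modal (image vs. image, text vs. text)
    correlations are trained simultaneously."""
    views = list(image_views) + list(text_views)
    losses = [info_nce(views[i], views[j], temperature)
              for i in range(len(views))
              for j in range(i + 1, len(views))]
    return float(np.mean(losses))
```

With e.g. two image views (two augmentations) and two text views (caption plus an object-tag sequence), the loss averages over all six view pairs, four of which are inter-modal and two intra-modal.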
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 97.2 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 93.3 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 99.4 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.8 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 77.4 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 93.6 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 97.1 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 59.5 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 83.4 | ERNIE-ViL 2.0 |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 90.1 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 91.2 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 99.1 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.8 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 77.4 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 93.8 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 96.4 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 63.1 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 85.7 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 91.4 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 46.0 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 71.4 | ERNIE-ViL 2.0 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 80.4 | ERNIE-ViL 2.0 |
| Image-to-Text Retrieval | Flickr30k | Recall@1 | 96.1 | ERNIE-ViL 2.0 |
| Image-to-Text Retrieval | Flickr30k | Recall@5 | 99.9 | ERNIE-ViL 2.0 |
| Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | ERNIE-ViL 2.0 |
| Image-to-Text Retrieval | AIC-ICC | Recall@1 | 33.7 | ERNIE-ViL 2.0 |
| Image-to-Text Retrieval | AIC-ICC | Recall@5 | 52.1 | ERNIE-ViL 2.0 |
| Image-to-Text Retrieval | AIC-ICC | Recall@10 | 60.0 | ERNIE-ViL 2.0 |
| Image Retrieval | AIC-ICC | Recall@1 | 19.0 | ERNIE-ViL 2.0 |
| Image Retrieval | AIC-ICC | Recall@5 | 35.3 | ERNIE-ViL 2.0 |
| Image Retrieval | AIC-ICC | Recall@10 | 43.5 | ERNIE-ViL 2.0 |