
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

Bin Shan, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang

2022-09-30 · Cross-Modal Retrieval · Zero-Shot Cross-Modal Retrieval · Zero-shot Text-to-Image Retrieval · Contrastive Learning · Image-to-Text Retrieval · Retrieval · Zero-shot Image Retrieval · Image Retrieval
Paper · PDF · Code (official)

Abstract

Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and their high computational efficiency. They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality. In reality, an image or a text contains many potential views, just as humans can capture a real-world scene through diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a multi-view contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously, aiming to learn a more robust cross-modal representation. Specifically, we construct multiple views within each modality to learn intra-modal correlations that strengthen the single-modal representations. Beyond the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained on 29M publicly available image-text pairs, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we train ERNIE-ViL 2.0 by scaling the pre-training data up to 1.5B Chinese image-text pairs, yielding significant improvements over previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models at https://github.com/PaddlePaddle/ERNIE.
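
The objective described in the abstract can be illustrated as a symmetric InfoNCE loss summed over pairs of views. The sketch below is a minimal illustration on precomputed, L2-normalized embeddings; the function names, the temperature value, and the exact pairing scheme are assumptions made for exposition, not the official PaddlePaddle implementation released at https://github.com/PaddlePaddle/ERNIE.

```python
# Minimal sketch of a multi-view contrastive objective (assumed form,
# not the official ERNIE-ViL 2.0 code). Inputs are L2-normalized
# embeddings of shape (batch, dim); row i of one view is the positive
# pair of row i of the other view.
import numpy as np

def info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss between two batches of aligned views."""
    logits = a @ b.T / temperature  # (batch, batch) pairwise similarities

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))   # positives lie on the diagonal

    # cross-entropy in both retrieval directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))

def multi_view_loss(img_v1, img_v2, txt_v1, txt_v2, tag_view) -> float:
    """Sum InfoNCE over inter-modal and intra-modal view pairs.

    The object-tag sequence acts as a special textual view paired with
    the image, per the abstract; the exact pair set is an assumption.
    """
    pairs = [
        (img_v1, txt_v1),    # inter-modal: image vs. caption
        (img_v1, img_v2),    # intra-modal: two image views
        (txt_v1, txt_v2),    # intra-modal: two text views
        (img_v1, tag_view),  # image vs. object-tag textual view
    ]
    return sum(info_nce(a, b) for a, b in pairs)

# Demo with random unit-norm embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
def fake_view(n: int = 8, d: int = 16) -> np.ndarray:
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

print(multi_view_loss(*(fake_view() for _ in range(5))))
```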

Results

Task | Dataset | Metric | Value (%) | Model
Image Retrieval | AIC-ICC | Recall@1 | 19 | ERNIE-ViL 2.0
Image Retrieval | AIC-ICC | Recall@10 | 43.5 | ERNIE-ViL 2.0
Image Retrieval | AIC-ICC | Recall@5 | 35.3 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 97.2 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 100 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 100 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 93.3 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 99.8 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 99.4 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 77.4 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 97.1 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 93.6 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 59.5 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 90.1 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 83.4 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 91.2 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.8 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 99.1 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 77.4 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 96.4 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 93.8 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 63.1 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 91.4 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 85.7 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 46 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 80.4 | ERNIE-ViL 2.0
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 71.4 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 97.2 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 100 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 100 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 93.3 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 99.8 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 99.4 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 77.4 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 97.1 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 93.6 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 59.5 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 90.1 | ERNIE-ViL 2.0
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 83.4 | ERNIE-ViL 2.0
Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 97.2 | ERNIE-ViL 2.0
Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | ERNIE-ViL 2.0
Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | ERNIE-ViL 2.0
Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 93.3 | ERNIE-ViL 2.0
Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.8 | ERNIE-ViL 2.0
Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 99.4 | ERNIE-ViL 2.0
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 77.4 | ERNIE-ViL 2.0
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 97.1 | ERNIE-ViL 2.0
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 93.6 | ERNIE-ViL 2.0
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 59.5 | ERNIE-ViL 2.0
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 90.1 | ERNIE-ViL 2.0
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 83.4 | ERNIE-ViL 2.0
Image-to-Text Retrieval | AIC-ICC | Recall@1 | 33.7 | ERNIE-ViL 2.0
Image-to-Text Retrieval | AIC-ICC | Recall@10 | 60 | ERNIE-ViL 2.0
Image-to-Text Retrieval | AIC-ICC | Recall@5 | 52.1 | ERNIE-ViL 2.0
Image-to-Text Retrieval | Flickr30k | Recall@1 | 96.1 | ERNIE-ViL 2.0
Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | ERNIE-ViL 2.0
Image-to-Text Retrieval | Flickr30k | Recall@5 | 99.9 | ERNIE-ViL 2.0
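
All values above are Recall@K scores: the percentage of queries whose ground-truth match appears among the top K retrieved candidates. The following minimal sketch shows how such numbers are typically computed from a query-item similarity matrix; it assumes exactly one ground-truth item per query, which simplifies the actual benchmark protocols (e.g. COCO and Flickr30k pair each image with five captions).

```python
# Minimal sketch of Recall@K for retrieval evaluation (illustrative,
# not the benchmark's official protocol). `sim` is a
# (num_queries, num_items) similarity matrix where item i is assumed
# to be the single ground-truth match for query i.
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    # indices of the top-k most similar items for every query
    top_k = np.argsort(-sim, axis=1)[:, :k]
    # a hit if the ground-truth index i appears in query i's top-k list
    hits = (top_k == np.arange(len(sim))[:, None]).any(axis=1)
    return 100.0 * hits.mean()  # reported as a percentage, as in the table

# Stand-in for image-text similarities from a dual-encoder model.
sim = np.random.default_rng(0).normal(size=(100, 100))
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```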
