
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

Qingpei Guo, Furong Xu, Hanxiao Zhang, Wang Ren, Ziping Ma, Lin Ju, Jian Wang, Jingdong Chen, Ming Yang

2024-01-29 · Zero-Shot Cross-Modal Retrieval · Zero-shot Text-to-Image Retrieval · Zero-Shot Transfer Image Classification · Zero-Shot Learning · Zero-shot Image Retrieval

Abstract

Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, vision-language models supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. To this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset with over 6 billion image-text pairs, aimed at enhancing multimodal foundation models' understanding of images in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for computing the image-text contrastive loss, which significantly reduces communication overhead and GPU memory demands, yielding a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding on BM-6B; the resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest model, $M^2$-Encoder-10B, achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
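The abstract does not spell out the grouped aggregation scheme, so the sketch below is an assumption rather than the paper's method: a single-GPU illustration of the underlying memory idea, computing a CLIP-style InfoNCE loss over groups of caption columns with a streaming log-sum-exp so the full N×N similarity matrix is never materialized at once. The function name, `group_size`, and the single-direction (image-to-text) loss are all illustrative; the actual approach also targets cross-GPU communication, which this sketch omits.

```python
import torch
import torch.nn.functional as F


def grouped_contrastive_loss(img_emb, txt_emb, logit_scale=100.0, group_size=1024):
    """Image-to-text InfoNCE computed over column groups, so peak memory
    is O(N * group_size) instead of O(N^2). Illustrative sketch only."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    n = img_emb.shape[0]
    labels = torch.arange(n, device=img_emb.device)  # image i matches caption i

    running_max = torch.full((n,), float("-inf"), device=img_emb.device)
    running_sum = torch.zeros(n, device=img_emb.device)
    pos_logit = torch.empty(n, device=img_emb.device)

    for start in range(0, n, group_size):
        # Similarities of every image against one group of captions: (n, g).
        block = logit_scale * img_emb @ txt_emb[start:start + group_size].T
        # Record each image's matching-caption logit when it lies in this group.
        in_group = (labels >= start) & (labels < start + block.shape[1])
        pos_logit[in_group] = block[in_group, labels[in_group] - start]
        # Numerically stable streaming log-sum-exp across groups.
        new_max = torch.maximum(running_max, block.max(dim=1).values)
        running_sum = running_sum * torch.exp(running_max - new_max) \
            + torch.exp(block - new_max.unsqueeze(1)).sum(dim=1)
        running_max = new_max

    # Cross-entropy per image: logsumexp(all logits) - positive logit.
    return (running_max + running_sum.log() - pos_logit).mean()
```

A symmetric text-to-image term, computed the same way with the roles of images and captions swapped, would normally be averaged in to give the full bilateral contrastive objective.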

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Zero-Shot Learning | ImageNet-CN | Accuracy | 80.7 | $M^2$-Encoder |
| Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 88.5 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 91.2 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 99.2 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.6 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 92.2 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 99.5 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 99.7 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 72.8 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 92.3 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 96.3 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 56.5 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 81.6 | $M^2$-Encoder |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 88.8 | $M^2$-Encoder |
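For context on the zero-shot classification rows above, here is a minimal sketch of the standard CLIP-style protocol such benchmarks use: class names are turned into caption prompts, and each image is assigned the class whose text embedding it is most similar to. The encoder and tokenizer callables are hypothetical placeholders, not the released $M^2$-Encoder API.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer, images, class_names):
    """CLIP-style zero-shot classification: rank classes by image-text
    embedding similarity. All callables here are placeholders."""
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)  # (C, d)
    img = F.normalize(image_encoder(images), dim=-1)             # (B, d)
    return (img @ txt.T).argmax(dim=-1)                          # class index per image
```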

Related Papers

GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation (2025-07-14)
An analysis of vision-language models for fabric retrieval (2025-07-07)
EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning (2025-06-26)
Zero-Shot Learning for Obsolescence Risk Forecasting (2025-06-26)
SEZ-HARN: Self-Explainable Zero-shot Human Activity Recognition Network (2025-06-25)
A Multi-Scale Spatial Attention-Based Zero-Shot Learning Framework for Low-Light Image Enhancement (2025-06-23)
Generalizable Agent Modeling for Agent Collaboration-Competition Adaptation with Multi-Retrieval and Dynamic Generation (2025-06-20)