M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

Qingpei Guo, Furong Xu, Hanxiao Zhang, Wang Ren, Ziping Ma, Lin Ju, Jian Wang, Jingdong Chen, Ming Yang

2024-01-29

Tasks: Zero-Shot Cross-Modal Retrieval, Zero-shot Text-to-Image Retrieval, Zero-Shot Transfer Image Classification, Zero-Shot Learning, Zero-shot Image Retrieval


Abstract

Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, vision-language models (VLMs) that support multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. To this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset with over 6 billion image-text pairs, aimed at enabling multimodal foundation models to understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which significantly reduces communication overhead and GPU memory demands, facilitating a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability on BM-6B. The resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
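For readers unfamiliar with the image-text contrastive loss mentioned in the abstract, the following is a minimal NumPy sketch of the standard symmetric (CLIP-style) formulation on a single device. It is not the grouped aggregation approach proposed in the paper, whose details are not given on this page. The function names and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss for a batch of matched pairs.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature default is an illustrative assumption.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j]: scaled similarity between image i and text j.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should put its mass on the diagonal entry.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The full batch-by-batch similarity matrix is the part that grows with batch size, which is why memory and cross-device communication become a concern at large scale.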

Results

Task	Dataset	Metric	Value (%)	Model
Zero-Shot Learning	ImageNet-CN	Accuracy	80.7	M2-Encoder
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@1	91.2	M2-Encoder
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@5	99.2	M2-Encoder
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@10	99.6	M2-Encoder
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@1	92.2	M2-Encoder
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@5	99.5	M2-Encoder
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@10	99.7	M2-Encoder
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@1	72.8	M2-Encoder
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@5	92.3	M2-Encoder
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@10	96.3	M2-Encoder
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@1	56.5	M2-Encoder
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@5	81.6	M2-Encoder
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@10	88.8	M2-Encoder
Zero-Shot Transfer Image Classification	ImageNet	Accuracy (Private)	88.5	M2-Encoder
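
The R@K (Recall@K) retrieval metrics in the table report the percentage of queries whose correct match appears among the top K ranked candidates. A minimal sketch of that computation follows, under the simplifying assumption that query i has exactly one correct candidate, at index i. Flickr30k and COCO pair each image with several captions, so an actual evaluation needs a ground-truth mapping instead of the diagonal. The function name `recall_at_k` is hypothetical.

```python
import numpy as np


def recall_at_k(similarity, ks=(1, 5, 10)):
    """Recall@K in percent from a square query-by-candidate similarity matrix.

    Assumes the correct candidate for query i is candidate i
    (a simplification of the multi-caption benchmark setup).
    """
    n = similarity.shape[0]
    idx = np.arange(n)
    # Score of the correct candidate for each query, as a column vector.
    correct = similarity[idx, idx][:, None]
    # Rank of the correct candidate: how many candidates score strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    # A query is a hit at K when its correct candidate is ranked within the top K.
    return {k: 100.0 * float((ranks < k).mean()) for k in ks}
```

For text-to-image retrieval, rows are text queries and columns are images; passing the transposed matrix gives the image-to-text direction.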
