Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Minzhe Niu, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, Chunjing Xu, Hang Xu

2022-02-14 · Benchmarking · Image-text Retrieval · Image Classification · Zero-Shot Image Classification · Text Retrieval · Contrastive Learning · Retrieval · Zero-shot Image Retrieval · Image Retrieval

Abstract

Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and a benchmarking of different downstream tasks, including a new, largest-to-date human-verified image-text test set, are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than WenLan 2.0. Our Wukong models are also benchmarked against other variants on multiple downstream datasets, e.g., Flickr8K-CN, Flickr30K-CN, COCO-CN, etc. More information is available at: https://wukong-dataset.github.io/wukong-dataset/.
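The abstract names two training objectives: a global image-text contrastive (InfoNCE) loss and FILIP-style token-wise similarity. The following is a minimal NumPy sketch of both, not the authors' code; the function names, toy embeddings, and dimensions are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss used in
    CLIP-style pre-training: matched pairs lie on the diagonal of
    the similarity matrix, and both retrieval directions are averaged."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))  # targets are the diagonal

    # average of image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def token_wise_similarity(img_tokens, txt_tokens):
    """FILIP-style token-wise similarity: each image token is matched
    to its most similar text token (and vice versa), then averaged."""
    sim = img_tokens @ txt_tokens.T  # (n_img_tokens, n_txt_tokens)
    return 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())

# Toy batch of 4 near-matched pairs.  Under locked-image text tuning
# (LiT), the image embeddings would come from a frozen encoder and only
# the text tower would receive gradients from this loss.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))
txt_emb = img_emb + 0.01 * rng.normal(size=(4, 8))
loss = info_nce_loss(img_emb, txt_emb)
```

With near-matched pairs the diagonal logits dominate, so the loss is close to zero; mismatched batches drive it up, which is what pushes paired embeddings together during pre-training.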

Results

| Task            | Dataset        | Metric      | Value | Model             |
|-----------------|----------------|-------------|-------|-------------------|
| Image Retrieval | MUGE Retrieval | Mean Recall | 72.1  | Wukong (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@1         | 52.7  | Wukong (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@10        | 85.6  | Wukong (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@5         | 77.9  | Wukong (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | Mean Recall | 61.2  | Wukong (ViT-B/32) |
| Image Retrieval | MUGE Retrieval | R@1         | 39.2  | Wukong (ViT-B/32) |
| Image Retrieval | MUGE Retrieval | R@10        | 77.4  | Wukong (ViT-B/32) |
| Image Retrieval | MUGE Retrieval | R@5         | 66.9  | Wukong (ViT-B/32) |
| Image Retrieval | Flickr30k-CN   | R@1         | 77.4  | Wukong (ViT-L/14) |
| Image Retrieval | Flickr30k-CN   | R@10        | 97.0  | Wukong (ViT-L/14) |
| Image Retrieval | Flickr30k-CN   | R@5         | 94.5  | Wukong (ViT-L/14) |
| Image Retrieval | Flickr30k-CN   | R@1         | 67.6  | Wukong (ViT-B/32) |
| Image Retrieval | Flickr30k-CN   | R@10        | 94.2  | Wukong (ViT-B/32) |
| Image Retrieval | Flickr30k-CN   | R@5         | 89.6  | Wukong (ViT-B/32) |
| Image Retrieval | COCO-CN        | R@1         | 74.0  | Wukong (ViT-L/14) |
| Image Retrieval | COCO-CN        | R@10        | 98.1  | Wukong (ViT-L/14) |
| Image Retrieval | COCO-CN        | R@5         | 94.4  | Wukong (ViT-L/14) |
| Image Retrieval | COCO-CN        | R@1         | 67.0  | Wukong (ViT-B/32) |
| Image Retrieval | COCO-CN        | R@10        | 96.7  | Wukong (ViT-B/32) |
| Image Retrieval | COCO-CN        | R@5         | 91.4  | Wukong (ViT-B/32) |
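The R@k figures in the table measure the fraction of queries whose ground-truth match appears in the top k retrieved items, and Mean Recall is the average of R@1, R@5, and R@10. A minimal NumPy sketch, with an illustrative 3x3 similarity matrix (not data from the paper):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth item (index i for the
    i-th query) appears among the top-k retrieved items."""
    ranks = np.argsort(-sim, axis=1)  # best match first in each row
    return np.mean([i in ranks[i, :k] for i in range(sim.shape[0])])

# Toy query-by-item similarity matrix; ground truth on the diagonal.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.6, 0.5, 0.4]])  # query 2 ranks its true match last
r1, r5, r10 = (recall_at_k(sim, k) for k in (1, 5, 10))
mean_recall = (r1 + r5 + r10) / 3  # the "Mean Recall" column
```

With only three items per query, R@5 and R@10 are trivially 1.0 here; on a real test set with thousands of candidates the three cutoffs separate, which is why the table reports all of them.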
