Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Minzhe Niu, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, Chunjing Xu, Hang Xu

2022-02-14 · Benchmarking · Image-text Retrieval · Image Classification · Zero-Shot Image Classification · Text Retrieval · Contrastive Learning · Retrieval · Zero-shot Image Retrieval · Image Retrieval

Abstract

Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and a benchmarking of different downstream tasks, including a new, largest-to-date human-verified image-text test set, are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than WenLan 2.0. Our Wukong models are also benchmarked against other variants on multiple downstream datasets, e.g., Flickr8K-CN, Flickr30K-CN, COCO-CN, etc. More information is available at: https://wukong-dataset.github.io/wukong-dataset/.
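The abstract names two training objectives: a global image-text contrastive (InfoNCE) loss and FILIP-style token-wise similarity. The following is a minimal NumPy sketch of both, not the authors' code; the function names, toy embeddings, and dimensions are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss used in
    CLIP-style pre-training: matched pairs lie on the diagonal of
    the similarity matrix, and both retrieval directions are averaged."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))  # targets are the diagonal

    # average of image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def token_wise_similarity(img_tokens, txt_tokens):
    """FILIP-style token-wise similarity: each image token is matched
    to its most similar text token (and vice versa), then averaged."""
    sim = img_tokens @ txt_tokens.T  # (n_img_tokens, n_txt_tokens)
    return 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())

# Toy batch of 4 near-matched pairs.  Under locked-image text tuning
# (LiT), the image embeddings would come from a frozen encoder and only
# the text tower would receive gradients from this loss.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))
txt_emb = img_emb + 0.01 * rng.normal(size=(4, 8))
loss = info_nce_loss(img_emb, txt_emb)
```

With near-matched pairs the diagonal logits dominate, so the loss is close to zero; mismatched batches drive it up, which is what pushes paired embeddings together during pre-training.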

Results

| Task            | Dataset        | Metric      | Value | Model             |
|-----------------|----------------|-------------|-------|-------------------|
| Image Retrieval | MUGE Retrieval | Mean Recall | 72.1  | Wukong (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@1         | 52.7  | Wukong (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@10        | 85.6  | Wukong (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@5         | 77.9  | Wukong (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | Mean Recall | 61.2  | Wukong (ViT-B/32) |
| Image Retrieval | MUGE Retrieval | R@1         | 39.2  | Wukong (ViT-B/32) |
| Image Retrieval | MUGE Retrieval | R@10        | 77.4  | Wukong (ViT-B/32) |
| Image Retrieval | MUGE Retrieval | R@5         | 66.9  | Wukong (ViT-B/32) |
| Image Retrieval | Flickr30k-CN   | R@1         | 77.4  | Wukong (ViT-L/14) |
| Image Retrieval | Flickr30k-CN   | R@10        | 97.0  | Wukong (ViT-L/14) |
| Image Retrieval | Flickr30k-CN   | R@5         | 94.5  | Wukong (ViT-L/14) |
| Image Retrieval | Flickr30k-CN   | R@1         | 67.6  | Wukong (ViT-B/32) |
| Image Retrieval | Flickr30k-CN   | R@10        | 94.2  | Wukong (ViT-B/32) |
| Image Retrieval | Flickr30k-CN   | R@5         | 89.6  | Wukong (ViT-B/32) |
| Image Retrieval | COCO-CN        | R@1         | 74.0  | Wukong (ViT-L/14) |
| Image Retrieval | COCO-CN        | R@10        | 98.1  | Wukong (ViT-L/14) |
| Image Retrieval | COCO-CN        | R@5         | 94.4  | Wukong (ViT-L/14) |
| Image Retrieval | COCO-CN        | R@1         | 67.0  | Wukong (ViT-B/32) |
| Image Retrieval | COCO-CN        | R@10        | 96.7  | Wukong (ViT-B/32) |
| Image Retrieval | COCO-CN        | R@5         | 91.4  | Wukong (ViT-B/32) |
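The R@k figures in the table measure the fraction of queries whose ground-truth match appears in the top k retrieved items, and Mean Recall is the average of R@1, R@5, and R@10. A minimal NumPy sketch, with an illustrative 3x3 similarity matrix (not data from the paper):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth item (index i for the
    i-th query) appears among the top-k retrieved items."""
    ranks = np.argsort(-sim, axis=1)  # best match first in each row
    return np.mean([i in ranks[i, :k] for i in range(sim.shape[0])])

# Toy query-by-item similarity matrix; ground truth on the diagonal.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.6, 0.5, 0.4]])  # query 2 ranks its true match last
r1, r5, r10 = (recall_at_k(sim, k) for k in (1, 5, 10))
mean_recall = (r1 + r5 + r10) / 3  # the "Mean Recall" column
```

With only three items per query, R@5 and R@10 are trivially 1.0 here; on a real test set with thousands of candidates the three cutoffs separate, which is why the table reports all of them.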
