Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CCMB: A Large-scale Chinese Cross-modal Benchmark

Chunyu Xie, Heng Cai, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Jianfei Song, Henrique Morimitsu, Lin Yao, Dexin Wang, Xiangzheng Zhang, Dawei Leng, Baochang Zhang, Xiangyang Ji, Yafeng Deng

Published: 2022-05-08

Tasks: Text-to-Image Generation · Image-Text Retrieval · Image Classification · Image-Text Matching · Text Matching · Zero-Shot Image Classification · Text Retrieval · Retrieval · Image Generation · Zero-Shot Image Retrieval · Image Retrieval

Paper · PDF · Code (official)

Abstract

Vision-language pre-training (VLP) on large-scale datasets has shown strong performance on a variety of downstream tasks. In contrast to the many available benchmarks built on English corpora, large-scale pre-training and downstream datasets with Chinese corpora remain largely unexplored. In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains the currently largest public pre-training dataset, Zero, and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also currently the largest of their kind for Chinese cross-modal downstream tasks. Along with CCMB, we also develop a VLP framework named R2D2, applying a pre-Ranking + Ranking strategy to learn powerful vision-language representations, and a two-way distillation method (i.e., target-guided distillation and feature-guided distillation) to further enhance the learning capability. With Zero and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets spanning five broad categories of tasks: image-text retrieval, image-text matching, image captioning, text-to-image generation, and zero-shot image classification. The datasets, models, and code are available at https://github.com/yuxie11/R2D2
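The pre-Ranking + Ranking strategy mentioned in the abstract follows a common two-stage retrieval pattern: a cheap similarity over precomputed embeddings shortlists candidates, and a more expensive cross-modal scorer re-ranks only that shortlist. A minimal sketch of the pattern (the function names, toy embeddings, and scorers here are illustrative, not R2D2's actual modules):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pre_rank(image_embs, text_emb, k=10):
    """Stage 1 (pre-ranking): score every image with a cheap dot product
    against the query text embedding; keep the k best candidates."""
    order = sorted(range(len(image_embs)),
                   key=lambda i: -dot(image_embs[i], text_emb))
    return order[:k]

def rank(candidates, image_embs, cross_scorer):
    """Stage 2 (ranking): re-score only the shortlist with a more
    expensive cross-modal scorer; return candidates in the new order."""
    return sorted(candidates, key=lambda i: -cross_scorer(image_embs[i]))

# Toy data: five 2-d image embeddings and one 2-d text query.
image_embs = [(0.1, 0.9), (0.8, 0.2), (0.5, 0.5), (0.9, 0.1), (0.2, 0.7)]
text_emb = (1.0, 0.0)  # "prefers" images with a large first coordinate

shortlist = pre_rank(image_embs, text_emb, k=3)  # [3, 1, 2]
# Stand-in for a cross-encoder: here just the same dot product, so the
# re-ranked order happens to match the shortlist order.
reranked = rank(shortlist, image_embs,
                cross_scorer=lambda e: dot(e, text_emb))
```

The point of the two stages is cost: the dot product scales to the full gallery, while the cross-modal scorer (in R2D2, a joint image-text ranking model) only ever sees the top-k shortlist.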

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Image Retrieval | MUGE Retrieval | Mean Recall | 77.5 | R2D2 (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@1 | 60.1 | R2D2 (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@10 | 89.4 | R2D2 (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@5 | 82.9 | R2D2 (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | Mean Recall | 68.7 | R2D2 (ViT-B) |
| Image Retrieval | MUGE Retrieval | R@1 | 47.4 | R2D2 (ViT-B) |
| Image Retrieval | MUGE Retrieval | R@10 | 83.5 | R2D2 (ViT-B) |
| Image Retrieval | MUGE Retrieval | R@5 | 75.1 | R2D2 (ViT-B) |
| Image Retrieval | Flickr30k-CN | R@1 | 84.4 | R2D2 (ViT-L/14) |
| Image Retrieval | Flickr30k-CN | R@10 | 98.4 | R2D2 (ViT-L/14) |
| Image Retrieval | Flickr30k-CN | R@5 | 96.7 | R2D2 (ViT-L/14) |
| Image Retrieval | Flickr30k-CN | R@1 | 78.3 | R2D2 (ViT-B) |
| Image Retrieval | Flickr30k-CN | R@10 | 97.0 | R2D2 (ViT-B) |
| Image Retrieval | Flickr30k-CN | R@5 | 94.6 | R2D2 (ViT-B) |
| Image Retrieval | COCO-CN | R@1 | 79.1 | R2D2 (ViT-L/14) |
| Image Retrieval | COCO-CN | R@10 | 98.9 | R2D2 (ViT-L/14) |
| Image Retrieval | COCO-CN | R@5 | 96.5 | R2D2 (ViT-L/14) |
| Image Retrieval | COCO-CN | R@1 | 75.1 | R2D2 (ViT-B) |
| Image Retrieval | COCO-CN | R@10 | 98.1 | R2D2 (ViT-B) |
| Image Retrieval | COCO-CN | R@5 | 94.2 | R2D2 (ViT-B) |
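The R@K values above are the percentage of queries whose ground-truth item appears among the top-K retrieved results, and Mean Recall is typically the average of R@1, R@5, and R@10. A minimal sketch of the computation on toy data (the function and data are illustrative; the actual evaluation code is in the linked repository):

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Percentage of queries whose ground-truth item id appears in the
    top-k of that query's ranked result list."""
    hits = sum(1 for ranks, gt in zip(ranked_lists, ground_truth)
               if gt in ranks[:k])
    return 100.0 * hits / len(ground_truth)

# Toy example: 4 queries, each with a ranked list of retrieved item ids,
# and the single correct item id per query.
ranked = [[3, 1, 2], [0, 2, 1], [2, 0, 1], [1, 3, 0]]
truth = [3, 1, 0, 2]

r1 = recall_at_k(ranked, truth, 1)  # 25.0: only query 0 hits at rank 1
r2 = recall_at_k(ranked, truth, 2)  # 50.0: queries 0 and 2 hit by rank 2
r3 = recall_at_k(ranked, truth, 3)  # 75.0: query 3's item never appears
mean_recall = (r1 + r2 + r3) / 3    # 50.0
```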

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)