Chunyu Xie, Heng Cai, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Jianfei Song, Henrique Morimitsu, Lin Yao, Dexin Wang, Xiangzheng Zhang, Dawei Leng, Baochang Zhang, Xiangyang Ji, Yafeng Deng
Vision-language pre-training (VLP) on large-scale datasets has shown strong performance on various downstream tasks. In contrast to the many benchmarks available for English, large-scale pre-training datasets and downstream datasets for Chinese remain largely unexplored. In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community. It contains Zero, currently the largest public pre-training dataset, and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also currently the largest ones for Chinese cross-modal downstream tasks. Along with CCMB, we also develop a VLP framework named R2D2, which applies a pre-Ranking + Ranking strategy to learn powerful vision-language representations and a two-way distillation method (i.e., target-guided Distillation and feature-guided Distillation) to further enhance the learning capability. With Zero and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets from five broad categories of tasks: image-text retrieval, image-text matching, image captioning, text-to-image generation, and zero-shot image classification. The datasets, models, and code are available at https://github.com/yuxie11/R2D2
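The abstract names a pre-Ranking + Ranking strategy without giving its details. As a rough illustration only, the sketch below shows a generic two-stage retrieval pattern: a cheap similarity score shortlists candidates, then a more expensive scorer reorders the shortlist. The dot-product pre-ranker, the `fine_score` callable, and all function names are assumptions made for this example, not the R2D2 implementation.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, standing in for an embedding similarity (assumption)."""
    return sum(x * y for x, y in zip(a, b))


def pre_rank(query_vec: Vector, candidate_vecs: Sequence[Vector], top_k: int) -> List[int]:
    """Stage 1: score every candidate cheaply and keep the top_k indices."""
    scores = [dot(query_vec, c) for c in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


def rank(shortlist: Sequence[int], fine_score: Callable[[int], float]) -> List[int]:
    """Stage 2: reorder only the shortlist with a costlier scorer.

    fine_score is a hypothetical callable mapping a candidate index to a
    matching score for the current query.
    """
    return sorted(shortlist, key=fine_score, reverse=True)


# Toy data: one query embedding and four candidate embeddings.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [0.2, 0.2]]

shortlist = pre_rank(query, candidates, top_k=2)

# Hypothetical fine-grained scores for the shortlisted candidates.
toy_fine_scores = {0: 0.4, 2: 0.7}
final_order = rank(shortlist, lambda i: toy_fine_scores[i])
```

The point of the split is cost: the first stage touches every candidate, so it has to be cheap, while the second stage can afford a heavier model because it only sees `top_k` items.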
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Image Retrieval | MUGE Retrieval | R@1 | 60.1 | R2D2 (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@5 | 82.9 | R2D2 (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@10 | 89.4 | R2D2 (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | Mean Recall | 77.5 | R2D2 (ViT-L/14) |
| Image Retrieval | MUGE Retrieval | R@1 | 47.4 | R2D2 (ViT-B) |
| Image Retrieval | MUGE Retrieval | R@5 | 75.1 | R2D2 (ViT-B) |
| Image Retrieval | MUGE Retrieval | R@10 | 83.5 | R2D2 (ViT-B) |
| Image Retrieval | MUGE Retrieval | Mean Recall | 68.7 | R2D2 (ViT-B) |
| Image Retrieval | Flickr30k-CN | R@1 | 84.4 | R2D2 (ViT-L/14) |
| Image Retrieval | Flickr30k-CN | R@5 | 96.7 | R2D2 (ViT-L/14) |
| Image Retrieval | Flickr30k-CN | R@10 | 98.4 | R2D2 (ViT-L/14) |
| Image Retrieval | Flickr30k-CN | R@1 | 78.3 | R2D2 (ViT-B) |
| Image Retrieval | Flickr30k-CN | R@5 | 94.6 | R2D2 (ViT-B) |
| Image Retrieval | Flickr30k-CN | R@10 | 97.0 | R2D2 (ViT-B) |
| Image Retrieval | COCO-CN | R@1 | 79.1 | R2D2 (ViT-L/14) |
| Image Retrieval | COCO-CN | R@5 | 96.5 | R2D2 (ViT-L/14) |
| Image Retrieval | COCO-CN | R@10 | 98.9 | R2D2 (ViT-L/14) |
| Image Retrieval | COCO-CN | R@1 | 75.1 | R2D2 (ViT-B) |
| Image Retrieval | COCO-CN | R@5 | 94.2 | R2D2 (ViT-B) |
| Image Retrieval | COCO-CN | R@10 | 98.1 | R2D2 (ViT-B) |
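
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Recall@K is commonly computed for retrieval, assuming each query comes with a ranked list of candidate ids and a set of relevant ids. The function names and toy data are illustrative and are not taken from the R2D2 codebase. The MUGE rows are consistent with Mean Recall being the average of R@1, R@5, and R@10, which the last lines check.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[Sequence[int]],
                relevant_ids: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranked[:k])
    )
    return 100.0 * hits / len(ranked_ids)


def mean_recall(recalls: Sequence[float]) -> float:
    """Arithmetic mean of several Recall@K values."""
    return sum(recalls) / len(recalls)


# Toy example: three queries, each with a ranking over three candidates.
toy_ranked = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
toy_relevant = [{0}, {1}, {2}]
toy_r1 = recall_at_k(toy_ranked, toy_relevant, k=1)
toy_r2 = recall_at_k(toy_ranked, toy_relevant, k=2)
toy_r3 = recall_at_k(toy_ranked, toy_relevant, k=3)

# Consistency check against the reported MUGE numbers (R@1, R@5, R@10).
muge_vit_l = round(mean_recall([60.1, 82.9, 89.4]), 1)
muge_vit_b = round(mean_recall([47.4, 75.1, 83.5]), 1)
```

Flickr30k-CN and COCO-CN list no Mean Recall in the table, so none is derived for them here.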