Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, Wangchunshu Zhou
Vision-language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments; some others utilize pre-trained object detectors to leverage vision-language alignments at the object level. In this paper, we propose to learn multi-grained vision-language alignments with a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Building on this framework, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text and video-text pre-training in a single model. X$^2$-VLM can learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that X$^2$-VLM performs best at both base and large scale on image-text and video-text tasks, striking a good trade-off between performance and model size. Moreover, the modular design of X$^2$-VLM makes it highly transferable to any language or domain. For example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy (%) | 45.5 | X2-VLM (large) |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy (%) | 45.0 | X2-VLM (base) |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy (%) | 54.6 | X2-VLM (large) |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy (%) | 52.8 | X2-VLM (base) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 81.9 | X2-VLM (large) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 80.4 | X2-VLM (base) |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 81.8 | X2-VLM (large) |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 80.2 | X2-VLM (base) |
| Visual Reasoning | NLVR2 Dev | Accuracy | 88.7 | X2-VLM (large) |
| Visual Reasoning | NLVR2 Dev | Accuracy | 86.2 | X2-VLM (base) |
| Visual Reasoning | NLVR2 Test | Accuracy | 89.4 | X2-VLM (large) |
| Visual Reasoning | NLVR2 Test | Accuracy | 87.0 | X2-VLM (base) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 49.6 | X2-VLM (large) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 84.2 | X2-VLM (large) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 76.7 | X2-VLM (large) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 47.6 | X2-VLM (base) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 84.2 | X2-VLM (base) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 74.1 | X2-VLM (base) |
| Visual Grounding | RefCOCO+ test B | Accuracy (%) | 81.8 | X2-VLM (large) |
| Visual Grounding | RefCOCO+ test B | Accuracy (%) | 78.4 | X2-VLM (base) |
| Visual Grounding | RefCOCO+ val | Accuracy (%) | 87.6 | X2-VLM (large) |
| Visual Grounding | RefCOCO+ val | Accuracy (%) | 85.2 | X2-VLM (base) |
| Visual Grounding | RefCOCO+ testA | Accuracy (%) | 92.1 | X2-VLM (large) |
| Visual Grounding | RefCOCO+ testA | Accuracy (%) | 90.3 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 98.8 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 91.8 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.5 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 98.6 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 98.5 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 90.4 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.3 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 98.2 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 84.4 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 96.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 67.7 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 83.5 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 96.3 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 66.2 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.2 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.1 | X2-VLM (base) |
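The retrieval rows above report Recall@K: the percentage of queries whose ground-truth match appears among the top K retrieved candidates. A minimal sketch of how R@1/R@5/R@10 are computed from a query-candidate similarity matrix is shown below; it assumes a simplified diagonal ground truth (query i matches candidate i), whereas COCO and Flickr30k pair each image with multiple captions in practice.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K from a similarity matrix.

    sim[i, j] = similarity between query i and candidate j; the
    ground-truth match for query i is assumed to be candidate i.
    Returns {K: recall in percent}.
    """
    n = sim.shape[0]
    diag = sim[np.arange(n), np.arange(n)]
    # Rank of the true match = number of candidates scored strictly
    # higher than it (rank 0 means the true match is retrieved first).
    ranks = (sim > diag[:, None]).sum(axis=1)
    return {k: float((ranks < k).mean() * 100) for k in ks}

# Toy 4x4 similarity matrix: queries 0-2 rank their match first,
# query 3 ranks its match second.
sim = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.2, 0.8, 0.1, 0.0],
    [0.3, 0.2, 0.7, 0.1],
    [0.6, 0.1, 0.2, 0.5],
])
print(recall_at_k(sim))  # R@1 = 75.0, R@5 = R@10 = 100.0
```

The same routine covers both retrieval directions in the table: text-to-image uses a (texts x images) similarity matrix, and image-to-text uses its transpose.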