Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, Wangchunshu Zhou
Vision-language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments; some others utilize pre-trained object detectors to leverage vision-language alignments at the object level. In this paper, we propose to learn multi-grained vision-language alignments with a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Building on this framework, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text and video-text pre-training in a single model. X$^2$-VLM can learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that X$^2$-VLM performs best at both base and large scale on image-text and video-text tasks, striking a good trade-off between performance and model size. Moreover, the modular design of X$^2$-VLM makes it highly transferable to any language or domain. For example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy (%) | 45.5 | X2-VLM (large) |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy (%) | 45.0 | X2-VLM (base) |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy (%) | 54.6 | X2-VLM (large) |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy (%) | 52.8 | X2-VLM (base) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 81.9 | X2-VLM (large) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 80.4 | X2-VLM (base) |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 81.8 | X2-VLM (large) |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 80.2 | X2-VLM (base) |
| Visual Reasoning | NLVR2 Dev | Accuracy | 88.7 | X2-VLM (large) |
| Visual Reasoning | NLVR2 Dev | Accuracy | 86.2 | X2-VLM (base) |
| Visual Reasoning | NLVR2 Test | Accuracy | 89.4 | X2-VLM (large) |
| Visual Reasoning | NLVR2 Test | Accuracy | 87.0 | X2-VLM (base) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 49.6 | X2-VLM (large) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 84.2 | X2-VLM (large) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 76.7 | X2-VLM (large) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 47.6 | X2-VLM (base) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 84.2 | X2-VLM (base) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 74.1 | X2-VLM (base) |
| Visual Grounding | RefCOCO+ test B | Accuracy (%) | 81.8 | X2-VLM (large) |
| Visual Grounding | RefCOCO+ test B | Accuracy (%) | 78.4 | X2-VLM (base) |
| Visual Grounding | RefCOCO+ val | Accuracy (%) | 87.6 | X2-VLM (large) |
| Visual Grounding | RefCOCO+ val | Accuracy (%) | 85.2 | X2-VLM (base) |
| Visual Grounding | RefCOCO+ testA | Accuracy (%) | 92.1 | X2-VLM (large) |
| Visual Grounding | RefCOCO+ testA | Accuracy (%) | 90.3 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 98.8 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 91.8 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.5 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 98.6 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 98.5 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 90.4 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.3 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 98.2 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 84.4 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 96.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 67.7 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 83.5 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 96.3 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 66.2 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.2 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.1 | X2-VLM (base) |
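The retrieval rows above report Recall@K: the percentage of queries whose ground-truth match appears among the top K retrieved candidates. A minimal sketch of how R@1/R@5/R@10 are computed from a query-candidate similarity matrix is shown below; it assumes a simplified diagonal ground truth (query i matches candidate i), whereas COCO and Flickr30k pair each image with multiple captions in practice.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K from a similarity matrix.

    sim[i, j] = similarity between query i and candidate j; the
    ground-truth match for query i is assumed to be candidate i.
    Returns {K: recall in percent}.
    """
    n = sim.shape[0]
    diag = sim[np.arange(n), np.arange(n)]
    # Rank of the true match = number of candidates scored strictly
    # higher than it (rank 0 means the true match is retrieved first).
    ranks = (sim > diag[:, None]).sum(axis=1)
    return {k: float((ranks < k).mean() * 100) for k in ks}

# Toy 4x4 similarity matrix: queries 0-2 rank their match first,
# query 3 ranks its match second.
sim = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.2, 0.8, 0.1, 0.0],
    [0.3, 0.2, 0.7, 0.1],
    [0.6, 0.1, 0.2, 0.5],
])
print(recall_at_k(sim))  # R@1 = 75.0, R@5 = R@10 = 100.0
```

The same routine covers both retrieval directions in the table: text-to-image uses a (texts x images) similarity matrix, and image-to-text uses its transpose.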