
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, Wangchunshu Zhou

2022-11-22 · Cross-Modal Retrieval · Visual Grounding · Video Retrieval · Text to Video Retrieval · Video Question Answering · Image Captioning · XLM-R · Visual Reasoning · Visual Question Answering (VQA)

Paper · PDF · Code (official)

Abstract

Vision language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments. Some others utilize pre-trained object detectors to leverage vision language alignments at the object level. In this paper, we propose to learn multi-grained vision language alignments by a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on it, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experiment results show that X$^2$-VLM performs the best on base and large scale for both image-text and video-text tasks, making a good trade-off between performance and model scale. Moreover, we show that the modular design of X$^2$-VLM results in high transferability for it to be utilized in any language or domain. For example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
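
The abstract describes a modular layout: a vision encoder shared across images and video frames, a swappable text encoder (e.g. XLM-R for the multilingual variant), and a cross-modal fusion module trained with alignment, matching, and localization objectives. The sketch below only illustrates that modular structure in PyTorch; every class, method, and dimension here (X2VLMSketch, FusionBlock, the toy heads) is an illustrative assumption and does not reflect the official repository's API.

```python
# Minimal, illustrative sketch of the modular design described in the abstract:
# a vision encoder, a swappable text encoder, and a cross-modal fusion module.
# Names and shapes are assumptions for illustration, NOT the official X2-VLM API.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Cross-attention block fusing text queries with visual features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, vision):
        text = self.n1(text + self.self_attn(text, text, text)[0])
        text = self.n2(text + self.cross_attn(text, vision, vision)[0])
        return self.n3(text + self.ffn(text))

class X2VLMSketch(nn.Module):
    """Vision encoder + pluggable text encoder + fusion module.

    The text encoder is passed in as an argument, mirroring the paper's claim
    that it can be swapped (e.g. for XLM-R) to obtain a multilingual model.
    """
    def __init__(self, text_encoder: nn.Module, dim=256):
        super().__init__()
        # Stand-in vision encoder: in the real model this is a ViT over image
        # patches (or sampled video frames whose features are aggregated).
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.text_encoder = text_encoder
        self.fusion = FusionBlock(dim)
        self.bbox_head = nn.Linear(dim, 4)   # toy localization head (box regression)
        self.itm_head = nn.Linear(dim, 2)    # toy image-text matching head

    def forward(self, patch_embeds, text_embeds):
        v = self.vision_encoder(patch_embeds)
        t = self.text_encoder(text_embeds)
        fused = self.fusion(t, v)
        pooled = fused[:, 0]                 # [CLS]-style pooled token
        return self.itm_head(pooled), self.bbox_head(pooled).sigmoid()

# Toy usage with random features; a real text encoder (e.g. BERT or XLM-R)
# would replace the stand-in below.
text_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, nhead=4, batch_first=True), num_layers=2)
model = X2VLMSketch(text_enc)
itm_logits, boxes = model(torch.randn(2, 196, 256), torch.randn(2, 32, 256))
print(itm_logits.shape, boxes.shape)  # torch.Size([2, 2]) torch.Size([2, 4])
```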

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MSR-VTT-1kA | text-to-video R@1 | 49.6 | X2-VLM (large) |
| Video | MSR-VTT-1kA | text-to-video R@5 | 76.7 | X2-VLM (large) |
| Video | MSR-VTT-1kA | text-to-video R@10 | 84.2 | X2-VLM (large) |
| Video | MSR-VTT-1kA | text-to-video R@1 | 47.6 | X2-VLM (base) |
| Video | MSR-VTT-1kA | text-to-video R@5 | 74.1 | X2-VLM (base) |
| Video | MSR-VTT-1kA | text-to-video R@10 | 84.2 | X2-VLM (base) |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.455 | X2-VLM (large) |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.45 | X2-VLM (base) |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.546 | X2-VLM (large) |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.528 | X2-VLM (base) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 81.9 | X2-VLM (large) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 80.4 | X2-VLM (base) |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 81.8 | X2-VLM (large) |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 80.2 | X2-VLM (base) |
| Visual Reasoning | NLVR2 Dev | Accuracy | 88.7 | X2-VLM (large) |
| Visual Reasoning | NLVR2 Dev | Accuracy | 86.2 | X2-VLM (base) |
| Visual Reasoning | NLVR2 Test | Accuracy | 89.4 | X2-VLM (large) |
| Visual Reasoning | NLVR2 Test | Accuracy | 87 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 98.8 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 91.8 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 98.6 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 99.5 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 98.5 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 90.4 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 98.2 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 99.3 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 84.4 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 96.5 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 67.7 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 87.5 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 92.5 | X2-VLM (large) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 83.5 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 96.3 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 66.2 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 87.1 | X2-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 92.2 | X2-VLM (base) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 49.6 | X2-VLM (large) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 76.7 | X2-VLM (large) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 84.2 | X2-VLM (large) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 47.6 | X2-VLM (base) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 74.1 | X2-VLM (base) |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 84.2 | X2-VLM (base) |
| Visual Grounding | RefCOCO+ val | Accuracy (%) | 87.6 | X2-VLM (large) |
| Visual Grounding | RefCOCO+ val | Accuracy (%) | 85.2 | X2-VLM (base) |
| Visual Grounding | RefCOCO+ testA | Accuracy (%) | 92.1 | X2-VLM (large) |
| Visual Grounding | RefCOCO+ testA | Accuracy (%) | 90.3 | X2-VLM (base) |
| Visual Grounding | RefCOCO+ testB | Accuracy (%) | 81.8 | X2-VLM (large) |
| Visual Grounding | RefCOCO+ testB | Accuracy (%) | 78.4 | X2-VLM (base) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 98.8 | X2-VLM (large) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (large) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (large) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 91.8 | X2-VLM (large) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 98.6 | X2-VLM (large) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 99.5 | X2-VLM (large) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 98.5 | X2-VLM (base) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (base) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (base) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 90.4 | X2-VLM (base) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 98.2 | X2-VLM (base) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 99.3 | X2-VLM (base) |
| Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 84.4 | X2-VLM (large) |
| Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 96.5 | X2-VLM (large) |
| Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (large) |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 67.7 | X2-VLM (large) |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 87.5 | X2-VLM (large) |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 92.5 | X2-VLM (large) |
| Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 83.5 | X2-VLM (base) |
| Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 96.3 | X2-VLM (base) |
| Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (base) |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 66.2 | X2-VLM (base) |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 87.1 | X2-VLM (base) |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 92.2 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 98.8 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 91.8 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 98.6 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.5 | X2-VLM (large) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 98.5 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 90.4 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 98.2 | X2-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99.3 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 84.4 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 96.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 67.7 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.5 | X2-VLM (large) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 83.5 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 96.3 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.5 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 66.2 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.1 | X2-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.2 | X2-VLM (base) |
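
The retrieval entries above report Recall@K (R@1, R@5, R@10): the fraction of queries whose ground-truth match appears among the top-K ranked candidates. Below is a minimal sketch of that computation, assuming a simple one-to-one query-to-candidate mapping; the actual Flickr30k/COCO protocol, where each image has five captions, differs in that detail.

```python
# Minimal sketch of Recall@K as used for the retrieval results above.
# Assumes a similarity matrix where entry [i, j] scores query i against
# candidate j, and query i's ground-truth candidate has index i.
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose correct candidate ranks in the top k."""
    ranked = np.argsort(-sim, axis=1)[:, :k]        # top-k candidate indices per query
    correct = np.arange(sim.shape[0])[:, None]      # ground-truth index per query
    return float((ranked == correct).any(axis=1).mean())

# Toy example: 1000 queries vs. 1000 candidates (roughly the size of a
# text-to-image split); true pairs are nudged to score higher.
rng = np.random.default_rng(0)
sim = rng.standard_normal((1000, 1000))
sim[np.arange(1000), np.arange(1000)] += 3.0
for k in (1, 5, 10):
    print(f"R@{k} = {100 * recall_at_k(sim, k):.1f}")
```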
