Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

Yan Zeng, Xinsong Zhang, Hang Li

Published: 2021-11-16
Tasks: Cross-Modal Retrieval, Visual Grounding, Open Vocabulary Attribute Detection, Referring Expression Segmentation, Image Captioning, Visual Reasoning, Visual Question Answering (VQA), Object Detection, Image Retrieval
Links: Paper · PDF · Code (official)

Abstract

Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. It is challenging for these methods to learn relations among multiple objects. To this end, we propose a new method called X-VLM to perform "multi-grained vision language pre-training." The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts, and in the meantime align the texts with the visual concepts, where the alignments are in multi-granularity. Experimental results show that X-VLM effectively leverages the learned multi-grained alignments in many downstream vision language tasks and consistently outperforms state-of-the-art methods.
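The core idea of the abstract — aligning texts with visual concepts at several granularities (objects, regions, the full image) through a contrastive objective — can be sketched as follows. This is an illustrative sketch, not the authors' released code: the symmetric InfoNCE form, the temperature value, and all function names are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; matching (visual, text) pairs share a row index.

    visual_emb, text_emb: (batch, dim) arrays where row i of each array
    describes the same visual concept (an object, a region, or an image).
    """
    v = visual_emb / np.linalg.norm(visual_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = v @ t.T / temperature                # (batch, batch) similarities

    def xent(l):
        # cross-entropy with the correct pair on the diagonal
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return (xent(logits) + xent(logits.T)) / 2   # vision->text and text->vision

def multi_grained_loss(pairs):
    """Average the alignment loss over (visual, text) pairs of each granularity."""
    return sum(contrastive_loss(v, t) for v, t in pairs) / len(pairs)
```

The multi-grained objective simply applies the same alignment loss at every granularity, so texts describing single objects, image regions, and whole images all pull their visual counterparts together in the shared embedding space.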

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 78.22 | X-VLM (base)
Visual Reasoning | NLVR2 Dev | Accuracy | 84.41 | X-VLM (base)
Visual Reasoning | NLVR2 Test | Accuracy | 84.76 | X-VLM (base)
Image Captioning | COCO Captions | BLEU-4 | 41.3 | X-VLM (base)
Image Captioning | COCO Captions | CIDEr | 140.8 | X-VLM (base)
Image Retrieval | Flickr30K 1K test | R@1 | 86.9 | X-VLM (base)
Image Retrieval | Flickr30K 1K test | R@5 | 97.3 | X-VLM (base)
Image Retrieval | Flickr30K 1K test | R@10 | 98.7 | X-VLM (base)
Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 97.1 | X-VLM (base)
Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | X-VLM (base)
Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | X-VLM (base)
Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 86.9 | X-VLM (base)
Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 97.3 | X-VLM (base)
Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 98.7 | X-VLM (base)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 81.2 | X-VLM (base)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 95.6 | X-VLM (base)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.2 | X-VLM (base)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 63.4 | X-VLM (base)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 85.8 | X-VLM (base)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 91.5 | X-VLM (base)
Visual Grounding | RefCOCO+ val | Accuracy (%) | 84.51 | X-VLM (base)
Visual Grounding | RefCOCO+ testA | Accuracy (%) | 89 | X-VLM (base)
Visual Grounding | RefCOCO+ testB | Accuracy (%) | 76.91 | X-VLM (base)
Open Vocabulary Object Detection | OVAD-Box benchmark | mean average precision | 28 | X-VLM
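The R@k retrieval metrics reported above can be computed from an image-text similarity matrix as follows. The toy similarity values are invented for illustration; the convention that candidate i is the correct match for query i is an assumption of this sketch.

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j] = similarity of query i to candidate j; the correct
    candidate for query i is candidate i. Returns the fraction of
    queries whose correct candidate ranks within the top k."""
    ranks = np.argsort(-sim, axis=1)              # best candidate first
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

sim = np.array([[0.9, 0.1, 0.3],    # query 0: correct candidate at rank 1
                [0.2, 0.4, 0.8],    # query 1: correct candidate at rank 2
                [0.1, 0.7, 0.6]])   # query 2: correct candidate at rank 2
print(recall_at_k(sim, 1))   # prints 0.3333333333333333
print(recall_at_k(sim, 2))   # prints 1.0
```

Image-to-text and text-to-image scores differ because the same matrix is ranked along different axes: rows for image queries, columns (i.e. the transpose) for text queries.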

Related Papers

LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)