Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Grounded Language-Image Pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao

2021-12-07 | CVPR 2022
Tasks: Described Object Detection, Few-Shot Object Detection, Zero-Shot Object Detection, 2D Object Detection, Object Detection

Abstract

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.
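The abstract does not spell out how detection and phrase grounding are unified, so the following is only a minimal sketch of one way to picture it: the class names are joined into a text prompt, and each candidate region is scored against each phrase by a dot product, so that classification becomes region-to-phrase alignment. All function names here (`build_prompt`, `alignment_scores`, `predict_labels`) are hypothetical, and the hand-written 2-D vectors are toy stand-ins for the outputs of real image and text encoders; none of this is taken from the released code.

```python
from typing import List


def build_prompt(class_names: List[str]) -> str:
    """Join detection class names into a single text prompt (assumed format)."""
    return ". ".join(class_names)


def alignment_scores(region_feats: List[List[float]],
                     phrase_feats: List[List[float]]) -> List[List[float]]:
    """Dot product between every region feature and every phrase feature."""
    return [
        [sum(r * p for r, p in zip(region, phrase)) for phrase in phrase_feats]
        for region in region_feats
    ]


def predict_labels(region_feats: List[List[float]],
                   phrase_feats: List[List[float]],
                   class_names: List[str]) -> List[str]:
    """Assign each region the class whose phrase feature it aligns with best."""
    scores = alignment_scores(region_feats, phrase_feats)
    return [class_names[max(range(len(row)), key=row.__getitem__)]
            for row in scores]


# Toy data: one feature vector per class phrase and per candidate region.
classes = ["person", "bicycle", "car"]
prompt = build_prompt(classes)
phrase_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
region_feats = [[0.9, 0.1], [0.1, 0.8]]
labels = predict_labels(region_feats, phrase_feats, classes)
print(prompt)
print(labels)
```

Because the label set lives in the prompt rather than in a fixed classifier head, changing the list of class names is enough to query new categories in this toy setup, which is the intuition behind the zero-shot results reported in the abstract.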

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Phrase Grounding | Flickr30k Entities Test | R@1 | 87.1 | GLIP |
| Phrase Grounding | Flickr30k Entities Test | R@10 | 98.1 | GLIP |
| Phrase Grounding | Flickr30k Entities Test | R@5 | 96.9 | GLIP |
| Object Detection | COCO test-dev | AP50 | 79.5 | GLIP (Swin-L, multi-scale) |
| Object Detection | COCO test-dev | AP75 | 67.7 | GLIP (Swin-L, multi-scale) |
| Object Detection | COCO test-dev | APL | 75 | GLIP (Swin-L, multi-scale) |
| Object Detection | COCO test-dev | APM | 64.9 | GLIP (Swin-L, multi-scale) |
| Object Detection | COCO test-dev | APS | 45.3 | GLIP (Swin-L, multi-scale) |
| Object Detection | COCO test-dev | box mAP | 61.5 | GLIP (Swin-L, multi-scale) |
| Object Detection | COCO-O | Average mAP | 48 | GLIP-L (Swin-L) |
| Object Detection | COCO-O | Effective Robustness | 24.89 | GLIP-L (Swin-L) |
| Object Detection | COCO-O | Average mAP | 29.1 | GLIP-T (Swin-T) |
| Object Detection | COCO-O | Effective Robustness | 8.11 | GLIP-T (Swin-T) |
| Object Detection | ODinW Full-Shot 13 Tasks | AP | 68.9 | GLIP |
| Object Detection | COCO minival | box AP | 60.8 | GLIP (Swin-L, multi-scale) |
| Object Detection | ODinW-35 | Average Score | 38.9 | GLIP-T |
| Object Detection | ODinW-13 | Average Score | 50.7 | GLIP-T |
| Object Detection | LVIS v1.0 minival | AP | 37.3 | GLIP-L |
| Object Detection | LVIS v1.0 val | AP | 26.9 | GLIP-L |
| Object Detection | Description Detection Dataset | Intra-scenario ABS mAP | 21.5 | GLIP-T |
| Object Detection | Description Detection Dataset | Intra-scenario FULL mAP | 19.1 | GLIP-T |
| Object Detection | Description Detection Dataset | Intra-scenario PRES mAP | 18.3 | GLIP-T |
| Few-Shot Object Detection | ODinW-35 | Average Score | 38.9 | GLIP-T |
| Few-Shot Object Detection | ODinW-13 | Average Score | 50.7 | GLIP-T |
| 2D Object Detection | RF100 | Average mAP | 0.112 | GLIP |

Note: the Object Detection rows are also listed, with identical values, under the task labels 2D Object Detection, 3D, 2D Classification, and 16k.

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge (2025-07-08)
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations (2025-07-07)