Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang

2023-03-09 | Referring Expression | Referring Expression Comprehension | Zero Shot Segmentation | Zero-Shot Object Detection | Object Detection

Abstract

In this paper, we present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key to open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean of 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
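
The abstract names a language-guided query selection step but does not say how it works. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each image token is scored by its highest dot-product similarity to any text token, and that the top-scoring tokens become the decoder queries. The function name `select_queries`, its parameters, and the toy feature vectors are all hypothetical.

```python
from typing import List, Sequence


def select_queries(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    num_query: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Assumption (not stated in the abstract): similarity is a plain dot
    product, and each image token is scored by its maximum over text tokens.
    """
    scores = []
    for img in image_feats:
        # Dot product of this image token with every text token.
        sims = [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        # Keep the best match: one strongly matching word is enough.
        scores.append(max(sims))
    # Sort indices by descending score; ties go to the lower index.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:num_query]


# Toy data: 4 image tokens and 2 text tokens, each of dimension 2.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.0, 0.2]]

selected = select_queries(image_feats, text_feats, num_query=2)
print(selected)  # [0, 2]
```

Taking the maximum over text tokens, rather than the mean, lets an image token rank highly when it matches a single word of the prompt. In this sketch `text_feats` must be non-empty, and asking for more queries than there are image tokens simply returns every index.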

Results

Task                    Dataset                   Metric         Value  Model
Object Detection        COCO test-dev             box mAP        63     Grounding DINO
Object Detection        ODinW Full-Shot 13 Tasks  AP             70.9   Grounding DINO
Object Detection        COCO minival              box AP         63     Grounding DINO
Object Detection        LVIS v1.0 minival         AP             33.9   GroundingDINO-L
Object Detection        MSCOCO                    AP             52.5   Grounding DINO-L (without COCO data)
Object Detection        ODinW                     Average Score  26.1   Grounding DINO
3D                      COCO test-dev             box mAP        63     Grounding DINO
3D                      ODinW Full-Shot 13 Tasks  AP             70.9   Grounding DINO
3D                      COCO minival              box AP         63     Grounding DINO
3D                      LVIS v1.0 minival         AP             33.9   GroundingDINO-L
3D                      MSCOCO                    AP             52.5   Grounding DINO-L (without COCO data)
3D                      ODinW                     Average Score  26.1   Grounding DINO
Zero Shot Segmentation  Segmentation in the Wild  Mean AP        46     Grounded-SAM
2D Classification       COCO test-dev             box mAP        63     Grounding DINO
2D Classification       ODinW Full-Shot 13 Tasks  AP             70.9   Grounding DINO
2D Classification       COCO minival              box AP         63     Grounding DINO
2D Classification       LVIS v1.0 minival         AP             33.9   GroundingDINO-L
2D Classification       MSCOCO                    AP             52.5   Grounding DINO-L (without COCO data)
2D Classification       ODinW                     Average Score  26.1   Grounding DINO
2D Object Detection     COCO test-dev             box mAP        63     Grounding DINO
2D Object Detection     ODinW Full-Shot 13 Tasks  AP             70.9   Grounding DINO
2D Object Detection     COCO minival              box AP         63     Grounding DINO
2D Object Detection     LVIS v1.0 minival         AP             33.9   GroundingDINO-L
2D Object Detection     MSCOCO                    AP             52.5   Grounding DINO-L (without COCO data)
2D Object Detection     ODinW                     Average Score  26.1   Grounding DINO
16k                     COCO test-dev             box mAP        63     Grounding DINO
16k                     ODinW Full-Shot 13 Tasks  AP             70.9   Grounding DINO
16k                     COCO minival              box AP         63     Grounding DINO
16k                     LVIS v1.0 minival         AP             33.9   GroundingDINO-L
16k                     MSCOCO                    AP             52.5   Grounding DINO-L (without COCO data)
16k                     ODinW                     Average Score  26.1   Grounding DINO

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
Compress Any Segment Anything Model (SAM) (2025-07-11)
ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge (2025-07-08)