Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang

2023-03-09

Tasks: Referring Expression, Referring Expression Comprehension, Zero Shot Segmentation, Zero-Shot Object Detection, Object Detection

Abstract

In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
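To make the "language-guided query selection" step more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each image token is scored by its highest dot-product similarity to any text token, and that the top-scoring tokens are kept as decoder queries. The function name, the toy feature vectors, and `num_queries` are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def language_guided_query_selection(
    image_feats: List[List[float]],
    text_feats: List[List[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Assumption (illustrative only): a token's score is the maximum
    dot-product similarity between that image token and all text tokens.
    """
    # One score per image token: its best match over all text tokens.
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    # Rank image tokens from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top `num_queries` tokens as decoder queries.
    return ranked[:num_queries]


# Toy example: 4 image tokens and 2 text tokens, each of dimension 2.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.0, 0.2]]
selected = language_guided_query_selection(image_feats, text_feats, 2)
print(selected)
```

In this toy input, image tokens 0 and 2 align best with the first text token, so they are the two selected. A real detector would run the same idea on batched tensors with learned features rather than Python lists.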

Results

Task	Dataset	Metric	Value	Model
Object Detection	COCO test-dev	box mAP	63	Grounding DINO
Object Detection	ODinW Full-Shot 13 Tasks	AP	70.9	Grounding DINO
Object Detection	COCO minival	box AP	63	Grounding DINO
Object Detection	LVIS v1.0 minival	AP	33.9	GroundingDINO-L
Object Detection	MSCOCO	AP	52.5	Grounding DINO-L (without COCO data)
Object Detection	ODinW	Average Score	26.1	Grounding DINO
Zero Shot Segmentation	Segmentation in the Wild	Mean AP	46	Grounded-SAM
