Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang
Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks, including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. Despite its significance, however, the original Grounding-DINO model lacks comprehensive public technical details because its training code is unavailable. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and provide detailed settings for reproduction. Extensive experiments on OVD, PG, and REC benchmarks demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community; code and trained models are available at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | Description Detection Dataset | Intra-scenario ABS mAP | 26 | MM-Grounding-DINO |
| Object Detection | Description Detection Dataset | Intra-scenario FULL mAP | 22.9 | MM-Grounding-DINO |
| Object Detection | Description Detection Dataset | Intra-scenario PRES mAP | 21.9 | MM-Grounding-DINO |