
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang

2024-01-04 · Described Object Detection · Referring Expression · Referring Expression Comprehension · Phrase Grounding

Paper · PDF · Code (official)

Abstract

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details because its training code is unavailable. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. Extensive experiments on these benchmarks demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community; code and trained models are available at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | Description Detection Dataset | Intra-scenario ABS mAP | 26 | MM-Grounding-DINO |
| Object Detection | Description Detection Dataset | Intra-scenario FULL mAP | 22.9 | MM-Grounding-DINO |
| Object Detection | Description Detection Dataset | Intra-scenario PRES mAP | 21.9 | MM-Grounding-DINO |
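For quick comparison, the reported Description Detection Dataset scores can be collected programmatically. This is a minimal sketch, not part of the paper's release; the split names follow the DOD convention of FULL (all descriptions), PRES (presence), and ABS (absence), and the values are transcribed from the table above.

```python
# Reported MM-Grounding-DINO-Tiny intra-scenario mAP on the
# Description Detection Dataset, transcribed from the results table.
dod_map = {
    "FULL": 22.9,  # all descriptions
    "PRES": 21.9,  # presence descriptions
    "ABS": 26,     # absence descriptions
}

# Scenario with the highest reported mAP.
best = max(dod_map, key=dod_map.get)
print(best, dod_map[best])  # ABS 26
```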

Related Papers

- Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
- Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models (2025-06-26)
- Referring Expression Instance Retrieval and A Strong End-to-End Baseline (2025-06-23)
- Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation (2025-06-12)
- Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models (2025-06-12)
- Synthetic Visual Genome (2025-06-09)
- From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes (2025-06-05)
- Refer to Anything with Vision-Language Prompts (2025-06-05)