Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, YaoWei Wang, Xiangyuan Lan, Xiaodan Liang
Open-vocabulary detection is a challenging task that requires detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches face two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose OV-DINO, a novel unified open-vocabulary detection method that is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline that enables end-to-end training and eliminates noise from pseudo-label generation by unifying different data sources into a detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module that enhances cross-modality alignment through a language-aware query selection and fusion process. We evaluate OV-DINO on popular open-vocabulary detection benchmarks, where it achieves state-of-the-art zero-shot results of 50.6% AP on the COCO benchmark and 40.1% AP on the LVIS benchmark, demonstrating its strong generalization ability. Furthermore, OV-DINO fine-tuned on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.
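To illustrate the core idea behind language-aware query selection, the sketch below scores each region query by its best cosine similarity against the text (class-name) embeddings and keeps the top-k matches. This is a minimal, dependency-free illustration of the general technique, not the paper's actual LASF implementation; the function names and plain-list representation of embeddings are assumptions for clarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def language_aware_query_selection(region_embs, text_embs, k):
    """Score each region query by its best match against any class-name
    embedding, then keep the indices of the top-k highest-scoring queries."""
    scores = [max(cosine(r, t) for t in text_embs) for r in region_embs]
    ranked = sorted(range(len(region_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Example: with one class embedding [1, 0], the region aligned with it
# (index 1) and the partially aligned one (index 2) are selected.
print(language_aware_query_selection(
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]], [[1.0, 0.0]], 2))  # → [1, 2]
```

In the full model, the selected queries would then be fused with the text embeddings inside the decoder; this sketch only covers the selection step.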
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | LVIS v1.0 minival | AP | 40.1 | OV-DINO-T (Swin-T, without LVIS data) |
| Object Detection | MSCOCO | AP | 50.6 | OV-DINO-T (Swin-T, without COCO data) |
| Object Detection | LVIS v1.0 val | AP | 32.9 | OV-DINO-T (Swin-T, without LVIS data) |