Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, YaoWei Wang, Xiangyuan Lan, Xiaodan Liang
Open-vocabulary detection is a challenging task that requires detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches face two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose OV-DINO, a novel unified open-vocabulary detection method that is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline that enables end-to-end training and eliminates noise from pseudo-label generation by unifying different data sources into a detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module that enhances cross-modality alignment through a language-aware query selection and fusion process. We evaluate OV-DINO on popular open-vocabulary detection benchmarks, where it achieves state-of-the-art zero-shot results of 50.6% AP on the COCO benchmark and 40.1% AP on the LVIS benchmark, demonstrating its strong generalization ability. Furthermore, OV-DINO fine-tuned on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.
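To illustrate the core idea behind language-aware query selection, the sketch below scores each region query by its best cosine similarity against the text (class-name) embeddings and keeps the top-k matches. This is a minimal, dependency-free illustration of the general technique, not the paper's actual LASF implementation; the function names and plain-list representation of embeddings are assumptions for clarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def language_aware_query_selection(region_embs, text_embs, k):
    """Score each region query by its best match against any class-name
    embedding, then keep the indices of the top-k highest-scoring queries."""
    scores = [max(cosine(r, t) for t in text_embs) for r in region_embs]
    ranked = sorted(range(len(region_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Example: with one class embedding [1, 0], the region aligned with it
# (index 1) and the partially aligned one (index 2) are selected.
print(language_aware_query_selection(
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]], [[1.0, 0.0]], 2))  # → [1, 2]
```

In the full model, the selected queries would then be fused with the text embeddings inside the decoder; this sketch only covers the selection step.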
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | LVIS v1.0 minival | AP | 40.1 | OV-DINO-T (Swin-T, without LVIS data) |
| Object Detection | MSCOCO | AP | 50.6 | OV-DINO-T (Swin-T, without COCO data) |
| Object Detection | LVIS v1.0 val | AP | 32.9 | OV-DINO-T (Swin-T, without LVIS data) |