Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang

2024-07-10 · Zero-Shot Object Detection · Object Detection
Paper · PDF · Code (official)

Abstract

Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into a detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enhance the cross-modality alignment through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.
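The abstract describes LASF as a "language-aware query selection and fusion process". The sketch below illustrates the general idea only: rank object queries by similarity to a text-prompt embedding, keep the best-matching ones, and inject the text signal into them. All function names, shapes, and the residual-addition fusion are illustrative assumptions, not the paper's actual implementation (see the official repository for that).

```python
import numpy as np

def language_aware_select(queries: np.ndarray,
                          text_emb: np.ndarray,
                          k: int) -> np.ndarray:
    """Select the k object queries most similar to the text embedding.

    queries:  (num_queries, dim) candidate object queries
    text_emb: (dim,) pooled embedding of the class-name prompt
    """
    # Cosine similarity between each query and the text embedding.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sim = q @ t                           # (num_queries,)
    top_k = np.argsort(sim)[::-1][:k]     # indices of best-matching queries
    return queries[top_k]

def fuse(selected: np.ndarray, text_emb: np.ndarray,
         alpha: float = 0.5) -> np.ndarray:
    # Toy residual fusion: add scaled text information to each query.
    return selected + alpha * text_emb

rng = np.random.default_rng(0)
queries = rng.normal(size=(900, 256))    # e.g. 900 DETR-style queries
text_emb = rng.normal(size=256)
fused = fuse(language_aware_select(queries, text_emb, k=300), text_emb)
print(fused.shape)                        # (300, 256)
```

In the actual model, the selection and fusion would operate on learned embeddings inside the transformer decoder rather than on raw arrays as shown here.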

Results

Task | Dataset | Metric | Value | Model
Object Detection | LVIS v1.0 minival | AP | 40.1 | OV-DINO-T (without LVIS data, swin tiny)
Object Detection | MSCOCO | AP | 50.6 | OV-DINO-T (without COCO data)
Object Detection | LVIS v1.0 val | AP | 32.9 | OV-DINO-T (without LVIS data, swin tiny)

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge (2025-07-08)
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations (2025-07-07)