
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Open-Vocabulary DETR with Conditional Matching

Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy

2022-03-22 | Open Vocabulary Object Detection, Object Detection, Language Modelling

Abstract

Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector so that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and a better user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR (hence the name OV-DETR) which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge in turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as a binary matching one between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondence that generalizes to unseen queries during testing. For training, we choose to condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model like CLIP, in order to enable matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR, the first end-to-end Transformer-based open-vocabulary detector, achieves non-trivial improvements over the current state of the art.
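
The binary matching idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it only assumes that, for one conditional query (a class name or exemplar embedding), each prediction carries a single "matches the query" probability instead of a per-class score, and that the matching cost combines that probability with an L1 box distance. The function names, the `box_weight` value, and the brute-force search (a stand-in for Hungarian matching, practical only for tiny inputs) are all illustrative assumptions.

```python
from itertools import permutations


def l1_box_cost(pred_box, gt_box):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def conditional_match(match_probs, pred_boxes, gt_boxes, box_weight=5.0):
    """Assign each ground-truth box of the conditioned class to one prediction.

    match_probs[i] is the predicted probability that prediction i matches the
    conditional query. No per-class scores are needed, so the cost can be
    computed for a class that has no labeled training images.
    Returns a list where entry j is the prediction index matched to gt box j.
    """
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as ground-truth boxes")

    # cost[i][j]: lower is better; a high match probability lowers the cost.
    cost = [
        [-match_probs[i] + box_weight * l1_box_cost(pred_boxes[i], gt_boxes[j])
         for j in range(n_gt)]
        for i in range(n_pred)
    ]

    # Exhaustive search over one-to-one assignments (hypothetical stand-in
    # for the Hungarian algorithm used by DETR-style detectors).
    best, best_cost = (), float("inf")
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best)


# Toy example: three predictions conditioned on one query, two ground-truth boxes.
probs = [0.9, 0.2, 0.7]
preds = [[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1], [0.8, 0.8, 0.3, 0.3]]
gts = [[0.8, 0.8, 0.3, 0.3], [0.5, 0.5, 0.2, 0.2]]
print(conditional_match(probs, preds, gts))  # [2, 0]
```

In the toy example, the first ground-truth box is matched to the third prediction and the second to the first prediction; the low-probability, poorly localized second prediction is left unmatched.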

Results

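| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Object Detection | MSCOCO | AP 0.5 | 29.4 | OV-DETR |
| 3D | MSCOCO | AP 0.5 | 29.4 | OV-DETR |
| 2D Classification | MSCOCO | AP 0.5 | 29.4 | OV-DETR |
| 2D Object Detection | MSCOCO | AP 0.5 | 29.4 | OV-DETR |
| Open Vocabulary Object Detection | MSCOCO | AP 0.5 | 29.4 | OV-DETR |
| 16k | MSCOCO | AP 0.5 | 29.4 | OV-DETR |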

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)