Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan

2022-07-07 · Open Vocabulary Attribute Detection · Zero-Shot Object Detection · Open Vocabulary Object Detection

Abstract

Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps them generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) are pretrained CLIP models and image-level supervision. We note that neither mode of supervision is optimally aligned with the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object- and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 36.6 AP50 on novel classes, an absolute 8.2-point gain over the previous best performance. For LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP for rare categories and 3.4 overall. Code: https://github.com/hanoonaR/object-centric-ovd.
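The weight transfer function described in the abstract couples the two alignment strategies: weights learned under image-level supervision are mapped into the region-level CLIP-alignment branch instead of the two being trained independently. A minimal, hedged sketch of that idea; the function name, shapes, and the linear form of the transfer are illustrative assumptions, not the authors' actual implementation:

```python
# Hedged sketch: a learned transfer maps the weights of the image-level-
# supervision (ILS) branch into the region-level alignment branch, so the
# branches share complementary strengths. All names and shapes here are
# illustrative assumptions, not the paper's actual API.

def weight_transfer(w_ils, transfer_matrix):
    """Map ILS-branch weights into the region branch via a linear transfer."""
    return [sum(t * w for t, w in zip(row, w_ils)) for row in transfer_matrix]

w_ils = [0.5, -1.0, 2.0]                       # weights learned under image-level supervision
transfer = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # identity transfer, for demonstration only
w_region = weight_transfer(w_ils, transfer)    # region branch conditioned on the ILS branch
print(w_region)  # [0.5, -1.0, 2.0]
```

In the paper's setting the transfer would be learned jointly with both branches; the identity matrix above merely makes the mapping easy to inspect.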

Results

Task | Dataset | Metric | Value | Model
Object Detection | Objects365 | mask AP50 | 22.3 | Object-Centric-OVD
Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 21.1 | Object-Centric-OVD
Object Detection | MSCOCO | AP 0.5 | 36.9 | Object-Centric-OVD
Object Detection | OpenImages-v4 | mask AP50 | 42.9 | Object-Centric-OVD
Object Detection | OVAD benchmark | mean average precision | 14.6 | Object-Centric-OVD (ResNet50)
Object Detection | MSCOCO | AP | 40.5 | Object-Centric-OVD
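The AP50 and "AP 0.5" metrics in the table count a predicted box as a true positive when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal IoU check in Python (box format and values are illustrative):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred = (0, 0, 10, 10)   # predicted box
gt = (2, 0, 10, 10)     # ground-truth box
print(iou(pred, gt))    # 0.8 → counted as correct at the AP50 threshold
```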

Related Papers

ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction (2025-06-10)
Gen-n-Val: Agentic Image Data Generation and Validation (2025-06-05)
From Data to Modeling: Fully Open-vocabulary Scene Graph Generation (2025-05-26)
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning (2025-05-17)
FG-CLIP: Fine-Grained Visual and Textual Alignment (2025-05-08)
Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety (2025-04-18)
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (2025-04-10)
Finding the Reflection Point: Unpadding Images to Remove Data Augmentation Artifacts in Large Open Source Image Datasets for Machine Learning (2025-04-04)