Papers With Code 2


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Scaling Open-Vocabulary Object Detection

Matthias Minderer, Alexey Gritsenko, Neil Houlsby

2023-06-16 · NeurIPS 2023

Tasks: Image Classification, Zero-Shot Object Detection, Open-Vocabulary Object Detection, Object Detection, Language Modelling

Paper · PDF · Code (official)

Abstract

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
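The self-training loop the abstract describes (an existing detector pseudo-annotates web image-text pairs, low-confidence boxes are filtered, and the survivors become detection training data) can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the `(box, label, score)` detection format, and the caption-word label space are all hypothetical assumptions.

```python
# Illustrative sketch of the OWL-ST self-training recipe from the abstract.
# Assumed interface: `detector(image, queries)` returns (box, label, score)
# tuples. All names here are hypothetical, not the authors' code.

def pseudo_annotate(detector, image, queries, score_threshold=0.3):
    """Keep only confident pseudo-boxes predicted for the given text queries."""
    detections = detector(image, queries)  # each item: (box, label, score)
    return [(box, label) for box, label, score in detections
            if score >= score_threshold]

def build_self_training_set(detector, image_text_pairs, score_threshold=0.3):
    """Turn weakly supervised image-text pairs into pseudo-labeled detection data."""
    dataset = []
    for image, caption in image_text_pairs:
        # Label-space choice (one of the scaling challenges the abstract names):
        # here we naively use the caption's words as detection queries.
        queries = [w.strip(".,").lower() for w in caption.split()]
        boxes = pseudo_annotate(detector, image, queries, score_threshold)
        if boxes:  # discard images with no confident pseudo-annotations
            dataset.append((image, boxes))
    return dataset
```

The score threshold stands in for the paper's pseudo-annotation filtering step; in practice the choice of label space and filter is what the OWL-ST recipe tunes to make self-training work at the billion-example scale.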

Results

Task              Dataset            Metric  Value  Model
Object Detection  LVIS v1.0 minival  AP      51.3   OWLv2 (OWL-ST+FT)
Object Detection  LVIS v1.0 val      AP      47.0   OWLv2 (OWL-ST+FT)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)