Papers With Code 2


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Scaling Open-Vocabulary Object Detection

Matthias Minderer, Alexey Gritsenko, Neil Houlsby

2023-06-16 · NeurIPS 2023

Tasks: Image Classification, Zero-Shot Object Detection, Open-Vocabulary Object Detection, Object Detection, Language Modelling

Paper · PDF · Code (official)

Abstract

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
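The self-training loop the abstract describes (an existing detector pseudo-annotates web image-text pairs, low-confidence boxes are filtered, and the survivors become detection training data) can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the `(box, label, score)` detection format, and the caption-word label space are all hypothetical assumptions.

```python
# Illustrative sketch of the OWL-ST self-training recipe from the abstract.
# Assumed interface: `detector(image, queries)` returns (box, label, score)
# tuples. All names here are hypothetical, not the authors' code.

def pseudo_annotate(detector, image, queries, score_threshold=0.3):
    """Keep only confident pseudo-boxes predicted for the given text queries."""
    detections = detector(image, queries)  # each item: (box, label, score)
    return [(box, label) for box, label, score in detections
            if score >= score_threshold]

def build_self_training_set(detector, image_text_pairs, score_threshold=0.3):
    """Turn weakly supervised image-text pairs into pseudo-labeled detection data."""
    dataset = []
    for image, caption in image_text_pairs:
        # Label-space choice (one of the scaling challenges the abstract names):
        # here we naively use the caption's words as detection queries.
        queries = [w.strip(".,").lower() for w in caption.split()]
        boxes = pseudo_annotate(detector, image, queries, score_threshold)
        if boxes:  # discard images with no confident pseudo-annotations
            dataset.append((image, boxes))
    return dataset
```

The score threshold stands in for the paper's pseudo-annotation filtering step; in practice the choice of label space and filter is what the OWL-ST recipe tunes to make self-training work at the billion-example scale.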

Results

Task              Dataset            Metric  Value  Model
Object Detection  LVIS v1.0 minival  AP      51.3   OWLv2 (OWL-ST+FT)
Object Detection  LVIS v1.0 val      AP      47.0   OWLv2 (OWL-ST+FT)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)