Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Simple Open-Vocabulary Object Detection with Vision Transformers

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby

2022-05-12 | Described Object Detection | Image Classification | Open Vocabulary Object Detection | Object Detection | One-Shot Object Detection

Abstract

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
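
To make the query-conditioned setting in the abstract concrete, here is a minimal sketch of how detected boxes could be labeled by comparing each box's embedding with a set of query embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `cosine_similarity`, `classify_boxes`), the `threshold` parameter, the use of cosine similarity, and the toy vectors are all hypothetical. In a real system the queries would come from a text encoder (text-conditioned) or from an embedding of an example image (image-conditioned).

```python
import math
from typing import List, Optional, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def classify_boxes(
    box_embeddings: Sequence[Sequence[float]],
    query_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[Optional[int], float]]:
    """Return (best query index or None, best score) for every box embedding.

    A box whose best similarity falls below `threshold` matches no query.
    """
    results = []
    for box in box_embeddings:
        scores = [cosine_similarity(box, q) for q in query_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        label = best if scores[best] >= threshold else None
        results.append((label, scores[best]))
    return results


# Toy stand-ins: two query embeddings and three box embeddings.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
boxes = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
predictions = classify_boxes(boxes, queries)
```

Because the label set is just the list of query embeddings supplied at inference time, new categories can be added without retraining a fixed classification layer, which is what makes the vocabulary "open" in this sketch.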

Results

Task | Dataset | Metric | Value | Model
Object Detection | LVIS v1.0 | AP novel-LVIS base training | 25.6 | OWL-ViT (CLIP-L/14)
Object Detection | LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 31.2 | OWL-ViT (CLIP-L/14)
Object Detection | COCO (Common Objects in Context) | AP 0.5 | 41.8 | OWL-ViT (R50+H/32)
Object Detection | Description Detection Dataset | Intra-scenario ABS mAP | 8.8 | OWL-ViT-base
Object Detection | Description Detection Dataset | Intra-scenario FULL mAP | 8.6 | OWL-ViT-base
Object Detection | Description Detection Dataset | Intra-scenario PRES mAP | 8.5 | OWL-ViT-base
Open Vocabulary Object Detection | LVIS v1.0 | AP novel-LVIS base training | 25.6 | OWL-ViT (CLIP-L/14)
Open Vocabulary Object Detection | LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 31.2 | OWL-ViT (CLIP-L/14)

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)