Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He

2022-03-30 | Instance Segmentation | object-detection | Cross-Domain Few-Shot Object Detection | Object Detection

Abstract

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
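The two observations in the abstract can be illustrated with a small shape-level sketch. It uses no learned weights and is not the Detectron2 implementation. The patch size of 16, the scale factors (4, 2, 1, 0.5), the window size, and the rule that places a cross-window block at the end of each of four equal groups are all assumptions chosen for illustration, as are the function names.

```python
# Shape-level sketch only: no weights, no real attention, hypothetical helper names.

def simple_pyramid_shapes(image_hw, patch_size=16, scale_factors=(4.0, 2.0, 1.0, 0.5)):
    """Sizes of pyramid levels derived from ONE single-scale ViT feature map.

    patch_size and scale_factors are illustrative assumptions.
    """
    h, w = image_hw
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    base_h, base_w = h // patch_size, w // patch_size
    levels = []
    for s in scale_factors:
        # Every level is resampled from the same base map (no top-down FPN path).
        levels.append({
            "stride": int(patch_size / s),
            "height": int(base_h * s),
            "width": int(base_w * s),
        })
    return levels


def window_partition(grid_h, grid_w, window):
    """Group flat token indices of a grid into non-overlapping, unshifted windows."""
    if grid_h % window or grid_w % window:
        raise ValueError("grid must be divisible by window (padding is omitted here)")
    windows = []
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            windows.append([
                (top + dy) * grid_w + (left + dx)
                for dy in range(window)
                for dx in range(window)
            ])
    return windows


def attention_pairs(grid_h, grid_w, window=None):
    """Count query-key pairs for global attention (window=None) or window attention."""
    n = grid_h * grid_w
    if window is None:
        return n * n
    return sum(len(w) ** 2 for w in window_partition(grid_h, grid_w, window))


def block_schedule(depth, num_global=4):
    """Label each block; the last block of each equal group attends across windows.

    The placement rule is an assumption made for this sketch.
    """
    if depth % num_global:
        raise ValueError("depth must be divisible by num_global")
    step = depth // num_global
    return ["global" if (i + 1) % step == 0 else "window" for i in range(depth)]


if __name__ == "__main__":
    for level in simple_pyramid_shapes((1024, 1024)):
        print(level)
    print("global pairs:", attention_pairs(64, 64))
    print("window pairs:", attention_pairs(64, 64, window=16))
    print(block_schedule(12))
```

Under these assumptions, a 1024x1024 input gives a 64x64 token grid, and restricting attention to 16x16 windows cuts the number of query-key pairs by a factor of 16 compared with global attention, which is why only a few blocks need to propagate information across windows.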

Results

Task | Dataset | Metric | Value | Model
Object Detection | COCO-O | Average mAP | 34.3 | ViTDet (ViT-H)
Object Detection | COCO-O | Effective Robustness | 7.89 | ViTDet (ViT-H)
Object Detection | COCO minival | box AP | 61.3 | ViTDet, ViT-H Cascade (multiscale)
Object Detection | COCO minival | box AP | 60.4 | ViTDet, ViT-H Cascade
Object Detection | LVIS v1.0 val | box AP | 53.4 | ViTDet-H
Object Detection | LVIS v1.0 val | box AP | 51.2 | ViTDet-L
Object Detection, Few-Shot Object Detection | Artaxor | mAP | 23.4 | ViTDeT-FT
Object Detection, Few-Shot Object Detection | NEU-DET | mAP | 15.8 | ViTDeT-FT
Object Detection, Few-Shot Object Detection | DIOR | mAP | 29.4 | ViTDeT-FT
Object Detection, Few-Shot Object Detection | Clipark1k | mAP | 25.6 | ViTDeT-FT
Object Detection, Few-Shot Object Detection | DeepFish | mAP | 6.5 | ViTDeT-FT
Object Detection, Few-Shot Object Detection | UODD | mAP | 15.8 | ViTDeT-FT
Instance Segmentation | COCO minival | mask AP | 53.1 | ViTDet, ViT-H Cascade (multiscale)
Instance Segmentation | COCO minival | mask AP | 52 | ViTDet, ViT-H Cascade
Instance Segmentation | LVIS v1.0 val | mask AP | 48.1 | ViTDet-H
Instance Segmentation | LVIS v1.0 val | mask APr | 36.9 | ViTDet-H
Instance Segmentation | LVIS v1.0 val | mask AP | 46 | ViTDet-L
Instance Segmentation | LVIS v1.0 val | mask APr | 34.3 | ViTDet-L

Related Papers

2025-07-17 | SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation
2025-07-17 | A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains
2025-07-17 | RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images
2025-07-17 | Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection
2025-07-17 | Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis
2025-07-16 | Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios
2025-07-15 | Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping
2025-07-08 | Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation