Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 box AP on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
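As a rough illustration of the first adaptation mentioned above, the PyTorch sketch below builds multi-scale feature maps from the single stride-16 ViT output with deconvolutions and max-pooling, instead of tapping multiple backbone stages as in FPN. This is a minimal sketch, not the reference Detectron2 implementation: the dimensions (`vit_dim`, `out_dim`) are illustrative defaults and the normalization layers used between convolutions in the paper are omitted for brevity.

```python
import torch
import torch.nn as nn


class SimpleFeaturePyramid(nn.Module):
    """Sketch of ViTDet's simple feature pyramid: every level is produced from
    the single stride-16 ViT feature map by up-/down-sampling, rather than by
    combining maps from multiple backbone stages as in FPN."""

    def __init__(self, vit_dim: int = 768, out_dim: int = 256):
        super().__init__()
        # Rescale the stride-16 map to strides {4, 8, 16, 32}.
        self.scale_ops = nn.ModuleList([
            nn.Sequential(                                            # stride 4: two 2x deconvs
                nn.ConvTranspose2d(vit_dim, vit_dim // 2, 2, stride=2),
                nn.GELU(),
                nn.ConvTranspose2d(vit_dim // 2, vit_dim // 4, 2, stride=2),
            ),
            nn.ConvTranspose2d(vit_dim, vit_dim // 2, 2, stride=2),   # stride 8: one 2x deconv
            nn.Identity(),                                            # stride 16: as-is
            nn.MaxPool2d(kernel_size=2, stride=2),                    # stride 32: 2x pooling
        ])
        dims = [vit_dim // 4, vit_dim // 2, vit_dim, vit_dim]
        # Per level: 1x1 conv to the common pyramid dim, then a 3x3 conv
        # (the paper's norm layers are omitted in this sketch).
        self.lateral = nn.ModuleList([nn.Conv2d(d, out_dim, 1) for d in dims])
        self.output = nn.ModuleList([nn.Conv2d(out_dim, out_dim, 3, padding=1) for _ in dims])

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (B, vit_dim, H/16, W/16), the reshaped output of the last ViT block.
        return [o(l(s(x))) for s, l, o in zip(self.scale_ops, self.lateral, self.output)]


if __name__ == "__main__":
    feats = SimpleFeaturePyramid()(torch.randn(1, 768, 64, 64))  # e.g. a 1024x1024 input
    print([tuple(f.shape) for f in feats])  # pyramid levels at strides 4, 8, 16, 32
```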
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | COCO-O | Average mAP | 34.3 | ViTDet (ViT-H) |
| Object Detection | COCO-O | Effective Robustness | 7.89 | ViTDet (ViT-H) |
| Object Detection | COCO minival | box AP | 61.3 | ViTDet, ViT-H Cascade (multiscale) |
| Object Detection | COCO minival | box AP | 60.4 | ViTDet, ViT-H Cascade |
| Object Detection | LVIS v1.0 val | box AP | 53.4 | ViTDet-H |
| Object Detection | LVIS v1.0 val | box AP | 51.2 | ViTDet-L |
| Instance Segmentation | COCO minival | mask AP | 53.1 | ViTDet, ViT-H Cascade (multiscale) |
| Instance Segmentation | COCO minival | mask AP | 52.0 | ViTDet, ViT-H Cascade |
| Instance Segmentation | LVIS v1.0 val | mask AP | 48.1 | ViTDet-H |
| Instance Segmentation | LVIS v1.0 val | mask APr | 36.9 | ViTDet-H |
| Instance Segmentation | LVIS v1.0 val | mask AP | 46.0 | ViTDet-L |
| Instance Segmentation | LVIS v1.0 val | mask APr | 34.3 | ViTDet-L |
| Few-Shot Object Detection | ArTaxOr | mAP | 23.4 | ViTDet-FT |
| Few-Shot Object Detection | NEU-DET | mAP | 15.8 | ViTDet-FT |
| Few-Shot Object Detection | DIOR | mAP | 29.4 | ViTDet-FT |
| Few-Shot Object Detection | Clipart1k | mAP | 25.6 | ViTDet-FT |
| Few-Shot Object Detection | DeepFish | mAP | 6.5 | ViTDet-FT |
| Few-Shot Object Detection | UODD | mAP | 15.8 | ViTDet-FT |
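As noted in the abstract, ViTDet is released as part of Detectron2 (under `projects/ViTDet`). The snippet below is a minimal loading sketch using Detectron2's LazyConfig API; the config path assumes a local clone of the detectron2 repository, and the checkpoint path is a placeholder, so the exact config files and trained weights should be taken from the ViTDet project page.

```python
# Minimal inference-setup sketch, assuming detectron2 is installed and its
# repository is cloned locally (the config path is relative to the repo root).
from detectron2.config import LazyConfig, instantiate
from detectron2.checkpoint import DetectionCheckpointer

cfg = LazyConfig.load("projects/ViTDet/configs/COCO/mask_rcnn_vitdet_b_100ep.py")
model = instantiate(cfg.model)
model.eval()

# Placeholder path: download a trained ViTDet checkpoint from the project's
# model zoo and point to it here.
DetectionCheckpointer(model).load("path/to/vitdet_checkpoint.pkl")
```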