Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 box AP on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
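As a rough illustration of the first adaptation mentioned above, the PyTorch sketch below builds multi-scale feature maps from the single stride-16 ViT output with deconvolutions and max-pooling, instead of tapping multiple backbone stages as in FPN. This is a minimal sketch, not the reference Detectron2 implementation: the dimensions (`vit_dim`, `out_dim`) are illustrative defaults and the normalization layers used between convolutions in the paper are omitted for brevity.

```python
import torch
import torch.nn as nn


class SimpleFeaturePyramid(nn.Module):
    """Sketch of ViTDet's simple feature pyramid: every level is produced from
    the single stride-16 ViT feature map by up-/down-sampling, rather than by
    combining maps from multiple backbone stages as in FPN."""

    def __init__(self, vit_dim: int = 768, out_dim: int = 256):
        super().__init__()
        # Rescale the stride-16 map to strides {4, 8, 16, 32}.
        self.scale_ops = nn.ModuleList([
            nn.Sequential(                                            # stride 4: two 2x deconvs
                nn.ConvTranspose2d(vit_dim, vit_dim // 2, 2, stride=2),
                nn.GELU(),
                nn.ConvTranspose2d(vit_dim // 2, vit_dim // 4, 2, stride=2),
            ),
            nn.ConvTranspose2d(vit_dim, vit_dim // 2, 2, stride=2),   # stride 8: one 2x deconv
            nn.Identity(),                                            # stride 16: as-is
            nn.MaxPool2d(kernel_size=2, stride=2),                    # stride 32: 2x pooling
        ])
        dims = [vit_dim // 4, vit_dim // 2, vit_dim, vit_dim]
        # Per level: 1x1 conv to the common pyramid dim, then a 3x3 conv
        # (the paper's norm layers are omitted in this sketch).
        self.lateral = nn.ModuleList([nn.Conv2d(d, out_dim, 1) for d in dims])
        self.output = nn.ModuleList([nn.Conv2d(out_dim, out_dim, 3, padding=1) for _ in dims])

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (B, vit_dim, H/16, W/16), the reshaped output of the last ViT block.
        return [o(l(s(x))) for s, l, o in zip(self.scale_ops, self.lateral, self.output)]


if __name__ == "__main__":
    feats = SimpleFeaturePyramid()(torch.randn(1, 768, 64, 64))  # e.g. a 1024x1024 input
    print([tuple(f.shape) for f in feats])  # pyramid levels at strides 4, 8, 16, 32
```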
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | COCO-O | Average mAP | 34.3 | ViTDet (ViT-H) |
| Object Detection | COCO-O | Effective Robustness | 7.89 | ViTDet (ViT-H) |
| Object Detection | COCO minival | box AP | 61.3 | ViTDet, ViT-H Cascade (multiscale) |
| Object Detection | COCO minival | box AP | 60.4 | ViTDet, ViT-H Cascade |
| Object Detection | LVIS v1.0 val | box AP | 53.4 | ViTDet-H |
| Object Detection | LVIS v1.0 val | box AP | 51.2 | ViTDet-L |
| Instance Segmentation | COCO minival | mask AP | 53.1 | ViTDet, ViT-H Cascade (multiscale) |
| Instance Segmentation | COCO minival | mask AP | 52.0 | ViTDet, ViT-H Cascade |
| Instance Segmentation | LVIS v1.0 val | mask AP | 48.1 | ViTDet-H |
| Instance Segmentation | LVIS v1.0 val | mask APr | 36.9 | ViTDet-H |
| Instance Segmentation | LVIS v1.0 val | mask AP | 46.0 | ViTDet-L |
| Instance Segmentation | LVIS v1.0 val | mask APr | 34.3 | ViTDet-L |
| Few-Shot Object Detection | ArTaxOr | mAP | 23.4 | ViTDet-FT |
| Few-Shot Object Detection | NEU-DET | mAP | 15.8 | ViTDet-FT |
| Few-Shot Object Detection | DIOR | mAP | 29.4 | ViTDet-FT |
| Few-Shot Object Detection | Clipart1k | mAP | 25.6 | ViTDet-FT |
| Few-Shot Object Detection | DeepFish | mAP | 6.5 | ViTDet-FT |
| Few-Shot Object Detection | UODD | mAP | 15.8 | ViTDet-FT |
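As noted in the abstract, ViTDet is released as part of Detectron2 (under `projects/ViTDet`). The snippet below is a minimal loading sketch using Detectron2's LazyConfig API; the config path assumes a local clone of the detectron2 repository, and the checkpoint path is a placeholder, so the exact config files and trained weights should be taken from the ViTDet project page.

```python
# Minimal inference-setup sketch, assuming detectron2 is installed and its
# repository is cloned locally (the config path is relative to the repo root).
from detectron2.config import LazyConfig, instantiate
from detectron2.checkpoint import DetectionCheckpointer

cfg = LazyConfig.load("projects/ViTDet/configs/COCO/mask_rcnn_vitdet_b_100ep.py")
model = instantiate(cfg.model)
model.eval()

# Placeholder path: download a trained ViTDet checkpoint from the project's
# model zoo and point to it here.
DetectionCheckpointer(model).load("path/to/vitdet_checkpoint.pkl")
```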