Vision Transformer Adapter for Dense Predictions

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao

2022-05-17

Tasks: Panoptic Segmentation, Real-Time Object Detection, Semantic Segmentation, Instance Segmentation, Object Detection

Abstract

This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers from inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	Cityscapes val	mIoU	85.8	ViT-Adapter-L
Semantic Segmentation	ADE20K val	mIoU	60.5	ViT-Adapter-L (Mask2Former, BEiT pretrain)
Semantic Segmentation	ADE20K val	mIoU	58.4	ViT-Adapter-L (UperNet, BEiT pretrain)
Semantic Segmentation	PASCAL Context	mIoU	68.2	ViT-Adapter-L (Mask2Former, BEiT pretrain)
Semantic Segmentation	PASCAL Context	mIoU	67.5	ViT-Adapter-L (UperNet, BEiT pretrain)
Semantic Segmentation	ADE20K	Params (M)	571	ViT-Adapter-L (Mask2Former, BEiTv2 pretrain)
Semantic Segmentation	ADE20K	Validation mIoU	61.5	ViT-Adapter-L (Mask2Former, BEiTv2 pretrain)
Semantic Segmentation	ADE20K	Params (M)	571	ViT-Adapter-L (Mask2Former, BEiT pretrain)
Semantic Segmentation	ADE20K	Validation mIoU	60.5	ViT-Adapter-L (Mask2Former, BEiT pretrain)
Semantic Segmentation	ADE20K	Params (M)	451	ViT-Adapter-L (UperNet, BEiT pretrain)
Semantic Segmentation	ADE20K	Validation mIoU	58.4	ViT-Adapter-L (UperNet, BEiT pretrain)
Object Detection	COCO test-dev	box mAP	60.9	ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale)
Object Detection	COCO test-dev	box mAP	60.4	ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale)
Object Detection	COCO-O	Average mAP	34.25	ViT-Adapter (BEiTv2-L)
Object Detection	COCO-O	Effective Robustness	7.79	ViT-Adapter (BEiTv2-L)
Object Detection	COCO minival	box AP	60.5	ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale)
Object Detection	COCO minival	box AP	60.2	ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale)
Instance Segmentation	COCO minival	mask AP	54.2	ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale)
Instance Segmentation	COCO minival	mask AP	52.5	ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale)
Instance Segmentation	COCO minival	mask AP	52.2	ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale)
Instance Segmentation	COCO test-dev	mask AP	54.5	ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale)
Instance Segmentation	COCO test-dev	mask AP	53.0	ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale)
Instance Segmentation	COCO test-dev	mask AP	52.5	ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale)
Panoptic Segmentation	COCO minival	AP	48.9	ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former)
Panoptic Segmentation	COCO minival	PQ	58.4	ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former)
Panoptic Segmentation	COCO minival	PQst	48.4	ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former)
Panoptic Segmentation	COCO minival	PQth	65	ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former)
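
Several rows in the results table above come in pairs whose model labels differ only in the pre-training (BEiT versus BEiTv2). The short sketch below works out the difference for each such pair. The values are copied from the table; the dictionary layout and the `pretrain_gains` helper are illustrative assumptions, and the comparison assumes that the paired runs differ in nothing beyond what the labels state.

```python
# Values copied from the results table: (dataset, metric, pretrain) -> score.
# Only rows whose model labels differ solely in BEiT vs. BEiTv2 are listed.
RESULTS = {
    ("COCO test-dev", "box mAP", "BEiT"): 60.4,
    ("COCO test-dev", "box mAP", "BEiTv2"): 60.9,
    ("COCO minival", "box AP", "BEiT"): 60.2,
    ("COCO minival", "box AP", "BEiTv2"): 60.5,
    ("COCO minival", "mask AP", "BEiT"): 52.2,
    ("COCO minival", "mask AP", "BEiTv2"): 52.5,
    ("COCO test-dev", "mask AP", "BEiT"): 52.5,
    ("COCO test-dev", "mask AP", "BEiTv2"): 53.0,
    ("ADE20K", "Validation mIoU", "BEiT"): 60.5,
    ("ADE20K", "Validation mIoU", "BEiTv2"): 61.5,
}


def pretrain_gains(results):
    """Return BEiTv2 minus BEiT for every (dataset, metric) that has both."""
    gains = {}
    for (dataset, metric, pretrain), value in results.items():
        if pretrain != "BEiTv2":
            continue
        baseline = results.get((dataset, metric, "BEiT"))
        if baseline is not None:
            # Round to one decimal, the precision the table reports.
            gains[(dataset, metric)] = round(value - baseline, 1)
    return gains


if __name__ == "__main__":
    for (dataset, metric), gain in pretrain_gains(RESULTS).items():
        print(f"{dataset:14s} {metric:16s} {gain:+.1f}")
```

On these rows the differences range from 0.3 to 1.0 points, with the largest on ADE20K validation mIoU.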
