Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
This work investigates a simple yet powerful dense prediction task adapter for the Vision Transformer (ViT). Unlike recent variants that build vision-specific inductive biases into their architectures, the plain ViT performs worse on dense prediction tasks because of its weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows a plain ViT to achieve performance comparable to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter introduces image-related inductive biases into the model, making it suitable for these tasks. We verify the ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
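
As a rough illustration of the idea in the abstract (a plain ViT token stream plus a separate, pre-training-free branch that injects image-specific local information), the following pure-Python sketch may help. Everything in it is an assumption made for illustration: the function names `patch_tokens`, `local_prior`, and `inject`, the 3x3 mean filter standing in for a convolution-like prior, and the scalar `gate`. It is not the authors' implementation. Under these assumptions, a `gate` of 0.0 leaves the tokens unchanged, so the plain backbone's behavior is preserved when fine-tuning starts.

```python
# Hypothetical, simplified sketch of "plain ViT tokens + adapter branch".
# All names and design choices here are illustrative assumptions.
from typing import List

Image = List[List[float]]
Vector = List[float]


def patch_tokens(image: Image, patch: int) -> List[Vector]:
    """Stand-in for a plain ViT patch embedding: flatten non-overlapping patches."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def local_prior(image: Image, patch: int) -> List[Vector]:
    """Stand-in for the adapter branch: a 3x3 mean filter (a local,
    convolution-like prior), patchified so it aligns with the tokens."""
    height, width = len(image), len(image[0])
    smoothed = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            neighbours = [image[rr][cc]
                          for rr in range(max(0, r - 1), min(height, r + 2))
                          for cc in range(max(0, c - 1), min(width, c + 2))]
            smoothed[r][c] = sum(neighbours) / len(neighbours)
    return patch_tokens(smoothed, patch)


def inject(tokens: List[Vector], prior: List[Vector], gate: float) -> List[Vector]:
    """Add the gated prior features to the backbone tokens."""
    return [[t + gate * p for t, p in zip(token, feature)]
            for token, feature in zip(tokens, prior)]


if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    tokens = patch_tokens(image, patch=2)
    prior = local_prior(image, patch=2)
    # gate=0.0 reproduces the plain tokens; a non-zero gate mixes in the prior.
    print(inject(tokens, prior, gate=0.0) == tokens)
    print(inject(tokens, prior, gate=0.5)[0])
```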
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | Cityscapes val | mIoU | 85.8 | ViT-Adapter-L |
| Semantic Segmentation | ADE20K val | mIoU | 60.5 | ViT-Adapter-L (Mask2Former, BEiT pretrain) |
| Semantic Segmentation | ADE20K val | mIoU | 58.4 | ViT-Adapter-L (UperNet, BEiT pretrain) |
| Semantic Segmentation | PASCAL Context | mIoU | 68.2 | ViT-Adapter-L (Mask2Former, BEiT pretrain) |
| Semantic Segmentation | PASCAL Context | mIoU | 67.5 | ViT-Adapter-L (UperNet, BEiT pretrain) |
| Semantic Segmentation | ADE20K val | mIoU | 61.5 | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) |
| Semantic Segmentation | ADE20K | Params (M) | 571 | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) |
| Semantic Segmentation | ADE20K | Params (M) | 571 | ViT-Adapter-L (Mask2Former, BEiT pretrain) |
| Semantic Segmentation | ADE20K | Params (M) | 451 | ViT-Adapter-L (UperNet, BEiT pretrain) |
| Object Detection | COCO test-dev | box mAP | 60.9 | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) |
| Object Detection | COCO test-dev | box mAP | 60.4 | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) |
| Object Detection | COCO-O | Average mAP | 34.25 | ViT-Adapter (BEiTv2-L) |
| Object Detection | COCO-O | Effective Robustness | 7.79 | ViT-Adapter (BEiTv2-L) |
| Object Detection | COCO minival | box AP | 60.5 | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) |
| Object Detection | COCO minival | box AP | 60.2 | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) |
| Instance Segmentation | COCO minival | mask AP | 54.2 | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) |
| Instance Segmentation | COCO minival | mask AP | 52.5 | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) |
| Instance Segmentation | COCO minival | mask AP | 52.2 | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) |
| Instance Segmentation | COCO test-dev | mask AP | 54.5 | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) |
| Instance Segmentation | COCO test-dev | mask AP | 53.0 | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) |
| Instance Segmentation | COCO test-dev | mask AP | 52.5 | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) |
| Panoptic Segmentation | COCO minival | AP | 48.9 | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) |
| Panoptic Segmentation | COCO minival | PQ | 58.4 | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) |
| Panoptic Segmentation | COCO minival | PQst | 48.4 | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) |
| Panoptic Segmentation | COCO minival | PQth | 65 | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) |
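
To make the effect of the pre-training choice easier to read off the table above, the short sketch below computes the BEiT to BEiTv2 gain for rows that otherwise share the same configuration (HTC++ with multi-scale testing for COCO, Mask2Former for ADE20K). The numbers are copied from the table; the dictionary layout and the helper name `pretrain_gain` are assumptions made for illustration only.

```python
# Scores copied from the results table; structure and names are illustrative.
results = {
    ("COCO test-dev", "box mAP"): {"BEiT": 60.4, "BEiTv2": 60.9},
    ("COCO test-dev", "mask AP"): {"BEiT": 52.5, "BEiTv2": 53.0},
    ("COCO minival", "box AP"): {"BEiT": 60.2, "BEiTv2": 60.5},
    ("COCO minival", "mask AP"): {"BEiT": 52.2, "BEiTv2": 52.5},
    ("ADE20K val", "mIoU"): {"BEiT": 60.5, "BEiTv2": 61.5},
}


def pretrain_gain(scores: dict) -> float:
    """Gain from switching BEiT to BEiTv2, rounded to the table's precision."""
    return round(scores["BEiTv2"] - scores["BEiT"], 1)


gains = {key: pretrain_gain(scores) for key, scores in results.items()}

for (dataset, metric), gain in gains.items():
    print(f"{dataset:<14} {metric:<8} +{gain:.1f}")
```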