Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vision Transformer Adapter for Dense Predictions

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao

2022-05-17 · Panoptic Segmentation · Real-Time Object Detection · Semantic Segmentation · Instance Segmentation · Object Detection

Paper · PDF · Code (official) · Code

Abstract

This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
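The core mechanism the abstract describes — a pre-training-free adapter that injects image-related inductive biases from a convolutional spatial prior into a plain ViT's tokens — can be illustrated with a minimal sketch. This is an assumption-laden toy: it uses plain single-head dot-product cross-attention without learned projections, whereas the actual ViT-Adapter uses sparse (deformable) attention with learned weights; the function name `injector` and all shapes are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def injector(vit_tokens, spatial_feats, gamma=0.5):
    """Toy cross-attention injector: ViT patch tokens (queries) attend to
    multi-scale conv spatial-prior features (keys/values); the attended
    result is added back to the tokens, scaled by gamma.

    vit_tokens:    (N, d) tokens from a plain ViT block
    spatial_feats: (M, d) flattened features from a conv spatial prior
    """
    d = vit_tokens.shape[-1]
    attn = softmax(vit_tokens @ spatial_feats.T / np.sqrt(d))
    return vit_tokens + gamma * (attn @ spatial_feats)

# Toy shapes: 196 ViT patch tokens (14x14), 1029 multi-scale prior
# tokens, embedding dim 64 -- all chosen arbitrarily for the demo.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))
priors = rng.standard_normal((1029, 64))
out = injector(tokens, priors)
print(out.shape)  # (196, 64)
```

Because the injection is a residual addition, the plain ViT backbone and its large-scale pre-trained weights are left untouched; only the adapter branch is task-specific, which is what lets the same backbone transfer across the dense prediction tasks listed below.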

Results

| Task | Dataset | Metric | Value | Model |
| Semantic Segmentation | Cityscapes val | mIoU | 85.8 | ViT-Adapter-L |
| Semantic Segmentation | ADE20K val | mIoU | 60.5 | ViT-Adapter-L (Mask2Former, BEiT pretrain) |
| Semantic Segmentation | ADE20K val | mIoU | 58.4 | ViT-Adapter-L (UperNet, BEiT pretrain) |
| Semantic Segmentation | PASCAL Context | mIoU | 68.2 | ViT-Adapter-L (Mask2Former, BEiT pretrain) |
| Semantic Segmentation | PASCAL Context | mIoU | 67.5 | ViT-Adapter-L (UperNet, BEiT pretrain) |
| Semantic Segmentation | ADE20K | Params (M) | 571 | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) |
| Semantic Segmentation | ADE20K | Validation mIoU | 61.5 | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) |
| Semantic Segmentation | ADE20K | Params (M) | 571 | ViT-Adapter-L (Mask2Former, BEiT pretrain) |
| Semantic Segmentation | ADE20K | Validation mIoU | 60.5 | ViT-Adapter-L (Mask2Former, BEiT pretrain) |
| Semantic Segmentation | ADE20K | Params (M) | 451 | ViT-Adapter-L (UperNet, BEiT pretrain) |
| Semantic Segmentation | ADE20K | Validation mIoU | 58.4 | ViT-Adapter-L (UperNet, BEiT pretrain) |
| Object Detection | COCO test-dev | box mAP | 60.9 | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) |
| Object Detection | COCO test-dev | box mAP | 60.4 | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) |
| Object Detection | COCO-O | Average mAP | 34.25 | ViT-Adapter (BEiTv2-L) |
| Object Detection | COCO-O | Effective Robustness | 7.79 | ViT-Adapter (BEiTv2-L) |
| Object Detection | COCO minival | box AP | 60.5 | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) |
| Object Detection | COCO minival | box AP | 60.2 | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) |
| Instance Segmentation | COCO minival | mask AP | 54.2 | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) |
| Instance Segmentation | COCO minival | mask AP | 52.5 | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) |
| Instance Segmentation | COCO minival | mask AP | 52.2 | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) |
| Instance Segmentation | COCO test-dev | mask AP | 54.5 | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) |
| Instance Segmentation | COCO test-dev | mask AP | 53.0 | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) |
| Instance Segmentation | COCO test-dev | mask AP | 52.5 | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) |
| Panoptic Segmentation | COCO minival | AP | 48.9 | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) |
| Panoptic Segmentation | COCO minival | PQ | 58.4 | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) |
| Panoptic Segmentation | COCO minival | PQst | 48.4 | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) |
| Panoptic Segmentation | COCO minival | PQth | 65.0 | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) |

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
- Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)