Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo

2021-03-25 | ICCV 2021 10 | Thermal Image Segmentation, Image Classification, Real-Time Object Detection, Semantic Segmentation, Instance Segmentation, Object Detection

Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
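The windowing idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the 8x8 toy feature map, the window size of 4, and the shift of 2 are all assumptions chosen for illustration, and realizing the shift as a cyclic roll is one possible approach assumed here. The attention computation itself, and any masking a shifted layout may need at the image border, are left out. The sketch shows how a feature map splits into non-overlapping windows, how a shift makes one window span several of the previous windows (the cross-window connection), and why the number of query-key pairs grows linearly with image size when the window size is fixed.

```python
import numpy as np


def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    Assumes H and W are divisible by window_size.
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Put the two window-grid axes first, then flatten each window's pixels.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window_size * window_size, C)


def cyclic_shift(x, shift):
    """Roll the feature map up and to the left by `shift` pixels (assumed scheme)."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))


def attention_pair_count(H, W, window_size=None):
    """Count query-key pairs for global attention or for windowed attention."""
    tokens = H * W
    if window_size is None:
        return tokens * tokens        # every token attends to every token
    return tokens * window_size ** 2  # every token attends within its window


# Toy 8x8 single-channel map; each value is the pixel's flat index.
feat = np.arange(64).reshape(8, 8, 1)
regular = window_partition(feat, 4)                   # 4 windows of 16 pixels
shifted = window_partition(cyclic_shift(feat, 2), 4)  # same grid, shifted content

# Which regular windows do the pixels of the first shifted window come from?
source_windows = {int((v // 8 // 4) * 2 + (v % 8) // 4) for v in shifted[0, :, 0]}
```

In this toy setup the first shifted window draws pixels from all four regular windows, so attention inside it mixes information that the regular layout kept separate. Doubling the side length from 8 to 16 multiplies the windowed pair count by 4 (the same factor as the pixel count), while the global count grows by 16.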

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K val | mIoU | 53.5 | Swin-L (UperNet, ImageNet-22k pretrain) |
| Semantic Segmentation | ADE20K val | mIoU | 49.7 | Swin-B (UperNet, ImageNet-1k pretrain) |
| Semantic Segmentation | FoodSeg103 | mIoU | 41.6 | Swin-Transformer (Swin-Small) |
| Semantic Segmentation | ADE20K | Test Score | 62.8 | Swin-L (UperNet, ImageNet-22k pretrain) |
| Semantic Segmentation | ADE20K | Validation mIoU | 53.5 | Swin-L (UperNet, ImageNet-22k pretrain) |
| Semantic Segmentation | ADE20K | Validation mIoU | 49.7 | Swin-B (UperNet, ImageNet-1k pretrain) |
| Semantic Segmentation | MFN Dataset | mIOU | 49 | SwinT |
| Object Detection | COCO test-dev | box mAP | 58.7 | Swin-L (HTC++, multi scale) |
| Object Detection | COCO test-dev | box mAP | 57.7 | Swin-L (HTC++, single scale) |
| Object Detection | COCO minival | box AP | 58 | Swin-L (HTC++, multi scale) |
| Object Detection | COCO minival | box AP | 57.1 | Swin-L (HTC++, single scale) |
| Image Classification | OmniBenchmark | Average Top-1 Accuracy | 46.4 | SwinTransformer |
| Image Classification | ImageNet | GFLOPs | 103.9 | Swin-L |
| Image Classification | ImageNet | GFLOPs | 47 | Swin-B |
| Image Classification | ImageNet | GFLOPs | 4.5 | Swin-T |
| 3D | COCO test-dev | box mAP | 58.7 | Swin-L (HTC++, multi scale) |
| 3D | COCO test-dev | box mAP | 57.7 | Swin-L (HTC++, single scale) |
| 3D | COCO minival | box AP | 58 | Swin-L (HTC++, multi scale) |
| 3D | COCO minival | box AP | 57.1 | Swin-L (HTC++, single scale) |
| Instance Segmentation | COCO minival | mask AP | 50.4 | Swin-L (HTC++, multi scale) |
| Instance Segmentation | COCO minival | mask AP | 49.5 | Swin-L (HTC++, single scale) |
| Instance Segmentation | Occluded COCO | Mean Recall | 62.9 | Swin-B + Cascade Mask R-CNN |
| Instance Segmentation | Occluded COCO | Mean Recall | 61.14 | Swin-S + Mask R-CNN |
| Instance Segmentation | Occluded COCO | Mean Recall | 58.81 | Swin-T + Mask R-CNN |
| Instance Segmentation | Separated COCO | Mean Recall | 36.31 | Swin-B + Cascade Mask R-CNN |
| Instance Segmentation | Separated COCO | Mean Recall | 33.67 | Swin-S + Mask R-CNN |
| Instance Segmentation | Separated COCO | Mean Recall | 31.94 | Swin-T + Mask R-CNN |
| Instance Segmentation | COCO test-dev | mask AP | 51.1 | Swin-L (HTC++, multi scale) |
| Instance Segmentation | COCO test-dev | mask AP | 50.2 | Swin-L (HTC++, single scale) |
| 2D Classification | COCO test-dev | box mAP | 58.7 | Swin-L (HTC++, multi scale) |
| 2D Classification | COCO test-dev | box mAP | 57.7 | Swin-L (HTC++, single scale) |
| 2D Classification | COCO minival | box AP | 58 | Swin-L (HTC++, multi scale) |
| 2D Classification | COCO minival | box AP | 57.1 | Swin-L (HTC++, single scale) |
| Scene Segmentation | MFN Dataset | mIOU | 49 | SwinT |
| 2D Object Detection | COCO test-dev | box mAP | 58.7 | Swin-L (HTC++, multi scale) |
| 2D Object Detection | COCO test-dev | box mAP | 57.7 | Swin-L (HTC++, single scale) |
| 2D Object Detection | COCO minival | box AP | 58 | Swin-L (HTC++, multi scale) |
| 2D Object Detection | COCO minival | box AP | 57.1 | Swin-L (HTC++, single scale) |
| 2D Object Detection | MFN Dataset | mIOU | 49 | SwinT |
| 10-shot image generation | ADE20K val | mIoU | 53.5 | Swin-L (UperNet, ImageNet-22k pretrain) |
| 10-shot image generation | ADE20K val | mIoU | 49.7 | Swin-B (UperNet, ImageNet-1k pretrain) |
| 10-shot image generation | FoodSeg103 | mIoU | 41.6 | Swin-Transformer (Swin-Small) |
| 10-shot image generation | ADE20K | Test Score | 62.8 | Swin-L (UperNet, ImageNet-22k pretrain) |
| 10-shot image generation | ADE20K | Validation mIoU | 53.5 | Swin-L (UperNet, ImageNet-22k pretrain) |
| 10-shot image generation | ADE20K | Validation mIoU | 49.7 | Swin-B (UperNet, ImageNet-1k pretrain) |
| 10-shot image generation | MFN Dataset | mIOU | 49 | SwinT |
| 16k | COCO test-dev | box mAP | 58.7 | Swin-L (HTC++, multi scale) |
| 16k | COCO test-dev | box mAP | 57.7 | Swin-L (HTC++, single scale) |
| 16k | COCO minival | box AP | 58 | Swin-L (HTC++, multi scale) |
| 16k | COCO minival | box AP | 57.1 | Swin-L (HTC++, single scale) |

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)