Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo

Published: 2021-03-25. Venue: ICCV 2021.

Tasks: Thermal Image Segmentation, Image Classification, Real-Time Object Detection, Semantic Segmentation, Instance Segmentation, Object Detection

Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
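The windowing idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released code: the function names, the 8x8 toy feature map, the window size of 4, and the half-window shift implemented with `np.roll` are all assumptions chosen for illustration. The sketch shows (1) splitting a feature map into non-overlapping windows, (2) shifting the grid so that a window spans several of the original windows, which is what lets information cross window boundaries, and (3) why the number of attention pairs grows linearly with the number of patches when the window size is fixed.

```python
import numpy as np


def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    Assumes H and W are divisible by window_size.
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-index axes together
    return x.reshape(-1, window_size * window_size, C)


def shifted_window_partition(x, window_size):
    """Partition after cyclically shifting the map by half a window.

    Assumption for this sketch: the shift is a wrap-around roll of
    window_size // 2 along both spatial axes.
    """
    shift = window_size // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window_size)


def attention_pairs(h, w, window_size=None):
    """Count query-key pairs scored by self-attention over h * w patches.

    Global attention scores every pair: (h * w) ** 2.
    Windowed attention scores only pairs inside a window:
    (h * w) * window_size ** 2, which is linear in h * w.
    """
    n = h * w
    if window_size is None:
        return n * n
    return n * window_size * window_size


# Toy 8x8 map whose single channel stores each patch's flat index.
feature_map = np.arange(8 * 8).reshape(8, 8, 1)

regular = window_partition(feature_map, window_size=4)
shifted = shifted_window_partition(feature_map, window_size=4)

# For the first shifted window, find which regular windows its patches came from.
rows, cols = np.divmod(shifted[0, :, 0], 8)
source_windows = {(int(r) // 4, int(c) // 4) for r, c in zip(rows, cols)}

print(regular.shape)            # (num_windows, patches_per_window, channels)
print(sorted(source_windows))   # regular windows bridged by one shifted window
```

In this toy setup, the first shifted window draws patches from all four regular windows, so attention computed inside it connects windows that were isolated in the previous layer. A full implementation would also need to prevent attention between patches that become neighbours only through the wrap-around; that masking step is omitted here. Doubling both sides of the map multiplies `attention_pairs(h, w)` by 16 but `attention_pairs(h, w, window_size)` only by 4, matching the linear-complexity claim.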

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ADE20K val	mIoU	53.5	Swin-L (UperNet, ImageNet-22k pretrain)
Semantic Segmentation	ADE20K val	mIoU	49.7	Swin-B (UperNet, ImageNet-1k pretrain)
Semantic Segmentation	FoodSeg103	mIoU	41.6	Swin-Transformer (Swin-Small)
Semantic Segmentation	ADE20K	Test Score	62.8	Swin-L (UperNet, ImageNet-22k pretrain)
Semantic Segmentation	MFN Dataset	mIoU	49	SwinT
Object Detection	COCO test-dev	box mAP	58.7	Swin-L (HTC++, multi scale)
Object Detection	COCO test-dev	box mAP	57.7	Swin-L (HTC++, single scale)
Object Detection	COCO minival	box AP	58	Swin-L (HTC++, multi scale)
Object Detection	COCO minival	box AP	57.1	Swin-L (HTC++, single scale)
Image Classification	OmniBenchmark	Average Top-1 Accuracy	46.4	SwinTransformer
Image Classification	ImageNet	GFLOPs	103.9	Swin-L
Image Classification	ImageNet	GFLOPs	47	Swin-B
Image Classification	ImageNet	GFLOPs	4.5	Swin-T
Instance Segmentation	COCO minival	mask AP	50.4	Swin-L (HTC++, multi scale)
Instance Segmentation	COCO minival	mask AP	49.5	Swin-L (HTC++, single scale)
Instance Segmentation	Occluded COCO	Mean Recall	62.9	Swin-B + Cascade Mask R-CNN
Instance Segmentation	Occluded COCO	Mean Recall	61.14	Swin-S + Mask R-CNN
Instance Segmentation	Occluded COCO	Mean Recall	58.81	Swin-T + Mask R-CNN
Instance Segmentation	Separated COCO	Mean Recall	36.31	Swin-B + Cascade Mask R-CNN
Instance Segmentation	Separated COCO	Mean Recall	33.67	Swin-S + Mask R-CNN
Instance Segmentation	Separated COCO	Mean Recall	31.94	Swin-T + Mask R-CNN
Instance Segmentation	COCO test-dev	mask AP	51.1	Swin-L (HTC++, multi scale)
Instance Segmentation	COCO test-dev	mask AP	50.2	Swin-L (HTC++, single scale)
