gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window

Mocho Go, Hideyuki Tachibana

2022-08-24

Image Classification, Semantic Segmentation, Instance Segmentation, Object Detection

Abstract

Following its success in the language domain, the self-attention mechanism (transformer) has been adopted in the vision domain and has recently achieved great success there. As a separate stream, the multi-layer perceptron (MLP) has also been explored for vision. These architectures, which depart from traditional CNNs, have attracted considerable attention, and many methods have been proposed. To combine parameter efficiency and performance with the locality and hierarchy that suit image recognition, we propose gSwin, which merges the two streams: Swin Transformer and (multi-head) gMLP. We show that gSwin achieves better accuracy than Swin Transformer on three vision tasks (image classification, object detection, and semantic segmentation) with a smaller model size.
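As a rough illustration of how the two streams might fit together, the sketch below applies a gMLP-style spatial gate inside Swin-style shifted windows. It is not the authors' implementation: the function names (`cyclic_shift`, `window_partition`, `spatial_gating`) are hypothetical, the gate is single-head, and the normalization of the gated half and any masking of wrapped-around tokens are left out to keep it short.

```python
from typing import List

Matrix = List[List[float]]  # rows = tokens, columns = channels


def cyclic_shift(grid: List[list], shift: int) -> List[list]:
    """Roll a 2-D grid of tokens up and left by `shift`, wrapping around."""
    rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rows]


def window_partition(grid: List[list], ws: int) -> List[list]:
    """Split an H x W grid into non-overlapping ws x ws windows of tokens."""
    h, w = len(grid), len(grid[0])
    assert h % ws == 0 and w % ws == 0, "grid must be divisible by window size"
    windows = []
    for top in range(0, h, ws):
        for left in range(0, w, ws):
            windows.append(
                [grid[top + i][left + j] for i in range(ws) for j in range(ws)]
            )
    return windows


def spatial_gating(tokens: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Split channels into halves (u, v), mix v across tokens, return u * mixed v.

    Assumption: single head, no normalization of v (simplified for illustration).
    """
    n, half = len(tokens), len(tokens[0]) // 2
    out = []
    for i in range(n):
        row = []
        for k in range(half):
            # Token-mixing projection along the spatial (token) axis.
            mixed = bias[i] + sum(
                weight[i][j] * tokens[j][half + k] for j in range(n)
            )
            row.append(tokens[i][k] * mixed)
        out.append(row)
    return out


# Toy input: 4 x 4 grid, 2 channels per token (channel 0 = token index, channel 1 = 1.0).
grid = [[[float(r * 4 + c), 1.0] for c in range(4)] for r in range(4)]
windows = window_partition(cyclic_shift(grid, 1), ws=2)

# Zero weights and unit bias make the gate pass the first channel half through unchanged.
n_tokens = 4
weight = [[0.0] * n_tokens for _ in range(n_tokens)]
bias = [1.0] * n_tokens
gated = [spatial_gating(win, weight, bias) for win in windows]
```

Because the token-mixing weights act only within a window, their size depends on the window size rather than on the image resolution, which is one plausible reading of how locality and parameter efficiency are combined here.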

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ADE20K val	Pixel Accuracy	83.43	gSwin-S
Semantic Segmentation	ADE20K val	mIoU	49.69	gSwin-S
Semantic Segmentation	ADE20K val	Pixel Accuracy	82.6	gSwin-T
Semantic Segmentation	ADE20K val	mIoU	47.63	gSwin-T
Semantic Segmentation	ADE20K val	Pixel Accuracy	81.79	gSwin-VT
Semantic Segmentation	ADE20K val	mIoU	45.07	gSwin-VT
Image Classification	ImageNet	GFLOPs	7	gSwin-S
Image Classification	ImageNet	GFLOPs	3.6	gSwin-T
Image Classification	ImageNet	GFLOPs	2.3	gSwin-VT
Instance Segmentation	COCO test-dev	mask AP	45.03	gSwin-S
Instance Segmentation	COCO test-dev	mask AP	44.16	gSwin-T
Instance Segmentation	COCO test-dev	mask AP	42.87	gSwin-VT
