When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

Guangting Wang, Yucheng Zhao, Chuanxin Tang, Chong Luo, Wenjun Zeng

2022-01-26

Image Classification, Semantic Segmentation, Object Detection

Abstract

The attention mechanism is widely believed to be the key to the success of vision transformers (ViTs), since it provides a flexible and powerful way to model spatial relationships. However, is the attention mechanism truly an indispensable part of ViT? Can it be replaced by some other alternative? To demystify the role of the attention mechanism, we simplify it into an extremely simple case: ZERO FLOP and ZERO parameter. Concretely, we revisit the shift operation. It does not contain any parameter or arithmetic calculation. The only operation is to exchange a small portion of the channels between neighboring features. Based on this simple operation, we construct a new backbone network, namely ShiftViT, where the attention layers in ViT are substituted by shift operations. Surprisingly, ShiftViT works quite well in several mainstream tasks, e.g., classification, detection, and segmentation. The performance is on par with or even better than the strong baseline Swin Transformer. These results suggest that the attention mechanism might not be the vital factor that makes ViT successful. It can even be replaced by a zero-parameter operation. We should pay more attention to the remaining parts of ViT in future work. Code is available at github.com/microsoft/SPACH.
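
The shift operation described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the shift directions, the step size, the fraction of channels moved, or the border handling. The sketch therefore assumes four directions, a one-pixel step, zero filling at the borders, and a hypothetical `div` parameter that sets each shifted group to `C // div` channels. The function name `shift_features` is illustrative and is not taken from the SPACH repository.

```python
import numpy as np


def shift_features(x, div=12):
    """Shift a small portion of channels by one pixel (illustrative sketch).

    x   : array of shape (C, H, W)
    div : hypothetical parameter; each of the four shifted groups
          holds C // div channels (must be >= 4 so groups fit in C)
    """
    if div < 4:
        raise ValueError("div must be >= 4 so four groups fit in C channels")
    c = x.shape[0]
    g = c // div  # channels per shifted group (assumption)

    # Vacated border positions stay zero (assumed border handling).
    out = np.zeros_like(x)

    # Group 0: each position takes the value of its right neighbor.
    out[0 * g:1 * g, :, :-1] = x[0 * g:1 * g, :, 1:]
    # Group 1: each position takes the value of its left neighbor.
    out[1 * g:2 * g, :, 1:] = x[1 * g:2 * g, :, :-1]
    # Group 2: each position takes the value of the neighbor below.
    out[2 * g:3 * g, :-1, :] = x[2 * g:3 * g, 1:, :]
    # Group 3: each position takes the value of the neighbor above.
    out[3 * g:4 * g, 1:, :] = x[3 * g:4 * g, :-1, :]
    # Remaining channels are copied unchanged.
    out[4 * g:] = x[4 * g:]
    return out


# Usage: 12 channels on a 3x3 grid, one channel per shifted group.
x = np.arange(12 * 3 * 3, dtype=np.float32).reshape(12, 3, 3)
y = shift_features(x, div=12)
```

The function body contains only slice copies: no learned weights and no multiply or add on feature values, which is consistent with the zero-parameter, zero-FLOP framing in the abstract.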

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ADE20K	Validation mIoU	49.2	Shift-B (UperNet)
Semantic Segmentation	ADE20K	Validation mIoU	47.9	Shift-B
Semantic Segmentation	ADE20K	Validation mIoU	47.8	Shift-S
Semantic Segmentation	ADE20K	Validation mIoU	46.3	Shift-T
Object Detection	COCO minival	APM	42.3	Shift-T
Image Classification	ImageNet	GFLOPs	15.2	Shift-B
Image Classification	ImageNet	GFLOPs	8.5	Shift-S
Image Classification	ImageNet	GFLOPs	4.4	Shift-T
