ELSA: Enhanced Local Self-Attention for Vision Transformer

Jingkai Zhou, Pichao Wang, Fan Wang, Qiong Liu, Hao Li, Rong Jin

2021-12-23

Image Classification, Semantic Segmentation, Instance Segmentation, Object Detection

Abstract

Self-attention is powerful in modeling long-range dependencies, but it is weak in local finer-level feature learning. The performance of local self-attention (LSA) is just on par with convolution and inferior to dynamic filters, which leaves researchers unsure whether to use LSA or its counterparts, which one is better, and what makes LSA mediocre. To clarify these questions, we comprehensively investigate LSA and its counterparts from two sides: channel setting and spatial processing. We find that the devil lies in the generation and application of spatial attention, where relative position embeddings and the neighboring filter application are key factors. Based on these findings, we propose the enhanced local self-attention (ELSA) with Hadamard attention and the ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention in the neighboring case, while maintaining the high-order mapping. The ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA. Without architecture or hyperparameter modification, drop-in replacing LSA with ELSA boosts Swin Transformer by up to +1.4 top-1 accuracy. ELSA also consistently benefits VOLO from D1 to D5, where ELSA-VOLO-D5 achieves 87.2 on ImageNet-1K without extra training images. In addition, we evaluate ELSA in downstream tasks. ELSA significantly improves the baseline by up to +1.9 box AP / +1.3 mask AP on COCO, and by up to +1.9 mIoU on ADE20K. Code is available at https://github.com/damo-cv/ELSA.
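For readers unfamiliar with the baseline the abstract discusses, the following is a minimal sketch of plain local self-attention (LSA) in the neighboring case, where each pixel attends only to a small window around it. It is not ELSA: Hadamard attention and the ghost head are not reproduced here, since the abstract does not specify them. The function name `local_self_attention`, the identity query/key/value projections, the single head, and the border handling (out-of-bounds neighbors are simply skipped) are all assumptions made for illustration.

```python
import math

def local_self_attention(x, window=3):
    """Plain single-head LSA over a window x window neighbourhood.

    x is an H x W x C feature map given as nested lists. Query, key and
    value projections are taken as identity (an illustrative assumption).
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    r = window // 2
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            q = x[i][j]
            # In-bounds neighbours only; border pixels see fewer keys.
            neigh = [x[a][b]
                     for a in range(max(0, i - r), min(h, i + r + 1))
                     for b in range(max(0, j - r), min(w, j + r + 1))]
            # Scaled dot-product logits between the query and each neighbour.
            logits = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(c)
                      for k in neigh]
            # Numerically stable softmax over the neighbourhood.
            m = max(logits)
            weights = [math.exp(l - m) for l in logits]
            z = sum(weights)
            # Output is the attention-weighted sum of neighbour values.
            for wgt, v in zip(weights, neigh):
                for ch in range(c):
                    out[i][j][ch] += (wgt / z) * v[ch]
    return out
```

Because the attention weights at each pixel sum to one, a constant feature map passes through unchanged, which makes a convenient sanity check.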

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ADE20K val	mIoU	50.3	ELSA-Swin-S
Object Detection	COCO minival	AP50	70.5	ELSA-S (Cascade Mask RCNN)
Object Detection	COCO minival	AP75	56.0	ELSA-S (Cascade Mask RCNN)
Object Detection	COCO minival	box AP	51.6	ELSA-S (Cascade Mask RCNN)
Object Detection	COCO minival	AP50	70.4	ELSA-S (Mask RCNN)
Object Detection	COCO minival	AP75	52.9	ELSA-S (Mask RCNN)
Object Detection	COCO minival	box AP	48.3	ELSA-S (Mask RCNN)
Image Classification	ImageNet	GFLOPs	437	ELSA-VOLO-D5 (512*512)
Image Classification	ImageNet	GFLOPs	8	ELSA-VOLO-D1
Image Classification	ImageNet	GFLOPs	4.8	ELSA-Swin-T
Instance Segmentation	COCO minival	AP50	67.8	ELSA-S (Cascade Mask RCNN)
Instance Segmentation	COCO minival	AP75	47.8	ELSA-S (Cascade Mask RCNN)
Instance Segmentation	COCO minival	mask AP	44.4	ELSA-S (Cascade Mask RCNN)
Instance Segmentation	COCO minival	AP50	67.3	ELSA-S (Mask RCNN)
Instance Segmentation	COCO minival	AP75	46.4	ELSA-S (Mask RCNN)
Instance Segmentation	COCO minival	mask AP	43.0	ELSA-S (Mask RCNN)
