ELSA: Enhanced Local Self-Attention for Vision Transformer

Jingkai Zhou, Pichao Wang, Fan Wang, Qiong Liu, Hao Li, Rong Jin

2021-12-23
Tasks: Image Classification, Semantic Segmentation, Instance Segmentation, Object Detection

Abstract

Self-attention is powerful in modeling long-range dependencies, but it is weak in local finer-level feature learning. The performance of local self-attention (LSA) is just on par with convolution and inferior to dynamic filters, which leaves researchers unsure whether to use LSA or its counterparts, which one is better, and what makes LSA mediocre. To clarify these questions, we comprehensively investigate LSA and its counterparts from two sides: channel setting and spatial processing. We find that the devil lies in the generation and application of spatial attention, where relative position embeddings and the neighboring filter application are key factors. Based on these findings, we propose the enhanced local self-attention (ELSA) with Hadamard attention and the ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention in the neighboring case, while maintaining the high-order mapping. The ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA. Without any architecture or hyperparameter modification, replacing LSA with ELSA as a drop-in boosts Swin Transformer by up to +1.4 in top-1 accuracy. ELSA also consistently benefits VOLO from D1 to D5, where ELSA-VOLO-D5 achieves 87.2 on ImageNet-1K without extra training images. In addition, we evaluate ELSA in downstream tasks. ELSA significantly improves the baseline by up to +1.9 box AP / +1.3 mask AP on COCO, and by up to +1.9 mIoU on ADE20K. Code is available at https://github.com/damo-cv/ELSA.
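To make the two ideas in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation (see the official repository for that) and rests on assumptions that go beyond what the abstract states: a single head, zero padding at the borders, attention logits produced by a learned linear map `w_attn` applied to the element-wise product of query and key at each position, and a ghost head that expands one attention map into per-channel maps via static matrices `ghost_mul` and `ghost_add`. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def neighbors(x, k):
    """Gather the k*k neighborhood of every pixel.

    (H, W, C) -> (H, W, k*k, C), with zero padding at the borders.
    """
    H, W, _ = x.shape
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    patches = [xp[dy:dy + H, dx:dx + W] for dy in range(k) for dx in range(k)]
    return np.stack(patches, axis=2)

def local_self_attention(q, key, v, k=3):
    """Baseline LSA: dot-product logits between each query and its k*k neighboring keys."""
    kn, vn = neighbors(key, k), neighbors(v, k)
    logits = np.einsum("hwc,hwnc->hwn", q, kn) / np.sqrt(q.shape[-1])
    attn = softmax(logits, axis=-1)
    return np.einsum("hwn,hwnc->hwc", attn, vn)

def hadamard_ghost_attention(q, key, v, w_attn, ghost_mul, ghost_add, k=3):
    """Assumed Hadamard-style attention with a ghost head (illustrative only).

    w_attn:    (C, k*k) learned map from the Hadamard product to k*k logits.
    ghost_mul: (k*k, C) static multiplicative matrix, one column per channel.
    ghost_add: (k*k, C) static additive matrix, one column per channel.
    """
    logits = (q * key) @ w_attn              # Hadamard product, then linear map: (H, W, k*k)
    attn = softmax(logits, axis=-1)          # one attention map per pixel
    # Ghost head: turn the single map into C channel-specific maps.
    attn_c = attn[..., None] * ghost_mul + ghost_add   # (H, W, k*k, C)
    return (attn_c * neighbors(v, k)).sum(axis=2)      # (H, W, C)

# Small demo on random data.
rng = np.random.default_rng(0)
H, W, C, k = 4, 5, 8, 3
q, key, v = (rng.normal(size=(H, W, C)) for _ in range(3))
w_attn = rng.normal(size=(C, k * k)) * 0.1
ghost_mul = np.ones((k * k, C))
ghost_add = np.zeros((k * k, C))

print(local_self_attention(q, key, v, k).shape)
print(hadamard_ghost_attention(q, key, v, w_attn, ghost_mul, ghost_add, k).shape)
```

In this sketch the baseline needs one dot product per neighbor, while the Hadamard variant computes all k*k logits for a pixel from a single element-wise product. With `ghost_mul` set to ones and `ghost_add` to zeros, the ghost head reduces to applying the same attention map to every channel.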

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K val | mIoU | 50.3 | ELSA-Swin-S |
| Semantic Segmentation | ADE20K | Validation mIoU | 50.3 | ELSA-Swin-S |
| Object Detection | COCO minival | AP50 | 70.5 | ELSA-S (Cascade Mask RCNN) |
| Object Detection | COCO minival | AP75 | 56 | ELSA-S (Cascade Mask RCNN) |
| Object Detection | COCO minival | box AP | 51.6 | ELSA-S (Cascade Mask RCNN) |
| Object Detection | COCO minival | AP50 | 70.4 | ELSA-S (Mask RCNN) |
| Object Detection | COCO minival | AP75 | 52.9 | ELSA-S (Mask RCNN) |
| Object Detection | COCO minival | box AP | 48.3 | ELSA-S (Mask RCNN) |
| Image Classification | ImageNet | GFLOPs | 437 | ELSA-VOLO-D5 (512*512) |
| Image Classification | ImageNet | GFLOPs | 8 | ELSA-VOLO-D1 |
| Image Classification | ImageNet | GFLOPs | 4.8 | ELSA-Swin-T |
| Instance Segmentation | COCO minival | AP50 | 67.8 | ELSA-S (Cascade Mask RCNN) |
| Instance Segmentation | COCO minival | AP75 | 47.8 | ELSA-S (Cascade Mask RCNN) |
| Instance Segmentation | COCO minival | mask AP | 44.4 | ELSA-S (Cascade Mask RCNN) |
| Instance Segmentation | COCO minival | AP50 | 67.3 | ELSA-S (Mask RCNN) |
| Instance Segmentation | COCO minival | AP75 | 46.4 | ELSA-S (Mask RCNN) |
| Instance Segmentation | COCO minival | mask AP | 43 | ELSA-S (Mask RCNN) |

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)