Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Locally-Grouped Self-Attention

General · Introduced 2000 · 2 papers
Source Paper

Description

Locally-Grouped Self-Attention (LSA) is a local attention mechanism used in the Twins-SVT architecture. Motivated by the group design in depthwise convolutions for efficient inference, the 2D feature maps are first divided equally into sub-windows, so that self-attention communication happens only within each sub-window. This design also resonates with the multi-head design in self-attention, where communication occurs only within the channels of the same head. Specifically, the feature maps are divided into $m \times n$ sub-windows. Without loss of generality, assume $H \% m = 0$ and $W \% n = 0$. Each group contains $\frac{HW}{mn}$ elements, so the cost of self-attention within one window is $\mathcal{O}\left(\frac{H^{2} W^{2}}{m^{2} n^{2}} d\right)$, and the total cost is $\mathcal{O}\left(\frac{H^{2} W^{2}}{mn} d\right)$. Letting $k_{1} = \frac{H}{m}$ and $k_{2} = \frac{W}{n}$, the cost can be written as $\mathcal{O}\left(k_{1} k_{2} H W d\right)$, which is significantly more efficient when $k_{1} \ll H$ and $k_{2} \ll W$, and which grows linearly with $HW$ if $k_{1}$ and $k_{2}$ are fixed.
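The grouping described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the Twins-SVT implementation: it uses a single head and omits the learned query/key/value projections, keeping only the core idea of restricting attention to each $k_1 \times k_2$ sub-window.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lsa(x, m, n):
    """Locally-grouped self-attention sketch.

    x: (H, W, d) feature map; m, n: number of sub-windows per axis.
    Single head, no q/k/v projections (simplification for illustration).
    """
    H, W, d = x.shape
    assert H % m == 0 and W % n == 0
    k1, k2 = H // m, W // n  # sub-window size
    # Group pixels by sub-window: (m, n, k1*k2, d)
    w = x.reshape(m, k1, n, k2, d).transpose(0, 2, 1, 3, 4)
    w = w.reshape(m, n, k1 * k2, d)
    # Self-attention restricted to each sub-window:
    # each window attends over its own k1*k2 tokens only.
    attn = softmax(w @ w.transpose(0, 1, 3, 2) / np.sqrt(d), axis=-1)
    out = attn @ w  # (m, n, k1*k2, d)
    # Undo the grouping back to (H, W, d)
    out = out.reshape(m, n, k1, k2, d).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, d)

y = lsa(np.random.randn(8, 8, 16), m=2, n=2)
print(y.shape)  # (8, 8, 16)
```

With m = n = 1 this degenerates to global self-attention over all HW tokens, which makes the cost comparison in the text concrete: the attention matrix per window is (k1·k2) × (k1·k2) rather than HW × HW.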

Although the locally-grouped self-attention mechanism is computation-friendly, the image is divided into non-overlapping sub-windows, so a mechanism is needed to communicate between different sub-windows, as in Swin. Otherwise, information would only be processed locally, which keeps the receptive field small and significantly degrades performance, as shown in our experiments. This resembles the fact that we cannot replace all standard convolutions with depthwise convolutions in CNNs.

Papers Using This Method

Logically at Factify 2: A Multi-Modal Fact Checking System Based on Evidence Retrieval techniques and Transformer Encoder Architecture (2023-01-09)

Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021-04-28)