
Global Sub-Sampled Attention

General · Introduced 2021 · 2 papers

Source Paper

Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Description

Global Sub-Sampled Attention, or GSA, is a global attention mechanism used in the Twins-SVT architecture, where it is paired with locally-grouped self-attention (LSA).

A single representative is used to summarize the key information for each of the $m \times n$ sub-windows, and these representatives communicate with the other sub-windows (serving as the keys in self-attention). With sub-windows of size $k_1 \times k_2$ (so $m = H / k_1$ and $n = W / k_2$), this reduces the cost to $\mathcal{O}(m n H W d) = \mathcal{O}\left(\frac{H^{2} W^{2} d}{k_{1} k_{2}}\right)$. This is essentially equivalent to using a sub-sampled feature map as the key in the attention operation, hence the name global sub-sampled attention (GSA).
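Below is a minimal PyTorch sketch of this idea, not the exact Twins-SVT code: a strided convolution summarizes each sub-window into a single representative token, and the full-resolution queries attend to these summarized keys and values. The class name, the `sr_ratio` parameter, and the choice of a strided convolution as the summarizer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalSubSampledAttention(nn.Module):
    """Sketch of GSA: queries at full resolution attend to keys/values
    computed from a sub-sampled feature map (one token per sub-window).
    Hypothetical names; the strided-conv summarizer is an assumption."""

    def __init__(self, dim, num_heads=8, sr_ratio=7):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Summarize each sr_ratio x sr_ratio sub-window into one representative.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        hd = C // self.num_heads
        q = self.q(x).reshape(B, N, self.num_heads, hd).transpose(1, 2)

        # Sub-sample the feature map: one representative per sub-window.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # (B, m*n, C)
        x_ = self.norm(x_)

        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, hd).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]  # each (B, heads, m*n, hd)

        # Full-resolution queries attend to the m*n summarized keys:
        # cost O(m*n*H*W*d) instead of O(H^2 W^2 d) for full attention.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

With `sr_ratio=7` on a $56 \times 56$ feature map, the keys and values shrink from 3136 tokens to 64, while the queries keep full resolution.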

If we alternately use LSA and GSA, in the spirit of separable convolutions (depth-wise followed by point-wise), the total computational cost is $\mathcal{O}\left(\frac{H^{2} W^{2} d}{k_{1} k_{2}}+k_{1} k_{2} H W d\right)$. By the AM-GM inequality $a + b \geq 2\sqrt{ab}$, we have:

$$\frac{H^{2} W^{2} d}{k_{1} k_{2}}+k_{1} k_{2} H W d \geq 2 H W d \sqrt{H W}$$

The minimum is obtained when $k_{1} \cdot k_{2} = \sqrt{H W}$. Note that $H = W = 224$ is popular in classification. Without loss of generality, square sub-windows are used, i.e., $k_{1} = k_{2}$, so $k_{1} = k_{2} = 15$ is close to the global minimum for $H = W = 224$. However, the network is designed with several stages operating at different resolutions. Stage 1 has $56 \times 56$ feature maps, for which the minimum is obtained at $k_{1} = k_{2} = \sqrt{56} \approx 7$. Theoretically, optimal $k_{1}$ and $k_{2}$ could be calibrated for each stage; for simplicity, $k_{1} = k_{2} = 7$ is used everywhere. For stages with lower resolutions, the summarizing window size of GSA is controlled to avoid generating too few keys: sizes of 4, 2, and 1 are used for the last three stages, respectively.
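As a quick sanity check on these choices, the cost expression can be evaluated numerically for Stage 1 (a throwaway script; $d$ is a common factor and is set to 1):

```python
# Cost of alternating LSA + GSA with square k x k sub-windows:
# cost(k) = (H*W)^2 / k^2 + k^2 * H * W   (d factored out)
H = W = 56  # Stage 1 resolution
for k in (2, 4, 7, 8, 14, 28):
    cost = (H * W) ** 2 / k**2 + k**2 * H * W
    print(f"k={k:2d}  cost={cost:12,.0f}")
# k = 7 and k = 8 bracket the optimum sqrt(56) ~ 7.48; both give 354,368,
# close to the theoretical minimum 2*H*W*sqrt(H*W) = 351,232.
```

Both $k = 7$ and $k = 8$ land within about 1% of the lower bound, which supports using $k_1 = k_2 = 7$ throughout.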

Papers Using This Method

Logically at Factify 2: A Multi-Modal Fact Checking System Based on Evidence Retrieval techniques and Transformer Encoder Architecture (2023-01-09)

Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021-04-28)