Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Spatial-Reduction Attention

General · Introduced 2021 · 29 papers
Source Paper: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (Wang et al., 2021)

Description

Spatial-Reduction Attention, or SRA, is a multi-head attention module used in the Pyramid Vision Transformer (PVT) architecture. It reduces the spatial scale of the key $K$ and value $V$ before the attention operation, which lowers the computational and memory overhead. The SRA in Stage $i$ can be formulated as follows:

$$\text{SRA}(Q, K, V) = \text{Concat}\left(\text{head}_0, \ldots, \text{head}_{N_i}\right) W^O$$

$$\text{head}_j = \text{Attention}\left(Q W_j^Q, \text{SR}(K) W_j^K, \text{SR}(V) W_j^V\right)$$

where $\text{Concat}(\cdot)$ is the concatenation operation, and $W_j^Q \in \mathbb{R}^{C_i \times d_{\text{head}}}$, $W_j^K \in \mathbb{R}^{C_i \times d_{\text{head}}}$, $W_j^V \in \mathbb{R}^{C_i \times d_{\text{head}}}$, and $W^O \in \mathbb{R}^{C_i \times C_i}$ are linear projection parameters. $N_i$ is the number of attention heads in Stage $i$, so the dimension of each head is $d_{\text{head}} = \frac{C_i}{N_i}$. $\text{SR}(\cdot)$ is the operation that reduces the spatial dimension of the input sequence ($K$ or $V$), written as:

$$\text{SR}(\mathbf{x}) = \text{Norm}\left(\text{Reshape}\left(\mathbf{x}, R_i\right) W^S\right)$$

Here, $\mathbf{x} \in \mathbb{R}^{(H_i W_i) \times C_i}$ represents an input sequence, and $R_i$ denotes the reduction ratio of the attention layers in Stage $i$. $\text{Reshape}(\mathbf{x}, R_i)$ reshapes the input sequence $\mathbf{x}$ to a sequence of size $\frac{H_i W_i}{R_i^2} \times \left(R_i^2 C_i\right)$. $W^S \in \mathbb{R}^{\left(R_i^2 C_i\right) \times C_i}$ is a linear projection that reduces the channel dimension of the input sequence back to $C_i$. $\text{Norm}(\cdot)$ refers to layer normalization. Since $\text{SR}(\cdot)$ shrinks the length of $K$ and $V$ by a factor of $R_i^2$, the attention map costs $O\!\left(\frac{(H_i W_i)^2}{R_i^2}\right)$ rather than the $O\!\left((H_i W_i)^2\right)$ of standard multi-head attention.
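The formulas above can be sketched as a minimal NumPy implementation. This is an illustrative reference, not the PVT code: the parameter names (`Wq`, `Wk`, `Wv`, `Wo`, `W_S`) are hypothetical, and real implementations typically fuse $\text{Reshape}(\mathbf{x}, R_i)\,W^S$ into a single strided convolution.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Norm(.): layer normalization over the channel dimension (no affine params)
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def spatial_reduction(x, H, W, R, W_S):
    # SR(x) = Norm(Reshape(x, R) W^S)
    # Reshape(x, R): group each R x R patch of tokens into one token of size R^2 * C
    B, N, C = x.shape
    x = x.reshape(B, H // R, R, W // R, R, C)
    x = x.transpose(0, 1, 3, 2, 4, 5).reshape(B, (H // R) * (W // R), R * R * C)
    return layer_norm(x @ W_S)                 # W_S: (R^2 * C, C), back to C channels

def sra(x, H, W, R, num_heads, Wq, Wk, Wv, Wo, W_S):
    B, N, C = x.shape
    d = C // num_heads                         # d_head = C_i / N_i
    kv = spatial_reduction(x, H, W, R, W_S)    # K and V share the reduced sequence

    def split_heads(t, Wp):                    # Wp stacks the per-head W_j projections
        t = t @ Wp
        return t.reshape(B, -1, num_heads, d).transpose(0, 2, 1, 3)

    q = split_heads(x, Wq)                     # (B, heads, N, d_head)
    k = split_heads(kv, Wk)                    # (B, heads, N / R^2, d_head)
    v = split_heads(kv, Wv)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d))
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(B, N, C)
    return out @ Wo                            # Concat(head_0, ..., head_N) W^O
```

Note that only `k` and `v` are built from the reduced sequence; the query keeps all $H_i W_i$ tokens, so the output length is unchanged while the attention matrix shrinks by $R_i^2$.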

Papers Using This Method

- GLOVA: Global and Local Variation-Aware Analog Circuit Design with Risk-Sensitive Reinforcement Learning (2025-05-16)
- Crystal Oscillators in OSNMA-Enabled Receivers: An Implementation View for Automotive Applications (2025-01-25)
- Multipath Mitigation Technology-integrated GNSS Direct Position Estimation Plug-in Module (2024-11-20)
- HRPVT: High-Resolution Pyramid Vision Transformer for medium and small-scale human pose estimation (2024-10-29)
- Low-Rank Continual Pyramid Vision Transformer: Incrementally Segment Whole-Body Organs in CT with Light-Weighted Adaptation (2024-10-07)
- Recasting Generic Pretrained Vision Transformers As Object-Centric Scene Encoders For Manipulation Policies (2024-05-24)
- Rethinking Attention Gated with Hybrid Dual Pyramid Transformer-CNN for Generalized Segmentation in Medical Imaging (2024-04-28)
- Multi-Layer Dense Attention Decoder for Polyp Segmentation (2024-03-27)
- Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis (2024-03-26)
- ROI-Aware Multiscale Cross-Attention Vision Transformer for Pest Image Identification (2023-12-28)
- Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition (2023-11-02)
- SeUNet-Trans: A Simple yet Effective UNet-Transformer Model for Medical Image Segmentation (2023-10-16)
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention (2023-09-04)
- A denoised Mean Teacher for domain adaptive point cloud registration (2023-06-26)
- A 3-step Low-latency Low-Power Multichannel Time-to-Digital Converter based on Time Residual Amplifier (2023-06-01)
- Neural correlates of cognitive ability and visuo-motor speed: validation of IDoCT on UK Biobank Data (2023-05-30)
- PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer (2023-05-11)
- Sector Patch Embedding: An Embedding Module Conforming to The Distortion Pattern of Fisheye Image (2023-03-26)
- Exploring the Relationship Between Architectural Design and Adversarially Robust Generalization (2023-01-01)
- Chasing Clouds: Differentiable Volumetric Rasterisation of Point Clouds as a Highly Efficient and Accurate Loss for Large-Scale Deformable 3D Registration (2023-01-01)