Spatial-Reduction Attention, or SRA, is a multi-head attention module used in the Pyramid Vision Transformer architecture which reduces the spatial scale of the key and value before the attention operation. This reduces the computational and memory overhead of the attention layer. SRA in Stage $i$ can be formulated as follows:
$$ \text{SRA}(Q, K, V)=\text{Concat}\left(\text{head}_{0}, \ldots, \text{head}_{N_{i}}\right) W^{O} $$

$$ \text{head}_{j}=\text{Attention}\left(Q W_{j}^{Q}, \text{SR}(K) W_{j}^{K}, \text{SR}(V) W_{j}^{V}\right) $$

where $\text{Concat}(\cdot)$ is the concatenation operation. $W_{j}^{Q} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, $W_{j}^{K} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, $W_{j}^{V} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, and $W^{O} \in \mathbb{R}^{C_{i} \times C_{i}}$ are linear projection parameters. $N_{i}$ is the head number of the attention layer in Stage $i$. Therefore, the dimension of each head (i.e. $d_{\text{head}}$) is equal to $\frac{C_{i}}{N_{i}}$. $\text{SR}(\cdot)$ is the operation for reducing the spatial dimension of the input sequence ($K$ or $V$), which is written as:

$$ \text{SR}(\mathbf{x})=\text{Norm}\left(\text{Reshape}\left(\mathbf{x}, R_{i}\right) W^{S}\right) $$
Here, $\mathbf{x} \in \mathbb{R}^{\left(H_{i} W_{i}\right) \times C_{i}}$ represents an input sequence, and $R_{i}$ denotes the reduction ratio of the attention layers in Stage $i$. $\text{Reshape}\left(\mathbf{x}, R_{i}\right)$ is an operation of reshaping the input sequence $\mathbf{x}$ to a sequence of size $\frac{H_{i} W_{i}}{R_{i}^{2}} \times\left(R_{i}^{2} C_{i}\right)$. $W^{S} \in \mathbb{R}^{\left(R_{i}^{2} C_{i}\right) \times C_{i}}$ is a linear projection that reduces the dimension of the input sequence to $C_{i}$. $\text{Norm}(\cdot)$ refers to layer normalization.
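The shapes above can be sketched in plain NumPy. This is a minimal illustration, not the PVT reference implementation: all weight names and the helper functions (`spatial_reduce`, `sra`, a non-affine `layer_norm`) are made up for the example, the reshape groups the spatial grid into $R_i \times R_i$ patches, and biases are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm over the channel dimension (no learnable affine, for brevity)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def spatial_reduce(x, H, W, R, W_S):
    # Reshape (H*W, C) -> (H*W/R^2, R^2*C) by grouping R x R patches,
    # then project back to C and apply layer normalization
    HW, C = x.shape
    x = x.reshape(H // R, R, W // R, R, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(HW // R**2, R * R * C)
    return layer_norm(x @ W_S)                      # (H*W/R^2, C)

def sra(x, H, W, R, num_heads, Wq, Wk, Wv, Wo, W_S):
    # Spatial-Reduction Attention: K and V come from the reduced sequence
    C = x.shape[-1]
    d = C // num_heads                              # per-head dim = C_i / N_i
    kv = spatial_reduce(x, H, W, R, W_S)            # (H*W/R^2, C)
    heads = []
    for j in range(num_heads):
        q = x  @ Wq[j]                              # (H*W, d)
        k = kv @ Wk[j]                              # (H*W/R^2, d)
        v = kv @ Wv[j]                              # (H*W/R^2, d)
        attn = softmax(q @ k.T / np.sqrt(d))        # (H*W, H*W/R^2)
        heads.append(attn @ v)                      # (H*W, d)
    return np.concatenate(heads, -1) @ Wo           # (H*W, C)

rng = np.random.default_rng(0)
H = W = 8; C = 16; R = 4; N = 4
x  = rng.standard_normal((H * W, C))
Wq = rng.standard_normal((N, C, C // N)) * 0.1
Wk = rng.standard_normal((N, C, C // N)) * 0.1
Wv = rng.standard_normal((N, C, C // N)) * 0.1
Wo = rng.standard_normal((C, C)) * 0.1
W_S = rng.standard_normal((R * R * C, C)) * 0.1
out = sra(x, H, W, R, N, Wq, Wk, Wv, Wo, W_S)
print(out.shape)  # (64, 16)
```

The output keeps the input resolution, while the attention matrix shrinks from $(H_i W_i) \times (H_i W_i)$ to $(H_i W_i) \times \frac{H_i W_i}{R_i^2}$ (here $64 \times 4$ instead of $64 \times 64$), which is where the compute and memory savings come from.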