Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-Head Linear Attention

General · Introduced 2020 · 24 papers
Source Paper: Linformer: Self-Attention with Linear Complexity

Description

Multi-Head Linear Attention is a linear multi-head self-attention module proposed with the Linformer architecture. The main idea is to add two linear projection matrices $E_i, F_i \in \mathbb{R}^{n \times k}$ when computing the key and value. We first project the original $(n \times d)$-dimensional key and value layers $KW_i^K$ and $VW_i^V$ into $(k \times d)$-dimensional projected key and value layers. We then compute an $(n \times k)$-dimensional context mapping $\bar{P}$ using scaled dot-product attention:

$$\bar{\text{head}_i} = \text{Attention}\left(QW_i^Q,\; E_i K W_i^K,\; F_i V W_i^V\right)$$

$$\bar{\text{head}_i} = \text{softmax}\left(\frac{QW_i^Q \left(E_i K W_i^K\right)^T}{\sqrt{d_k}}\right) \cdot F_i V W_i^V$$

Finally, we compute the context embeddings for each head as $\bar{P} \cdot \left(F_i V W_i^V\right)$.
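The computation above can be sketched for a single head in NumPy. This is a minimal illustration, not the Linformer reference implementation; the function and parameter names are ours, and the sequence-compression matrices `E` and `F` are stored with shape `(k, n)` so they can be applied on the left, matching the products $E_i K W_i^K$ and $F_i V W_i^V$ in the equations:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention_head(Q, K, V, Wq, Wk, Wv, E, F):
    """One head of Linformer-style linear attention (illustrative sketch).

    Q, K, V : (n, d_model) input sequences.
    Wq, Wk, Wv : (d_model, d_k) per-head projection matrices.
    E, F : (k, n) matrices that compress the sequence length from n to k
           for the key and value layers, respectively.
    Returns the (n, d_k) context embeddings for this head.
    """
    d_k = Wq.shape[1]
    q = Q @ Wq                  # (n, d_k) query layer
    k_proj = E @ (K @ Wk)       # (k, d_k) projected key layer  E_i K W_i^K
    v_proj = F @ (V @ Wv)       # (k, d_k) projected value layer F_i V W_i^V
    # (n, k) context mapping P_bar via scaled dot-product attention;
    # the softmax is over k entries instead of n, giving O(n*k) cost.
    P_bar = softmax(q @ k_proj.T / np.sqrt(d_k))
    return P_bar @ v_proj       # (n, d_k) context embeddings
```

Because the attention matrix is $(n \times k)$ rather than $(n \times n)$, time and memory scale linearly in the sequence length $n$ when $k$ is held fixed.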

Papers Using This Method

FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device (2025-06-12)
CacheFormer: High Attention-Based Segment Caching (2025-04-18)
HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution (2024-12-04)
LinFormer: A Linear-based Lightweight Transformer Architecture For Time-Aware MIMO Channel Prediction (2024-10-28)
Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity (2024-10-09)
GLMHA A Guided Low-rank Multi-Head Self-Attention for Efficient Image Restoration and Spectral Reconstruction (2024-10-01)
Attention as a Hypernetwork (2024-06-09)
Sumformer: Universal Approximation for Efficient Transformers (2023-07-05)
RedMotion: Motion Prediction via Redundancy Reduction (2023-06-19)
UMat: Uncertainty-Aware Single Image High Resolution Material Capture (2023-05-25)
MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention (2022-11-25)
Treeformer: Dense Gradient Trees for Efficient Attention Computation (2022-08-18)
Rethinking Attention Mechanism in Time Series Classification (2022-07-14)
Linearizing Transformer with Key-Value Memory (2022-03-23)
Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences (2021-12-10)
Greenformers: Improving Computation and Memory Efficiency in Transformer Models via Low-Rank Approximation (2021-08-24)
Vision Xformers: Efficient Attention for Image Classification (2021-07-05)
Styleformer: Transformer based Generative Adversarial Networks with Style Vector (2021-06-13)
Self-supervised Depth Estimation Leveraging Global Perception and Geometric Smoothness Using On-board Videos (2021-06-07)
A Practical Survey on Faster and Lighter Transformers (2021-03-26)