Multi-Head Linear Attention is a linear-complexity multi-head self-attention module proposed with the Linformer architecture. The main idea is to add two linear projection matrices $E_i, F_i \in \mathbb{R}^{n \times k}$ when computing the key and value. We first project the original $(n \times d)$-dimensional key and value layers $KW_i^K$ and $VW_i^V$ into $(k \times d)$-dimensional projected key and value layers. We then compute an $(n \times k)$-dimensional context mapping $\bar{P}$ using scaled dot-product attention:
$$\overline{\mathrm{head}_i} = \text{Attention}\left(QW_i^Q,\, E_iKW_i^K,\, F_iVW_i^V\right)$$
$$= \underbrace{\text{softmax}\left(\frac{QW_i^Q\left(E_iKW_i^K\right)^T}{\sqrt{d_k}}\right)}_{\bar{P}:\, n \times k} \cdot\, F_iVW_i^V$$
Finally, we compute the context embeddings for each head as $\bar{P} \cdot \left(F_iVW_i^V\right)$.
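The computation above can be sketched as a single head in NumPy. This is a minimal illustration, not the reference implementation; all names (`linformer_head`, `W_q`, etc.) are hypothetical, and the projections `E`, `F` are stored as $(k \times n)$ matrices so that the products $E_iKW_i^K$ and $F_iVW_i^V$ are well-shaped:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_head(Q, K, V, W_q, W_k, W_v, E, F):
    """One Linformer attention head (illustrative sketch).

    Q, K, V:       (n, d_model) input sequences
    W_q, W_k, W_v: (d_model, d_k) per-head weight matrices
    E, F:          (k, n) low-rank projections for keys/values
    Returns the (n, d_k) context embeddings for this head.
    """
    q = Q @ W_q                 # (n, d_k) query layer
    k_proj = E @ (K @ W_k)      # (k, d_k) projected key layer E_i K W_i^K
    v_proj = F @ (V @ W_v)      # (k, d_k) projected value layer F_i V W_i^V
    d_k = q.shape[-1]
    # (n, k) context mapping P-bar via scaled dot-product attention
    P_bar = softmax(q @ k_proj.T / np.sqrt(d_k))
    return P_bar @ v_proj       # (n, d_k) context embeddings
```

Because the softmax is taken over a $(n \times k)$ score matrix rather than $(n \times n)$, the cost is linear in the sequence length $n$ for fixed $k$.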