Multi-Head Linear Attention is a linear-complexity multi-head self-attention module proposed with the Linformer architecture. The main idea is to add two linear projection matrices $E_i, F_i \in \mathbb{R}^{n \times k}$ when computing the key and value. We first project the original $(n \times d)$-dimensional key and value layers $KW_i^K$ and $VW_i^V$ into $(k \times d)$-dimensional projected key and value layers. We then compute an $(n \times k)$-dimensional context mapping $\bar{P}$ using scaled dot-product attention:
$$\overline{\mathrm{head}_i} = \text{Attention}\left(QW_i^Q,\, E_iKW_i^K,\, F_iVW_i^V\right)$$
$$= \underbrace{\text{softmax}\left(\frac{QW_i^Q\left(E_iKW_i^K\right)^T}{\sqrt{d_k}}\right)}_{\bar{P}:\, n \times k} \cdot\, F_iVW_i^V$$
Finally, we compute the context embeddings for each head as $\bar{P} \cdot \left(F_iVW_i^V\right)$.
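The computation above can be sketched as a single head in NumPy. This is a minimal illustration, not the reference implementation; all names (`linformer_head`, `W_q`, etc.) are hypothetical, and the projections `E`, `F` are stored as $(k \times n)$ matrices so that the products $E_iKW_i^K$ and $F_iVW_i^V$ are well-shaped:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_head(Q, K, V, W_q, W_k, W_v, E, F):
    """One Linformer attention head (illustrative sketch).

    Q, K, V:       (n, d_model) input sequences
    W_q, W_k, W_v: (d_model, d_k) per-head weight matrices
    E, F:          (k, n) low-rank projections for keys/values
    Returns the (n, d_k) context embeddings for this head.
    """
    q = Q @ W_q                 # (n, d_k) query layer
    k_proj = E @ (K @ W_k)      # (k, d_k) projected key layer E_i K W_i^K
    v_proj = F @ (V @ W_v)      # (k, d_k) projected value layer F_i V W_i^V
    d_k = q.shape[-1]
    # (n, k) context mapping P-bar via scaled dot-product attention
    P_bar = softmax(q @ k_proj.T / np.sqrt(d_k))
    return P_bar @ v_proj       # (n, d_k) context embeddings
```

Because the softmax is taken over a $(n \times k)$ score matrix rather than $(n \times n)$, the cost is linear in the sequence length $n$ for fixed $k$.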