Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Attention Free Transformer

General | Introduced 2021 | 3 papers
Source Paper

Description

Attention Free Transformer, or AFT, is an efficient variant of a multi-head attention module that eschews dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query in an element-wise fashion. This operation has memory complexity linear in both the context size and the feature dimension, making it compatible with both large inputs and large model sizes.

Given the input $X$, AFT first linearly transforms it into $Q = XW^{Q}$, $K = XW^{K}$, $V = XW^{V}$, then performs the following operation:

$$Y = f(X); \quad Y_{t} = \sigma_{q}\left(Q_{t}\right) \odot \frac{\sum_{t'=1}^{T} \exp\left(K_{t'} + w_{t,t'}\right) \odot V_{t'}}{\sum_{t'=1}^{T} \exp\left(K_{t'} + w_{t,t'}\right)}$$

where $\odot$ is the element-wise product, $\sigma_{q}$ is the nonlinearity applied to the query (sigmoid by default), and $w \in \mathbb{R}^{T \times T}$ is the set of learned pairwise position biases.

Explained in words: for each target position $t$, AFT performs a weighted average of the values, and the result is combined with the query by element-wise multiplication. In particular, the weighting is composed simply of the keys and a set of learned pairwise position biases. This provides the immediate advantage of not needing to compute and store the expensive attention matrix, while maintaining the global interactions between query and values that multi-head attention (MHA) provides.
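The operation above can be sketched directly in NumPy. This is a minimal single-sequence illustration of the AFT-full equation, not the authors' implementation; the function and parameter names (`aft_full`, `W_q`, etc.) are chosen here for clarity, and the weight matrices would normally be learned. A per-position maximum is subtracted before the exponential for numerical stability, which leaves the ratio unchanged.

```python
import numpy as np

def aft_full(X, W_q, W_k, W_v, w):
    """One AFT-full layer for a single sequence (a sketch).

    X:   (T, d) input sequence
    W_q, W_k, W_v: (d, d) projection matrices
    w:   (T, T) learned pairwise position biases
    Returns Y of shape (T, d).
    """
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v

    # logits[t, t', :] = K_{t'} + w_{t, t'}, shape (T, T, d)
    logits = K[None, :, :] + w[:, :, None]
    # subtract the max over t' for numerical stability (cancels in the ratio)
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)

    # numerator: sum over t' of exp(...) * V_{t'}; denominator: sum of exp(...)
    num = np.einsum('tsd,sd->td', weights, V)
    den = weights.sum(axis=1)

    sigma_q = 1.0 / (1.0 + np.exp(-Q))  # sigmoid nonlinearity on the query
    return sigma_q * num / den
```

Note that no $T \times T \times d$ attention map needs to be materialized in an optimized implementation; the broadcast here is only for readability, and the memory cost of the method itself is linear in $T$ and $d$.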

Papers Using This Method

A Dot Product Attention Free Transformer — 2021-09-29
The 2021 Hotel-ID to Combat Human Trafficking Competition Dataset — 2021-06-10
An Attention Free Transformer — 2021-05-28