Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Attention Free Transformer

General | Introduced 2021 | 3 papers
Source Paper

Description

Attention Free Transformer, or AFT, is an efficient variant of a multi-head attention module that eschews dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query in an element-wise fashion. This operation has memory complexity linear in both the context size and the feature dimension, making it compatible with both large inputs and large model sizes.

Given the input $X$, AFT first linearly transforms it into $Q = XW^{Q}$, $K = XW^{K}$, $V = XW^{V}$, then performs the following operation:

$$Y = f(X); \quad Y_{t} = \sigma_{q}\left(Q_{t}\right) \odot \frac{\sum_{t'=1}^{T} \exp\left(K_{t'} + w_{t,t'}\right) \odot V_{t'}}{\sum_{t'=1}^{T} \exp\left(K_{t'} + w_{t,t'}\right)}$$

where $\odot$ is the element-wise product, $\sigma_{q}$ is the nonlinearity applied to the query (sigmoid by default), and $w \in \mathbb{R}^{T \times T}$ is the set of learned pairwise position biases.

Explained in words: for each target position $t$, AFT performs a weighted average of the values, and the result is combined with the query by element-wise multiplication. In particular, the weighting is composed simply of the keys and a set of learned pairwise position biases. This provides the immediate advantage of not needing to compute and store the expensive attention matrix, while maintaining the global interactions between query and values that multi-head attention (MHA) provides.
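The operation above can be sketched directly in NumPy. This is a minimal single-sequence illustration of the AFT-full equation, not the authors' implementation; the function and parameter names (`aft_full`, `W_q`, etc.) are chosen here for clarity, and the weight matrices would normally be learned. A per-position maximum is subtracted before the exponential for numerical stability, which leaves the ratio unchanged.

```python
import numpy as np

def aft_full(X, W_q, W_k, W_v, w):
    """One AFT-full layer for a single sequence (a sketch).

    X:   (T, d) input sequence
    W_q, W_k, W_v: (d, d) projection matrices
    w:   (T, T) learned pairwise position biases
    Returns Y of shape (T, d).
    """
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v

    # logits[t, t', :] = K_{t'} + w_{t, t'}, shape (T, T, d)
    logits = K[None, :, :] + w[:, :, None]
    # subtract the max over t' for numerical stability (cancels in the ratio)
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)

    # numerator: sum over t' of exp(...) * V_{t'}; denominator: sum of exp(...)
    num = np.einsum('tsd,sd->td', weights, V)
    den = weights.sum(axis=1)

    sigma_q = 1.0 / (1.0 + np.exp(-Q))  # sigmoid nonlinearity on the query
    return sigma_q * num / den
```

Note that no $T \times T \times d$ attention map needs to be materialized in an optimized implementation; the broadcast here is only for readability, and the memory cost of the method itself is linear in $T$ and $d$.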

Papers Using This Method

A Dot Product Attention Free Transformer — 2021-09-29
The 2021 Hotel-ID to Combat Human Trafficking Competition Dataset — 2021-06-10
An Attention Free Transformer — 2021-05-28