Factorized Dense Synthesized Attention is a synthesized attention mechanism, similar to dense synthesized attention, but we factorize the outputs to reduce parameters and prevent overfitting. It was proposed as part of the Synthesizer architecture. The factorized variant of the dense synthesizer can be expressed as follows:
$$A, B = F_A(X_i), F_B(X_i)$$
where $F_A(\cdot)$ projects input $X_i$ into $a$ dimensions, $F_B(\cdot)$ projects $X_i$ into $b$ dimensions, and $a \times b = l$, where $l$ is the sequence length. The output of the factorized module is now written as:
$$Y = \text{Softmax}(C)G(X)$$
where $C = H_A(A) * H_B(B)$, with $H_A$, $H_B$ being tiling functions and $C \in \mathbb{R}^{l \times l}$. A tiling function simply duplicates a vector $k$ times, i.e., $\mathbb{R}^{l} \rightarrow \mathbb{R}^{lk}$. In this case, $H_A(\cdot)$ is a projection of $\mathbb{R}^{a} \rightarrow \mathbb{R}^{ab}$ and $H_B(\cdot)$ is a projection of $\mathbb{R}^{b} \rightarrow \mathbb{R}^{ba}$. To avoid having similar values within the same block, we compose the outputs of $H_A$ and $H_B$.
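The mechanism above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: batching, multiple heads, and nonlinearities in $F_A$, $F_B$ are omitted, and using `tile` for $H_A$ but `repeat` for $H_B$ is one plausible reading of "composing" the two tilings so that values within a block differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def factorized_dense_synthesizer(X, Wa, Wb, Wg, a, b):
    """Sketch of factorized dense synthesized attention (single head).

    X:  (l, d) token representations, with l = a * b.
    Wa: (d, a) and Wb: (d, b) -- the factorized projections F_A, F_B.
    Wg: (d, d) -- the value projection G.
    """
    l, d = X.shape
    assert l == a * b, "sequence length must factor as a * b"
    A = X @ Wa                      # (l, a) = F_A(X)
    B = X @ Wb                      # (l, b) = F_B(X)
    # H_A duplicates each a-vector b times -> (l, a*b) = (l, l);
    # H_B duplicates each element of the b-vector a times -> (l, l).
    HA = np.tile(A, b)              # vector-level tiling
    HB = np.repeat(B, a, axis=-1)   # element-level repetition
    C = HA * HB                     # element-wise composition, (l, l)
    return softmax(C) @ (X @ Wg)    # Y = Softmax(C) G(X)
```

Note that the synthesized attention matrix $C$ is produced from each token independently, with only $d(a + b)$ parameters for the attention map instead of the $d \cdot l$ a full dense synthesizer would need.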