Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Location Sensitive Attention

General | Introduced 2015 | 25 papers
Source Paper: Attention-Based Models for Speech Recognition (Chorowski et al., 2015)

Description

Location Sensitive Attention is an attention mechanism that extends the additive attention mechanism to use cumulative attention weights from previous decoder time steps as an additional feature. This encourages the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder.

Starting with additive attention, where $h$ is a sequential representation from a BiRNN encoder and $s_{i-1}$ is the $(i-1)$-th state of a recurrent neural network (e.g. an LSTM or GRU):

$$e_{i,j} = w^{T}\tanh\left(Ws_{i-1} + Vh_{j} + b\right)$$
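This additive scoring rule can be sketched in a few lines of numpy. The dimension names (`d_s`, `d_h`, `d_a`) and the random toy inputs below are illustrative assumptions, not values from the source paper:

```python
import numpy as np

def additive_attention_scores(s_prev, H, W, V, w, b):
    """Additive (Bahdanau-style) attention energies.

    s_prev: (d_s,)   previous decoder state s_{i-1}
    H:      (T, d_h) encoder outputs h_j, one row per input position
    W: (d_a, d_s), V: (d_a, d_h), w: (d_a,), b: (d_a,)
    Returns e_i: (T,) unnormalized energies e_{i,j}.
    """
    # e_{i,j} = w^T tanh(W s_{i-1} + V h_j + b), vectorized over j
    return np.tanh(s_prev @ W.T + H @ V.T + b) @ w

# toy example with hypothetical dimensions
rng = np.random.default_rng(0)
T, d_s, d_h, d_a = 5, 4, 6, 8
e = additive_attention_scores(rng.normal(size=d_s),
                              rng.normal(size=(T, d_h)),
                              rng.normal(size=(d_a, d_s)),
                              rng.normal(size=(d_a, d_h)),
                              rng.normal(size=d_a),
                              rng.normal(size=d_a))
alpha = np.exp(e - e.max())
alpha /= alpha.sum()  # softmax over positions gives the alignment
```

Softmax-normalizing the energies over $j$ yields the alignment $\alpha_i$ that weights the encoder outputs.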

where $w$ and $b$ are vectors and $W$ and $V$ are matrices. We extend this to be location-aware by making it take into account the alignment produced at the previous step. First, we extract $k$ vectors $f_{i,j} \in \mathbb{R}^{k}$ for every position $j$ of the previous alignment $\alpha_{i-1}$ by convolving it with a matrix $F \in \mathbb{R}^{k \times r}$:

$$f_{i} = F * \alpha_{i-1}$$
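The convolution above can be sketched with `np.convolve`, treating $F$ as a bank of $k$ one-dimensional filters of width $r$. The "same" padding choice (so each input position gets a feature vector) is an implementation assumption:

```python
import numpy as np

def location_features(alpha_prev, F):
    """Convolve the previous alignment with k filters of width r.

    alpha_prev: (T,)   previous attention weights alpha_{i-1}
    F:          (k, r) filter bank
    Returns f_i: (T, k), one k-dim feature vector f_{i,j} per position j.
    """
    # 'same' mode pads so the output length matches the input length T
    return np.stack([np.convolve(alpha_prev, F[m], mode="same")
                     for m in range(F.shape[0])], axis=1)

# toy example: a uniform alignment over T=6 positions, k=3 filters of width r=5
alpha_prev = np.ones(6) / 6
f = location_features(alpha_prev, np.ones((3, 5)))  # shape (6, 3)
```

Each row $f_{i,j}$ summarizes where attention mass sat near position $j$ at the previous step, which is what lets the scorer prefer monotonic forward movement.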

These additional vectors $f_{i,j}$ are then used by the scoring mechanism $e_{i,j}$:

$$e_{i,j} = w^{T}\tanh\left(Ws_{i-1} + Vh_{j} + Uf_{i,j} + b\right)$$
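Putting both pieces together, one step of location-sensitive attention can be sketched as follows. This is a minimal numpy illustration of the equations above, with assumed weight shapes (the projection $U$ maps the $k$ location features into the same attention dimension as $W$ and $V$):

```python
import numpy as np

def location_sensitive_scores(s_prev, H, alpha_prev, W, V, U, w, b, F):
    """One attention step: previous alignment in, new alignment out.

    s_prev:     (T-independent) decoder state s_{i-1}, shape (d_s,)
    H:          (T, d_h) encoder outputs h_j
    alpha_prev: (T,)     previous alignment alpha_{i-1}
    W: (d_a, d_s), V: (d_a, d_h), U: (d_a, k), w: (d_a,), b: (d_a,)
    F: (k, r) location filters
    """
    # f_i = F * alpha_{i-1}: a k-dim location feature per position j
    f = np.stack([np.convolve(alpha_prev, F[m], mode="same")
                  for m in range(F.shape[0])], axis=1)       # (T, k)
    # e_{i,j} = w^T tanh(W s_{i-1} + V h_j + U f_{i,j} + b)
    e = np.tanh(s_prev @ W.T + H @ V.T + f @ U.T + b) @ w    # (T,)
    # softmax over positions gives the new alignment alpha_i
    alpha = np.exp(e - e.max())
    return alpha / alpha.sum()
```

At decoding time the returned $\alpha_i$ both weights the encoder outputs into a context vector and is fed back as `alpha_prev` for the next step, which is how the cumulative location information propagates.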

Papers Using This Method

Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems (2024-09-04)
An overview of text-to-speech systems and media applications (2023-10-22)
Energy-Based Models For Speech Synthesis (2023-10-19)
Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration (2023-05-25)
ArmanTTS single-speaker Persian dataset (2023-04-07)
Facial Landmark Predictions with Applications to Metaverse (2022-09-29)
Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention (2022-01-25)
ITAcotron 2: Transfering English Speech Synthesis Architectures and Speech Features to Italian (2021-11-01)
Neural Sequence-to-Sequence Speech Synthesis Using a Hidden Semi-Markov Model Based Structured Attention Mechanism (2021-08-31)
Neural HMMs are all you need (for high-quality attention-free TTS) (2021-08-30)
Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis (2021-06-15)
VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention (2021-02-12)
Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (2021-01-01)
Using previous acoustic context to improve Text-to-Speech synthesis (2020-12-07)
Learning Speaker Embedding from Text-to-Speech (2020-10-21)
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (2020-10-08)
SpeedySpeech: Efficient Neural Speech Synthesis (2020-08-09)
One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech (2020-08-03)
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020-05-12)
Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (2020-02-06)