SortCut Sinkhorn Attention is a variant of Sparse Sinkhorn Attention in which the input sequence is truncated after sorting, which amounts to a hard top-k operation on the input sequence blocks within the computational graph. While most attention models only re-weight the input or assign near-zero weights during training, this variant explicitly and dynamically truncates the input sequence. Specifically:

$Y = \text{Softmax}\left(Q\,\psi_{S}\left(K\right)^{T}_{\left[:n\right]}\right)\psi_{S}\left(V\right)_{\left[:n\right]}$

where $n$ is the SortCut budget hyperparameter.
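
The truncation in the formula can be illustrated with a minimal NumPy sketch. It makes several assumptions that are not stated in the text above: $\psi_{S}$ is replaced by a fixed hard permutation of key/value blocks (`block_order`) instead of a learned Sinkhorn sorting, the budget is counted in blocks (`n_blocks_kept`), and the function and parameter names are hypothetical. No scaling factor is applied to the scores, since the formula shows none.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sortcut_attention(Q, K, V, block_order, block_size, n_blocks_kept):
    """Illustrative SortCut-style attention (not a reference implementation).

    Q: (Lq, d) queries; K, V: (Lk, d) keys and values, Lk divisible by block_size.
    block_order: permutation of block indices, a hard stand-in for psi_S.
    n_blocks_kept: budget n, assumed here to be measured in blocks.
    """
    Lk, d = K.shape
    num_blocks = Lk // block_size

    # Split keys and values into contiguous blocks.
    Kb = K.reshape(num_blocks, block_size, d)
    Vb = V.reshape(num_blocks, block_size, d)

    # psi_S(K), psi_S(V): reorder the blocks (assumed hard permutation).
    Kb_sorted = Kb[block_order]
    Vb_sorted = Vb[block_order]

    # [:n]: keep only the first n sorted blocks, then flatten back to tokens.
    K_cut = Kb_sorted[:n_blocks_kept].reshape(-1, d)
    V_cut = Vb_sorted[:n_blocks_kept].reshape(-1, d)

    # Y = Softmax(Q psi_S(K)^T_[:n]) psi_S(V)_[:n]
    return softmax(Q @ K_cut.T, axis=-1) @ V_cut


# Usage: 4 queries attend to the single highest-ranked block out of 3.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(12, 8))
V = rng.normal(size=(12, 8))
Y = sortcut_attention(Q, K, V, block_order=np.array([2, 0, 1]),
                      block_size=4, n_blocks_kept=1)
print(Y.shape)  # (4, 8)
```

In this sketch the score matrix has shape (Lq, n · block_size) rather than (Lq, Lk), so the cost of attention depends on the budget and not on the full key length.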
