Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
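To make the first technique concrete, below is a minimal NumPy sketch of the angular LSH scheme the paper uses: a vector $x$ is hashed to $\arg\max([xR; -xR])$ for a random projection $R$, so nearby queries and keys tend to land in the same bucket, and attention can then be restricted to within-bucket pairs after sorting by bucket. The function name `lsh_hash` and the small demo are illustrative choices of ours, not taken from the paper's codebase.

```python
import numpy as np

def lsh_hash(vecs, n_buckets, rng):
    """Angular LSH: hash each row of `vecs` to argmax([xR; -xR]).

    Queries and keys must share the same random projection R, so in a
    real model R would be drawn once per hashing round. `n_buckets`
    must be even.
    """
    d = vecs.shape[-1]
    R = rng.normal(size=(d, n_buckets // 2))        # random projection
    rotated = vecs @ R                              # [L, n_buckets/2]
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(8,))
near = q + 0.01 * rng.normal(size=(8,))   # almost-identical vector
far = rng.normal(size=(8,))               # unrelated vector
buckets = lsh_hash(np.stack([q, near, far]), n_buckets=16, rng=rng)
# With high probability, `q` and `near` share a bucket while `far`
# usually does not; attending only within buckets is what reduces the
# cost from O(L^2) to roughly O(L log L).
print(buckets)
```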
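The second technique swaps standard residuals for reversible ones (Gomez et al., 2017): with two sublayers $F$ and $G$, the layer computes $y_1 = x_1 + F(x_2)$ and $y_2 = x_2 + G(y_1)$, and the inputs can be recomputed exactly from the outputs during the backward pass, so per-layer activations need not be stored. A minimal sketch, with toy sublayers standing in for the attention and feed-forward blocks:

```python
import numpy as np

class ReversibleBlock:
    """Reversible residual layer: inputs are exactly recoverable from
    outputs, so activations need not be kept for backpropagation."""

    def __init__(self, f, g):
        self.f, self.g = f, g  # e.g. attention and feed-forward sublayers

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Run the residual updates backwards to recover the inputs.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Demo with toy deterministic sublayers: inputs are recovered exactly.
rng = np.random.default_rng(0)
W_f, W_g = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
block = ReversibleBlock(lambda x: np.tanh(x @ W_f),
                        lambda x: np.tanh(x @ W_g))
x1, x2 = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
y1, y2 = block.forward(x1, x2)
r1, r2 = block.inverse(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```

Because the inverse is exact, memory during training no longer scales with the number of layers $N$, which is the saving the abstract refers to.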
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | ImageNet 64x64 | Bits per dim | 3.71 | Reformer (12 layers) |
| Image Generation | ImageNet 64x64 | Bits per dim | 3.74 | Reformer (6 layers) |
| Question Answering | Quasar-T | EM | 53.2 | Locality-Sensitive Hashing |
| Question Answering | Natural Questions (long) | F1 | 75.5 | Locality-Sensitive Hashing |
| Question Answering | SearchQA | EM | 66 | Locality-Sensitive Hashing |
| Language Modelling | WikiText-103 | Test perplexity | 26 | Reformer 125M |
| MuJoCo Games | D4RL | Average Reward | 63.9 | Reformer |