Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
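To make the first technique concrete, below is a minimal NumPy sketch of the angular LSH scheme the paper uses: a vector $x$ is hashed to $\arg\max([xR; -xR])$ for a random projection $R$, so nearby queries and keys tend to land in the same bucket, and attention can then be restricted to within-bucket pairs after sorting by bucket. The function name `lsh_hash` and the small demo are illustrative choices of ours, not taken from the paper's codebase.

```python
import numpy as np

def lsh_hash(vecs, n_buckets, rng):
    """Angular LSH: hash each row of `vecs` to argmax([xR; -xR]).

    Queries and keys must share the same random projection R, so in a
    real model R would be drawn once per hashing round. `n_buckets`
    must be even.
    """
    d = vecs.shape[-1]
    R = rng.normal(size=(d, n_buckets // 2))        # random projection
    rotated = vecs @ R                              # [L, n_buckets/2]
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(8,))
near = q + 0.01 * rng.normal(size=(8,))   # almost-identical vector
far = rng.normal(size=(8,))               # unrelated vector
buckets = lsh_hash(np.stack([q, near, far]), n_buckets=16, rng=rng)
# With high probability, `q` and `near` share a bucket while `far`
# usually does not; attending only within buckets is what reduces the
# cost from O(L^2) to roughly O(L log L).
print(buckets)
```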
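The second technique swaps standard residuals for reversible ones (Gomez et al., 2017): with two sublayers $F$ and $G$, the layer computes $y_1 = x_1 + F(x_2)$ and $y_2 = x_2 + G(y_1)$, and the inputs can be recomputed exactly from the outputs during the backward pass, so per-layer activations need not be stored. A minimal sketch, with toy sublayers standing in for the attention and feed-forward blocks:

```python
import numpy as np

class ReversibleBlock:
    """Reversible residual layer: inputs are exactly recoverable from
    outputs, so activations need not be kept for backpropagation."""

    def __init__(self, f, g):
        self.f, self.g = f, g  # e.g. attention and feed-forward sublayers

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Run the residual updates backwards to recover the inputs.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Demo with toy deterministic sublayers: inputs are recovered exactly.
rng = np.random.default_rng(0)
W_f, W_g = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
block = ReversibleBlock(lambda x: np.tanh(x @ W_f),
                        lambda x: np.tanh(x @ W_g))
x1, x2 = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
y1, y2 = block.forward(x1, x2)
r1, r2 = block.inverse(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```

Because the inverse is exact, memory during training no longer scales with the number of layers $N$, which is the saving the abstract refers to.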
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | ImageNet 64x64 | Bits per dim | 3.71 | Reformer (12 layers) |
| Image Generation | ImageNet 64x64 | Bits per dim | 3.74 | Reformer (6 layers) |
| Question Answering | Quasar-T | EM | 53.2 | Locality-Sensitive Hashing |
| Question Answering | Natural Questions (long) | F1 | 75.5 | Locality-Sensitive Hashing |
| Question Answering | SearchQA | EM | 66 | Locality-Sensitive Hashing |
| Language Modelling | WikiText-103 | Test perplexity | 26 | Reformer 125M |
| MuJoCo Games | D4RL | Average Reward | 63.9 | Reformer |