Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

2019-04-23 · Preprint 2019 · Tasks: Question Answering, Open-Domain Question Answering, Image Generation, Language Modelling
Links: Paper · PDF · Code (official and community implementations)

Abstract

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
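The factorizations the abstract refers to split full causal attention into sparse patterns in which each position attends to only O(√n) others, so total work drops from O(n²) to O(n√n). Below is a minimal sketch of the strided pattern from the paper, assuming stride ≈ √n: each position attends to the previous `stride` positions (local) plus every `stride`-th earlier position (strided). The function name and dense boolean mask are illustrative only; the paper's actual implementation uses fused GPU kernels that never materialize the full mask.

```python
import numpy as np

def strided_mask(n: int, stride: int) -> np.ndarray:
    """Causal strided sparse attention mask (a sketch of the pattern
    described in the paper, not the released kernels).

    Position i may attend to position j <= i if either:
      - j is within the last `stride` positions (local component), or
      - (i - j) is a multiple of `stride` (strided component).
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):  # causal: only j <= i
            local = (i - j) < stride
            strided = (i - j) % stride == 0
            mask[i, j] = local or strided
    return mask

mask = strided_mask(16, stride=4)
# With stride ~ sqrt(n), each row has O(sqrt(n)) True entries,
# giving O(n * sqrt(n)) attention cost overall.
print(mask.sum(axis=1))
```

Each query row attends to roughly `2 * stride` keys rather than all `i + 1`, which is where the O(n√n) bound comes from when `stride` is set near √n.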

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | ImageNet 64x64 | Bits per dim | 3.44 | Sparse Transformer 59M (strided) |
| Question Answering | Quasar-T | EM | 52.1 | Sparse Attention |
| Question Answering | Natural Questions (long) | F1 | 74.5 | Sparse Attention |
| Question Answering | SearchQA | EM | 64.7 | Sparse Attention |
| Language Modelling | enwik8 | Bits per character (BPC) | 0.99 | Sparse Transformer (30 layers, fixed attn) |
| Audio Generation | Classical music, 5 seconds at 12 kHz | Bits per byte | 1.97 | Sparse Transformer 152M (strided) |
| Open-Domain Question Answering | SearchQA | EM | 64.7 | Sparse Attention |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
- Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
- FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)