Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

2019-04-23 · Preprint 2019 · Tasks: Question Answering, Open-Domain Question Answering, Image Generation, Language Modelling
Links: Paper · PDF · Code (official and community implementations)

Abstract

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
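The factorizations the abstract refers to split full causal attention into sparse patterns in which each position attends to only O(√n) others, so total work drops from O(n²) to O(n√n). Below is a minimal sketch of the strided pattern from the paper, assuming stride ≈ √n: each position attends to the previous `stride` positions (local) plus every `stride`-th earlier position (strided). The function name and dense boolean mask are illustrative only; the paper's actual implementation uses fused GPU kernels that never materialize the full mask.

```python
import numpy as np

def strided_mask(n: int, stride: int) -> np.ndarray:
    """Causal strided sparse attention mask (a sketch of the pattern
    described in the paper, not the released kernels).

    Position i may attend to position j <= i if either:
      - j is within the last `stride` positions (local component), or
      - (i - j) is a multiple of `stride` (strided component).
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):  # causal: only j <= i
            local = (i - j) < stride
            strided = (i - j) % stride == 0
            mask[i, j] = local or strided
    return mask

mask = strided_mask(16, stride=4)
# With stride ~ sqrt(n), each row has O(sqrt(n)) True entries,
# giving O(n * sqrt(n)) attention cost overall.
print(mask.sum(axis=1))
```

Each query row attends to roughly `2 * stride` keys rather than all `i + 1`, which is where the O(n√n) bound comes from when `stride` is set near √n.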

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | ImageNet 64x64 | Bits per dim | 3.44 | Sparse Transformer 59M (strided) |
| Question Answering | Quasar-T | EM | 52.1 | Sparse Attention |
| Question Answering | Natural Questions (long) | F1 | 74.5 | Sparse Attention |
| Question Answering | SearchQA | EM | 64.7 | Sparse Attention |
| Language Modelling | enwik8 | Bits per character (BPC) | 0.99 | Sparse Transformer (30 layers, fixed attn) |
| Audio Generation | Classical music, 5 seconds at 12 kHz | Bits per byte | 1.97 | Sparse Transformer 152M (strided) |
| Open-Domain Question Answering | SearchQA | EM | 64.7 | Sparse Attention |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
- Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
- FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)