Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov

2019-01-09 · ACL 2019 · Language Modelling
Paper · PDF · Code (official implementation plus community implementations)

Abstract

Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture, Transformer-XL, that enables learning dependencies beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependencies, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependencies that are 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwik8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.
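
To make the segment-level recurrence idea from the abstract concrete, here is a minimal PyTorch sketch, not the authors' released code: hidden states from the previous segment are cached with gradients stopped and reused as extra attention context for the current segment. The class name RecurrentSegmentAttention, the mem_len parameter, and the single-head, unmasked attention are simplifications introduced for illustration; the real model additionally uses relative positional encodings, multiple heads, and per-layer memories.

```python
from typing import Optional

import torch
import torch.nn as nn


class RecurrentSegmentAttention(nn.Module):
    """Hypothetical single-head attention over [cached memory ; current segment]."""

    def __init__(self, d_model: int, mem_len: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.mem_len = mem_len
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, memory: Optional[torch.Tensor]):
        # x:      (batch, seg_len, d_model)  -- the current segment
        # memory: (batch, mem_len, d_model)  -- cached states from earlier segments
        context = x if memory is None else torch.cat([memory, x], dim=1)
        q = self.q_proj(x)        # queries come only from the new segment
        k = self.k_proj(context)  # keys/values also cover the cached memory
        v = self.v_proj(context)
        attn = torch.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)
        out = attn @ v
        # New memory: the most recent mem_len states, detached so gradients
        # never flow across segment boundaries (segment-level recurrence).
        new_memory = context[:, -self.mem_len:].detach()
        return out, new_memory


# Usage: walk over a long sequence segment by segment, carrying the memory along,
# so each segment can attend beyond its own fixed-length window.
layer = RecurrentSegmentAttention(d_model=64, mem_len=16)
memory = None
long_sequence = torch.randn(2, 64, 64)          # (batch, time, d_model)
for segment in long_sequence.split(16, dim=1):  # 16-token segments
    out, memory = layer(segment, memory)
```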

Results

Task | Dataset | Metric | Value | Model
Language Modelling | Penn Treebank (Word Level) | Test perplexity | 54.55 | Transformer-XL
Language Modelling | Penn Treebank (Word Level) | Validation perplexity | 56.72 | Transformer-XL
Language Modelling | WikiText-103 | Test perplexity | 18.3 | Transformer-XL Large
Language Modelling | WikiText-103 | Validation perplexity | 18.2 | Transformer-XL Large
Language Modelling | WikiText-103 | Test perplexity | 24 | Transformer-XL Standard
Language Modelling | WikiText-103 | Validation perplexity | 23.1 | Transformer-XL Standard
Language Modelling | Text8 | Bit per Character (BPC) | 1.08 | Transformer-XL - 24 layers
Language Modelling | Hutter Prize | Bit per Character (BPC) | 0.99 | 24-layer Transformer-XL
Language Modelling | Hutter Prize | Bit per Character (BPC) | 1.03 | 18-layer Transformer-XL
Language Modelling | Hutter Prize | Bit per Character (BPC) | 1.06 | 12-layer Transformer-XL
Language Modelling | One Billion Word | PPL | 21.8 | Transformer-XL Large
Language Modelling | One Billion Word | PPL | 23.5 | Transformer-XL Base
Language Modelling | enwik8 | Bit per Character (BPC) | 0.99 | Transformer-XL (24 layers)
Language Modelling | enwik8 | Bit per Character (BPC) | 1.03 | Transformer-XL (18 layers)
Language Modelling | enwik8 | Bit per Character (BPC) | 1.06 | Transformer-XL (12 layers)
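
The table mixes two metrics: the word-level benchmarks (Penn Treebank, WikiText-103, One Billion Word) report perplexity, while the character-level ones (enwik8 / Hutter Prize, Text8) report bits per character. Both are deterministic transforms of the model's mean cross-entropy loss; the small sketch below (mine, not from the page) shows the conversions, assuming the loss is measured in nats per token or character.

```python
import math


def perplexity(loss_nats_per_token: float) -> float:
    """Word-level perplexity: exp of the mean cross-entropy measured in nats."""
    return math.exp(loss_nats_per_token)


def bits_per_character(loss_nats_per_char: float) -> float:
    """Bits per character: the same mean cross-entropy converted from nats to bits."""
    return loss_nats_per_char / math.log(2)


# Example: a test perplexity of 18.3 on WikiText-103 corresponds to a mean
# cross-entropy of ln(18.3) ≈ 2.91 nats per word, and a 0.99 BPC result on
# enwik8 corresponds to 0.99 * ln(2) ≈ 0.69 nats per character.
print(round(perplexity(math.log(18.3)), 1))               # 18.3
print(round(bits_per_character(0.99 * math.log(2)), 2))   # 0.99
```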

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)