Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Language Modelling on enwik8

Metric: Bits per Character (BPC) (lower is better)
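
For anyone reproducing these numbers: BPC is the average negative log-likelihood, in base 2, that a model assigns to each character of the held-out stream, i.e. cross-entropy in nats divided by ln 2. Below is a minimal sketch of the computation (the function name and inputs are illustrative, not from any listed paper):

```python
import math

def bits_per_character(log_probs_nats):
    """log_probs_nats: natural-log probability the model assigned to each
    ground-truth character of the evaluation stream, one entry per character."""
    n = len(log_probs_nats)
    mean_nll_nats = -sum(log_probs_nats) / n  # mean cross-entropy in nats
    return mean_nll_nats / math.log(2)        # convert nats -> bits

# Sanity check: a uniform model over 256 byte values scores exactly 8 BPC.
uniform = [math.log(1 / 256)] * 1000
print(bits_per_character(uniform))  # 8.0 (up to float rounding)
```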

Results

The table below is sorted by BPC in descending order, so the strongest results appear last.

| # | Model | Bits per Character (BPC) | Extra Data | Paper | Date | Code |
|---:|---|---:|---|---|---|---|
| 1 | LSTM (7 layers) | 1.67 | No | Generating Sequences With Recurrent Neural Networks | 2013-08-04 | Code |
| 2 | Hypernetworks | 1.34 | No | HyperNetworks | 2016-09-27 | Code |
| 3 | SHA-LSTM (4 layers, h=1024, no attention head) | 1.33 | No | Single Headed Attention RNN: Stop Thinking With Your Head | 2019-11-26 | Code |
| 4 | LN HM-LSTM | 1.32 | No | Hierarchical Multiscale Recurrent Neural Networks | 2016-09-06 | Code |
| 5 | ByteNet | 1.31 | No | Neural Machine Translation in Linear Time | 2016-10-31 | Code |
| 6 | Recurrent Highway Networks | 1.27 | No | Recurrent Highway Networks | 2016-07-12 | Code |
| 7 | Large FS-LSTM-4 | 1.25 | No | Fast-Slow Recurrent Neural Networks | 2017-05-24 | Code |
| 8 | Large mLSTM | 1.24 | No | Multiplicative LSTM for sequence modelling | 2016-09-26 | Code |
| 9 | AWD-LSTM (3 layers) | 1.232 | No | An Analysis of Neural Language Modeling at Multiple Scales | 2018-03-22 | Code |
| 10 | Cluster-Former (#C=512) | 1.22 | No | Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding | 2020-09-13 | - |
| 11 | LSTM | 1.195 | No | Mogrifier LSTM | 2019-09-04 | Code |
| 12 | Mogrifier LSTM | 1.146 | No | Mogrifier LSTM | 2019-09-04 | Code |
| 13 | 64-layer Character Transformer Model | 1.11 | No | Character-Level Language Modeling with Deeper Self-Attention | 2018-08-09 | Code |
| 14 | SHA-RNN (4 layers, h=1024, single attention head) | 1.076 | No | Single Headed Attention RNN: Stop Thinking With Your Head | 2019-11-26 | Code |
| 15 | SHA-RNN (4 layers, h=1024, attention head per layer) | 1.068 | No | Single Headed Attention RNN: Stop Thinking With Your Head | 2019-11-26 | Code |
| 16 | Transformer (64 layers) | 1.06 | No | Character-Level Language Modeling with Deeper Self-Attention | 2018-08-09 | Code |
| 17 | Transformer-XL (12 layers) | 1.06 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 18 | Skip Cross-Head Transformer-XL | 1.033 | No | Memory-efficient Stochastic methods for Memory-based Transformers | 2023-11-14 | Code |
| 19 | Transformer-XL (18 layers) | 1.03 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 20 | Transformer+SSA | 1.024 | No | The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | 2023-06-02 | Code |
| 21 | Transformer (12 layers, 8k adaptive span) | 1.02 | No | Adaptive Attention Span in Transformers | 2019-05-19 | Code |
| 22 | BP-Transformer (12 layers) | 1.02 | No | BP-Transformer: Modelling Long-Range Context via Binary Partitioning | 2019-11-11 | Code |
| 23 | All-attention network (18 layers) | 1.01 | No | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Code |
| 24 | Longformer (12 layers, h=512) | 1.00 | No | Longformer: The Long-Document Transformer | 2020-04-10 | Code |
| 25 | Hourglass | 0.997 | No | Hierarchical Transformers Are More Efficient Language Models | 2021-10-26 | Code |
| 26 | Transformer-XL (24 layers) | 0.99 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 27 | Longformer (30 layers, h=512) | 0.99 | No | Longformer: The Long-Document Transformer | 2020-04-10 | Code |
| 28 | Sparse Transformer (30 layers, fixed attn) | 0.99 | No | Generating Long Sequences with Sparse Transformers | 2019-04-23 | Code |
| 29 | Routing Transformer (12 layers) | 0.99 | No | Efficient Content-Based Sparse Attention with Routing Transformers | 2020-03-12 | Code |
| 30 | Transformer-LS (small) | 0.99 | No | Long-Short Transformer: Efficient Transformers for Language and Vision | 2021-07-05 | Code |
| 31 | Transformer (24 layers, 8k adaptive span) | 0.98 | No | Adaptive Attention Span in Transformers | 2019-05-19 | Code |
| 32 | Compressive Transformer (24 layers) | 0.97 | No | Compressive Transformers for Long-Range Sequence Modelling | 2019-11-13 | Code |
| 33 | Transformer-LS (large) | 0.97 | No | Long-Short Transformer: Efficient Transformers for Language and Vision | 2021-07-05 | Code |
| 34 | SRU++ Base | 0.97 | No | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Code |
| 35 | Sandwich Transformer (adaptive span) | 0.968 | No | Improving Transformer Models by Reordering their Sublayers | 2019-11-10 | Code |
| 36 | Feedback Transformer | 0.96 | No | Addressing Some Limitations of Transformers with Feedback Memory | 2020-02-21 | Code |
| 37 | Expire-Span (24 layers) | 0.95 | No | Not All Memories are Created Equal: Learning to Forget by Expiring | 2021-05-13 | Code |
| 38 | SRU++ Large | 0.95 | No | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Code |
| 39 | Transformer-XL (24 layers, RMS dynamic eval, decay) | 0.94 | Yes | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 40 | Focus | 0.94 | No | Focus Your Attention (with Adaptive IIR Filters) | 2023-05-24 | - |
| 41 | GPT-2 (48 layers, h=1600) | 0.93 | Yes | - | - | Code |
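
For context on the dataset itself: enwik8 is the first 100 million bytes of an English Wikipedia XML dump, distributed for the Hutter Prize. Most papers above train on the first 90M bytes and hold out 5M each for validation and test. A minimal sketch of that conventional setup follows (the download URL and 90/5/5 split reflect common usage, e.g. Transformer-XL; verify against the specific paper you are reproducing):

```python
import io
import urllib.request
import zipfile

URL = "http://mattmahoney.net/dc/enwik8.zip"  # Hutter Prize distribution

# Download the archive and pull out the single 100,000,000-byte member.
raw = urllib.request.urlopen(URL).read()
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    data = zf.read("enwik8")

# Conventional 90M/5M/5M byte-level split used by most entries above.
n_heldout = 5_000_000
train = data[: len(data) - 2 * n_heldout]
valid = data[len(train) : len(train) + n_heldout]
test = data[len(train) + n_heldout :]
print(len(train), len(valid), len(test))  # 90000000 5000000 5000000
```

Most entries model the raw byte stream directly, which keeps the BPC column comparable across architectures regardless of tokenization.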