Language Modelling on enwik8
Metric: Bits per Character (BPC) (lower is better)
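As a reading aid for the metric, here is a minimal sketch of how BPC is commonly computed: the average negative base-2 log-probability a model assigns to each true character. The function names and the toy probabilities are illustrative assumptions, not taken from any paper listed on this page, and individual papers may differ in evaluation details.

```python
import math

def bits_per_character(char_probs):
    """Mean negative log2-probability assigned to each true character.

    char_probs is a hypothetical input: the probability the model gave
    to the character that actually occurred at each position.
    """
    if not char_probs:
        raise ValueError("need at least one character probability")
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

def nats_to_bpc(mean_nll_nats):
    """Convert a mean per-character cross-entropy from nats to bits."""
    return mean_nll_nats / math.log(2)

# A model that spreads probability uniformly over 256 byte values
# pays 8 bits for every character.
uniform_bpc = bits_per_character([1 / 256] * 4)

# A training loss reported in nats converts by dividing by ln(2).
converted_bpc = nats_to_bpc(math.log(2))
```

A lower value means the model compresses the text better, which is why smaller BPC numbers are the stronger results.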
Results
Rows follow the source order, from highest BPC to lowest. Because lower BPC is better, the best result is the last row.

| # | Model | Bits per Character (BPC) | Extra Data | Paper | Date | Code |
|---|-------|--------------------------|------------|-------|------|------|
| 1 | LSTM (7 layers) | 1.67 | No | Generating Sequences With Recurrent Neural Networks | 2013-08-04 | Yes |
| 2 | Hypernetworks | 1.34 | No | HyperNetworks | 2016-09-27 | Yes |
| 3 | SHA-LSTM (4 layers, h=1024, no attention head) | 1.33 | No | Single Headed Attention RNN: Stop Thinking With Your Head | 2019-11-26 | Yes |
| 4 | LN HM-LSTM | 1.32 | No | Hierarchical Multiscale Recurrent Neural Networks | 2016-09-06 | Yes |
| 5 | ByteNet | 1.31 | No | Neural Machine Translation in Linear Time | 2016-10-31 | Yes |
| 6 | Recurrent Highway Networks | 1.27 | No | Recurrent Highway Networks | 2016-07-12 | Yes |
| 7 | Large FS-LSTM-4 | 1.25 | No | Fast-Slow Recurrent Neural Networks | 2017-05-24 | Yes |
| 8 | Large mLSTM | 1.24 | No | Multiplicative LSTM for sequence modelling | 2016-09-26 | Yes |
| 9 | AWD-LSTM (3 layers) | 1.232 | No | An Analysis of Neural Language Modeling at Multiple Scales | 2018-03-22 | Yes |
| 10 | Cluster-Former (#C=512) | 1.22 | No | Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding | 2020-09-13 | No |
| 11 | LSTM | 1.195 | No | Mogrifier LSTM | 2019-09-04 | Yes |
| 12 | Mogrifier LSTM | 1.146 | No | Mogrifier LSTM | 2019-09-04 | Yes |
| 13 | 64-layer Character Transformer Model | 1.11 | No | Character-Level Language Modeling with Deeper Self-Attention | 2018-08-09 | Yes |
| 14 | SHA-RNN (4 layers, h=1024, single attention head) | 1.076 | No | Single Headed Attention RNN: Stop Thinking With Your Head | 2019-11-26 | Yes |
| 15 | SHA-RNN (4 layers, h=1024, attention head per layer) | 1.068 | No | Single Headed Attention RNN: Stop Thinking With Your Head | 2019-11-26 | Yes |
| 16 | Transformer (64 layers) | 1.06 | No | Character-Level Language Modeling with Deeper Self-Attention | 2018-08-09 | Yes |
| 17 | Transformer-XL (12 layers) | 1.06 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Yes |
| 18 | Skip Cross-Head Transformer-XL | 1.033 | No | Memory-efficient Stochastic methods for Memory-based Transformers | 2023-11-14 | Yes |
| 19 | Transformer-XL (18 layers) | 1.03 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Yes |
| 20 | Transformer+SSA | 1.024 | No | The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | 2023-06-02 | Yes |
| 21 | Transformer (12 layers, 8k adaptive span) | 1.02 | No | Adaptive Attention Span in Transformers | 2019-05-19 | Yes |
| 22 | BP-Transformer (12 layers) | 1.02 | No | BP-Transformer: Modelling Long-Range Context via Binary Partitioning | 2019-11-11 | Yes |
| 23 | All-attention network (18 layers) | 1.01 | No | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Yes |
| 24 | Longformer (12 layers, h=512) | 1.00 | No | Longformer: The Long-Document Transformer | 2020-04-10 | Yes |
| 25 | Hourglass | 0.997 | No | Hierarchical Transformers Are More Efficient Language Models | 2021-10-26 | Yes |
| 26 | Transformer-XL (24 layers) | 0.99 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Yes |
| 27 | Longformer (30 layers, h=512) | 0.99 | No | Longformer: The Long-Document Transformer | 2020-04-10 | Yes |
| 28 | Sparse Transformer (30 layers, fixed attn) | 0.99 | No | Generating Long Sequences with Sparse Transformers | 2019-04-23 | Yes |
| 29 | Routing Transformer (12 layers) | 0.99 | No | Efficient Content-Based Sparse Attention with Routing Transformers | 2020-03-12 | Yes |
| 30 | Transformer-LS (small) | 0.99 | No | Long-Short Transformer: Efficient Transformers for Language and Vision | 2021-07-05 | Yes |
| 31 | Transformer (24 layers, 8k adaptive span) | 0.98 | No | Adaptive Attention Span in Transformers | 2019-05-19 | Yes |
| 32 | Compressive Transformer (24 layers) | 0.97 | No | Compressive Transformers for Long-Range Sequence Modelling | 2019-11-13 | Yes |
| 33 | Transformer-LS (large) | 0.97 | No | Long-Short Transformer: Efficient Transformers for Language and Vision | 2021-07-05 | Yes |
| 34 | SRU++ Base | 0.97 | No | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Yes |
| 35 | Sandwich Transformer (adaptive span) | 0.968 | No | Improving Transformer Models by Reordering their Sublayers | 2019-11-10 | Yes |
| 36 | Feedback Transformer | 0.96 | No | Addressing Some Limitations of Transformers with Feedback Memory | 2020-02-21 | Yes |
| 37 | Expire-Span (24 layers) | 0.95 | No | Not All Memories are Created Equal: Learning to Forget by Expiring | 2021-05-13 | Yes |
| 38 | SRU++ Large | 0.95 | No | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Yes |
| 39 | Transformer-XL (24 layers, RMS dynamic eval, decay) | 0.94 | Yes | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Yes |
| 40 | Focus | 0.94 | No | Focus Your Attention (with Adaptive IIR Filters) | 2023-05-24 | No |
| 41 | GPT-2 (48 layers, h=1600) | 0.93 | Yes | No paper | - | Yes |