Language Modelling on Text8
Metric: Bits per Character (BPC) (lower is better)
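For readers unfamiliar with the metric, here is a minimal sketch of how bits per character is commonly computed: the average negative base-2 log-probability a model assigns to each true character, so a model that predicts each character more confidently scores lower. The function names `bits_per_character` and `nats_to_bpc` are illustrative and not taken from any paper on this leaderboard; the 27-symbol alphabet (a-z plus space) is the Text8 character set.

```python
import math
from typing import Sequence


def bits_per_character(char_probs: Sequence[float]) -> float:
    """Average negative log2-probability assigned to each true character."""
    if not char_probs:
        raise ValueError("need at least one character probability")
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


def nats_to_bpc(mean_cross_entropy_nats: float) -> float:
    """Convert a per-character cross-entropy measured in nats to bits."""
    return mean_cross_entropy_nats / math.log(2)


# Baseline: a uniform guess over the 27 Text8 symbols costs log2(27) bits
# per character, regardless of the text being scored.
uniform_bpc = bits_per_character([1 / 27] * 100)
```

Most training frameworks report cross-entropy in nats, so `nats_to_bpc` is usually the conversion needed before comparing a run against the figures listed here.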
Results
# | Model | Bits per Character (BPC) | Extra Data | Paper | Date | Code
--- | --- | --- | --- | --- | --- | ---
1 | Focus | 0.98 | No | Focus Your Attention (with Adaptive IIR Filters) | 2023-05-24 | -
2 | GPT-2 | 0.98 | Yes | - | - | Code
3 | Transformer-XL + RMS dynamic eval + decay | 1.038 | No | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code
4 | 24L Transformer + 8K adaptive span | 1.07 | No | Adaptive Attention Span in Transformers | 2019-05-19 | Code
5 | All-attention network - 36 layers | 1.08 | No | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Code
6 | Transformer-XL - 24 layers | 1.08 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code
7 | Transformer-LS (small) | 1.09 | No | Long-Short Transformer: Efficient Transformers for Language and Vision | 2021-07-05 | Code
8 | BP-Transformer - 12 Layers | 1.11 | No | BP-Transformer: Modelling Long-Range Context via Binary Partitioning | 2019-11-11 | Code
9 | All-attention network - 18 layers | 1.11 | No | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Code
10 | 12L Transformer + 8K adaptive span | 1.11 | No | Adaptive Attention Span in Transformers | 2019-05-19 | Code
11 | 64-layer Character Transformer Model | 1.13 | No | Character-Level Language Modeling with Deeper Self-Attention | 2018-08-09 | Code
12 | GAM-RHN-10 | 1.157 | No | - | - | Code
13 | PAR Transformer 24B | 1.18 | No | Pay Attention when Required | 2020-09-09 | Code
14 | 12-layer Character Transformer Model | 1.18 | No | Character-Level Language Modeling with Deeper Self-Attention | 2018-08-09 | Code
15 | mLSTM + dynamic eval | 1.19 | No | Dynamic Evaluation of Neural Sequence Models | 2017-09-21 | Code
16 | Bipartite flows (8 flows) | 1.23 | No | Discrete Flows: Invertible Generative Models of Discrete Data | 2019-05-24 | Code
17 | Large mLSTM +emb +WN +VD | 1.27 | No | Multiplicative LSTM for sequence modelling | 2016-09-26 | Code
18 | Large RHN | 1.27 | No | Recurrent Highway Networks | 2016-07-12 | Code
19 | LayerNorm HM-LSTM | 1.29 | No | Hierarchical Multiscale Recurrent Neural Networks | 2016-09-06 | Code
20 | BN LSTM | 1.36 | No | Recurrent Batch Normalization | 2016-03-30 | Code
21 | Unregularised mLSTM | 1.4 | No | Multiplicative LSTM for sequence modelling | 2016-09-26 | Code
22 | BFN | 1.41 | No | Bayesian Flow Networks | 2023-08-14 | Code
23 | td-LSTM-large | 1.49 | No | Architectural Complexity Measures of Recurrent Neural Networks | 2016-02-26 | -
24 | td-LSTM (Zhang et al., 2016) | 1.63 | No | Architectural Complexity Measures of Recurrent Neural Networks | 2016-02-26 | -