Language Modelling on One Billion Word
Metric: PPL (lower is better)
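For readers unfamiliar with the metric, perplexity is commonly computed as the exponential of the average negative log-likelihood per token. The snippet below is a minimal sketch under that assumption. The function name `perplexity` and the toy input are hypothetical, and the page does not state how each paper counts tokens (words vs. subwords), so scores from different rows may not be strictly comparable.

```python
import math

def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood).

    token_log_probs: natural-log probabilities the model assigned
    to each observed token (assumed input format, one float per token).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy example: a model that spreads probability uniformly over 4 choices
# at every step has a perplexity of about 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Lower values mean the model assigns higher probability to the held-out text, which is why the leaderboard ranks the smallest PPL first.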
Results
| # | Model | PPL | Extra Data | SOTA | Paper | Date | Code |
|---|-------|-----|------------|------|-------|------|------|
| 1 | MDLM (AR baseline) | 20.09 | No | SOTA | Simple and Effective Masked Diffusion Language Models | 2024-06-11 | Code |
| 2 | OmniNetT (Large) | 21.5 | No | SOTA | OmniNet: Omnidirectional Representations from Transformers | 2021-03-01 | Code |
| 3 | OmniNetP (Large) | 21.6 | No | SOTA | OmniNet: Omnidirectional Representations from Transformers | 2021-03-01 | Code |
| 4 | Transformer-XL Large | 21.8 | No | SOTA | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 5 | OmniNetB (Large) | 22 | No | | OmniNet: Omnidirectional Representations from Transformers | 2021-03-01 | Code |
| 6 | MDLM | 23 | No | | Simple and Effective Masked Diffusion Language Models | 2024-06-11 | Code |
| 7 | Adaptive Input Very Large | 23.02 | No | SOTA | Adaptive Input Representations for Neural Language Modeling | 2018-09-28 | Code |
| 8 | Transformer-XL Base | 23.5 | No | | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 9 | SRU++ Large | 23.5 | No | | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Code |
| 10 | 10 LSTM+CNN inputs + SNM10-SKIP (ensemble) | 23.7 | No | SOTA | Exploring the Limits of Language Modeling | 2016-02-07 | Code |
| 11 | Adaptive Input Large | 23.91 | No | | Adaptive Input Representations for Neural Language Modeling | 2018-09-28 | Code |
| 12 | Mesh Tensorflow | 24 | No | | Mesh-TensorFlow: Deep Learning for Supercomputers | 2018-11-05 | Code |
| 13 | Cohere Large | 25.06 | No | | - | - | - |
| 14 | SRU++ | 25.1 | No | | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Code |
| 15 | DynamicConv | 26.67 | No | | Pay Less Attention with Lightweight and Dynamic Convolutions | 2019-01-29 | Code |
| 16 | High-Budget MoE | 28 | No | | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | 2017-01-23 | Code |
| 17 | Evolved Transformer Big | 28.6 | No | | The Evolved Transformer | 2019-01-30 | Code |
| 18 | LSTM-8192-1024 + CNN Input | 30 | No | SOTA | Exploring the Limits of Language Modeling | 2016-02-07 | Code |
| 19 | LSTM-8192-1024 | 30.6 | No | SOTA | Exploring the Limits of Language Modeling | 2016-02-07 | Code |
| 20 | GCNN-14 bottleneck | 31.9 | No | | Language Modeling with Gated Convolutional Networks | 2016-12-23 | Code |
| 21 | Low-Budget MoE | 34.1 | No | | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | 2017-01-23 | Code |
| 22 | BIG G-LSTM-2 | 36 | No | | Factorization tricks for LSTM networks | 2017-03-31 | Code |
| 23 | GPT-2 | 42.16 | Yes | | - | - | Code |
| 24 | RNN-1024 + 9 Gram | 51.3 | No | SOTA | One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling | 2013-12-11 | Code |
| 25 | Sparse Non-Negative | 52.9 | No | | Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation | 2014-12-03 | - |