Language Modelling on WikiText-103
Metric: Test perplexity (lower is better)
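For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is commonly computed: the exponential of the average negative log-probability a model assigns to each token. The function name `perplexity` and the toy input are illustrative assumptions. The leaderboard does not state how each entry tokenizes or normalizes its score, so this is not a description of any particular submission's evaluation.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the mean negative natural-log probability per token.

    `token_log_probs` is a hypothetical list of natural-log probabilities,
    one per predicted token, as produced by some language model.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy example: a model that assigns probability 0.25 to every token
# is as uncertain as a uniform choice among 4 options.
uniform_guess = [math.log(0.25)] * 4
print(perplexity(uniform_guess))  # approximately 4
```

Lower values mean the model spreads less probability mass away from the observed text, which is why the table ranks smaller numbers first.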
Results
| # | Model | Test perplexity | Extra data | SOTA badge | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | RETRO (7.5B) | 2.4 | Yes | SOTA | Improving language models by retrieving from trillions of tokens | 2021-12-08 | Code |
| 2 | Hybrid H3 (2.7B) | 10.6 | Yes | - | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 3 | Megatron-LM | 10.81 | Yes | SOTA | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 2019-09-17 | Code |
| 4 | GLM-XXLarge (bidirectional) | 11.33 | Yes | - | GLM: General Language Model Pretraining with Autoregressive Blank Infilling | 2021-03-18 | Code |
| 5 | GLM-XXLarge (unidirectional) | 12.22 | Yes | - | GLM: General Language Model Pretraining with Autoregressive Blank Infilling | 2021-03-18 | Code |
| 6 | Hybrid H3 (1.3B) | 12.5 | Yes | - | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 7 | Ensemble of All | 13.29 | No | - | Advancing State of the Art in Language Modeling | 2023-11-28 | Code |
| 8 | GateLoop (125M) | 13.4 | No | - | GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling | 2023-11-03 | Code |
| 9 | kNN-LM w/ Adaptive Coefficient | 15.5 | No | - | You can't pick your neighbors, or can you? When and how to rely on retrieval in the kNN-LM | 2022-10-28 | Code |
| 10 | kNN-LM w/ Continuous Cache | 15.79 | No | - | Generalization through Memorization: Nearest Neighbor Language Models | 2019-11-01 | Code |
| 11 | Routing Transformer | 15.8 | No | - | Efficient Content-Based Sparse Attention with Routing Transformers | 2020-03-12 | Code |
| 12 | kNN-LM | 16.12 | No | - | Generalization through Memorization: Nearest Neighbor Language Models | 2019-11-01 | Code |
| 13 | Transformer-XL (RMS dynamic eval) | 16.4 | Yes | SOTA | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 14 | ∞-former (SM) | 16.61 | No | - | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 15 | ∞-former (SM) | 16.61 | No | - | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 16 | ∞-former (Sticky memories + initialized GPT-2 Small) | 16.61 | Yes | - | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 17 | ∞-former (initialized GPT-2 Small) | 16.64 | Yes | - | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 18 | Hybrid H3 (355M) | 16.9 | Yes | - | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 19 | Transformer-XL (SGD dynamic eval) | 17 | No | SOTA | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 20 | Compressive Transformer (18L, M=1024) | 17.1 | No | - | Compressive Transformers for Long-Range Sequence Modelling | 2019-11-13 | Code |
| 21 | SRU++ Large | 17.1 | No | - | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Code |
| 22 | SegaTransformer-XL | 17.1 | No | - | Segatron: Segment-Aware Transformer for Language Modeling and Understanding | 2020-04-30 | Code |
| 23 | Transformer+SSA+Self-ensemble | 17.18 | No | - | The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | 2023-06-02 | Code |
| 24 | Transformer-XL Large + Phrase Induction | 17.4 | No | - | Improving Neural Language Models by Segmenting, Attending, and Predicting the Future | 2019-06-04 | Code |
| 25 | GPT-2 Full | 17.48 | Yes | - | - | - | Code |
| 26 | Staged Training | 17.56 | No | - | Shortformer: Better Language Modeling using Shorter Inputs | 2020-12-31 | Code |
| 27 | Transformer+SSA | 17.6 | No | - | The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | 2023-06-02 | Code |
| 28 | Sandwich Transformer | 17.96 | No | - | Improving Transformer Models by Reordering their Sublayers | 2019-11-10 | Code |
| 29 | DIFFQ (λ=1, g=16) | 18 | No | - | Differentiable Model Compression via Pseudo Quantization Noise | 2021-04-20 | Code |
| 30 | Mega | 18.07 | No | - | Mega: Moving Average Equipped Gated Attention | 2022-09-21 | Code |
| 31 | Shortformer | 18.15 | No | - | Shortformer: Better Language Modeling using Shorter Inputs | 2020-12-31 | Code |
| 32 | Feedback Transformer (8 layers) | 18.2 | No | - | Addressing Some Limitations of Transformers with Feedback Memory | 2020-02-21 | Code |
| 33 | SRU++ Base | 18.3 | No | - | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Code |
| 34 | Transformer-XL Large | 18.3 | No | SOTA | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 35 | PAR Transformer Large | 18.4 | No | - | Pay Attention when Required | 2020-09-09 | Code |
| 36 | Perceiver AR 358M | 18.4 | No | - | General-purpose, long-context autoregressive modeling with Perceiver AR | 2022-02-15 | Code |
| 37 | Hyena-3-slim | 18.5 | No | - | Hyena Hierarchy: Towards Larger Convolutional Language Models | 2023-02-21 | Code |
| 38 | Hybrid H3 125M | 18.5 | No | - | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 39 | Hyena-3 | 18.6 | No | - | Hyena Hierarchy: Towards Larger Convolutional Language Models | 2023-02-21 | Code |
| 40 | Transformer (Adaptive inputs) | 18.7 | No | SOTA | Adaptive Input Representations for Neural Language Modeling | 2018-09-28 | Code |
| 41 | T2R + Pretrain | 19.6 | No | - | Finetuning Pretrained Transformers into RNNs | 2021-03-24 | Code |
| 42 | Subformer | 20.39 | No | - | - | - | - |
| 43 | BERT-Large-CAS | 20.4 | No | - | Language Models with Transformers | 2019-04-20 | Code |
| 44 | All-attention network (36 layers) | 20.6 | No | - | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Code |
| 45 | S4 | 21.28 | No | - | Efficiently Modeling Long Sequences with Structured State Spaces | 2021-10-31 | Code |
| 46 | GPT-2 Large | 22.05 | Yes | - | - | - | Code |
| 47 | Feedback Transformer (4 layers) | 22.4 | No | - | Addressing Some Limitations of Transformers with Feedback Memory | 2020-02-21 | Code |
| 48 | PAR Transformer Base | 22.7 | No | - | Pay Attention when Required | 2020-09-09 | Code |
| 49 | Skip Cross-Head Transformer-XL | 22.91 | No | - | Memory-efficient Stochastic methods for Memory-based Transformers | 2023-11-14 | Code |
| 50 | DEQ-Transformer (medium, adaptive embed) | 23.2 | No | - | Deep Equilibrium Models | 2019-09-03 | Code |
| 51 | TaLK Convolutions | 23.3 | No | - | Time-aware Large Kernel Convolutions | 2020-02-08 | Code |
| 52 | Rfa-Gate-Gaussian-Stateful (Big) | 23.5 | No | - | Random Feature Attention | 2021-03-03 | - |
| 53 | Hybrid H3 (125M) | 23.7 | Yes | - | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 54 | Transformer-XL Standard | 24 | No | - | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 55 | DeLighT | 24.14 | No | - | DeLighT: Deep and Light-weight Transformer | 2020-08-03 | Code |
| 56 | ∞-former (Sticky memories) | 24.22 | No | - | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 57 | ∞-former (Sticky memories) | 24.22 | No | - | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 58 | ∞-former (Sticky memories) | 24.22 | No | - | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 59 | Transformer-N | 25.2 | No | - | Revisiting Simple Neural Probabilistic Language Models | 2021-04-08 | Code |
| 60 | Linear Attention 125M | 25.6 | No | - | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | 2020-06-29 | Code |
| 61 | FNetAR Medium | 25.81 | No | - | FNetAR: Mixing Tokens with Autoregressive Fourier Transforms | 2021-07-22 | Code |
| 62 | Reformer 125M | 26 | No | - | Reformer: The Efficient Transformer | 2020-01-13 | Code |
| 63 | GPT-2 Medium | 26.37 | Yes | - | - | - | Code |
| 64 | Performer 125M | 26.8 | No | - | Rethinking Attention with Performers | 2020-09-30 | Code |
| 65 | AdvSoft (+ 4 layer QRNN + dynamic eval) | 28 | No | - | Improving Neural Language Modeling via Adversarial Training | 2019-06-10 | Code |
| 66 | DEQ-TrellisNet | 29 | No | - | Deep Equilibrium Models | 2019-09-03 | Code |
| 67 | Trellis Network | 29.19 | No | - | Trellis Networks for Sequence Modeling | 2018-10-15 | Code |
| 68 | LSTM (Hebbian, Cache, MbPA) | 29.2 | No | SOTA | Fast Parametric Learning with Activation Memorization | 2018-03-27 | - |
| 69 | LSTM (Hebbian, Cache) | 29.7 | No | SOTA | Fast Parametric Learning with Activation Memorization | 2018-03-27 | - |
| 70 | Rfa-Gate-Gaussian-Stateful (Small) | 30.5 | No | - | Random Feature Attention | 2021-03-03 | - |
| 71 | Primal.+Trans. | 31 | No | - | Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation | 2023-05-31 | Code |
| 72 | LSTM (RMC) | 31.6 | No | - | Relational recurrent neural networks | 2018-06-05 | Code |
| 73 | DEQ-Transformer (small) | 32.4 | No | - | Deep Equilibrium Models | 2019-09-03 | Code |
| 74 | AWD-LSTM-MoS + ATOI | 32.85 | No | - | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | 2019-09-18 | Code |
| 75 | 4 layer QRNN | 33 | No | SOTA | An Analysis of Neural Language Modeling at Multiple Scales | 2018-03-22 | Code |
| 76 | LSTM (Hebbian) | 34.3 | No | - | Fast Parametric Learning with Activation Memorization | 2018-03-27 | - |
| 77 | LSTM | 36.4 | No | - | Fast Parametric Learning with Activation Memorization | 2018-03-27 | - |
| 78 | GCNN-8 | 37.2 | No | SOTA | Language Modeling with Gated Convolutional Networks | 2016-12-23 | Code |
| 79 | GPT-2 Small | 37.5 | Yes | - | - | - | Code |
| 80 | Neural cache model (size = 2,000) | 40.8 | No | SOTA | Improving Neural Language Models with a Continuous Cache | 2016-12-13 | Code |
| 81 | Neural cache model (size = 100) | 44.8 | No | SOTA | Improving Neural Language Models with a Continuous Cache | 2016-12-13 | Code |
| 82 | GCNN-8 | 44.9 | No | - | Language Modeling with Gated Convolutional Networks | 2016-12-23 | Code |
| 83 | TCN | 45.19 | No | - | An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling | 2018-03-04 | Code |
| 84 | Temporal CNN | 45.2 | No | - | - | - | - |
| 85 | LSTM | 48.7 | No | SOTA | Improving Neural Language Models with a Continuous Cache | 2016-12-13 | Code |