Language Modelling on WikiText-103
Metric: Validation perplexity (lower is better)
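For reference, a minimal sketch of how validation perplexity relates to token-level negative log-likelihood (lower perplexity means the model assigns higher probability to the held-out text). The function and example values below are illustrative only and are not taken from any of the listed papers:

```python
import math

def perplexity(token_log_probs):
    """Perplexity is the exponential of the average negative
    log-likelihood per token: exp(-(1/N) * sum(log p(x_i | x_<i)))."""
    n = len(token_log_probs)
    avg_nll = -sum(token_log_probs) / n
    return math.exp(avg_nll)

# Example: a model that assigns probability 1/16 to every token
# has perplexity 16, comparable to the mid-table entries below.
print(perplexity([math.log(1 / 16)] * 1000))  # -> 16.0
```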
Leaderboard
| # | Model | Validation perplexity | Extra Data | Paper | Date | Code |
|---|-------|----------------------:|------------|-------|------|------|
| 1 | Ensemble of All | 13.11 | No | Advancing State of the Art in Language Modeling | 2023-11-28 | Code |
| 2 | kNN-LM w/ Adaptive Coefficient | 15.72 | No | You can't pick your neighbors, or can you? When ... | 2022-10-28 | Code |
| 3 | Transformer-XL (RMS dynamic eval) | 15.8 | Yes | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 4 | kNN-LM w/ Continuous Cache | 15.81 | No | Generalization through Memorization: Nearest Nei... | 2019-11-01 | Code |
| 5 | Compressive Transformer (18L, M=1024) | 16 | No | Compressive Transformers for Long-Range Sequence... | 2019-11-13 | Code |
| 6 | kNN-LM | 16.06 | No | Generalization through Memorization: Nearest Nei... | 2019-11-01 | Code |
| 7 | Transformer-XL (SGD dynamic eval) | 16.3 | No | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 8 | SRU++ Large | 16.4 | No | When Attention Meets Fast Recurrence: Training L... | 2021-02-24 | Code |
| 9 | Transformer+SSA+Self-ensemble | 16.54 | No | The Information Pathways Hypothesis: Transformer... | 2023-06-02 | Code |
| 10 | Staged Training | 16.89 | No | Shortformer: Better Language Modeling using Shor... | 2020-12-31 | Code |
| 11 | Transformer+SSA | 16.91 | No | The Information Pathways Hypothesis: Transformer... | 2023-06-02 | Code |
| 12 | Shortformer | 17.47 | No | Shortformer: Better Language Modeling using Shor... | 2020-12-31 | Code |
| 13 | Feedback Transformer (8 layers) | 17.5 | No | Addressing Some Limitations of Transformers with... | 2020-02-21 | Code |
| 14 | SRU++ Base | 17.5 | No | When Attention Meets Fast Recurrence: Training L... | 2021-02-24 | Code |
| 15 | Transformer (Adaptive inputs) | 17.97 | No | Adaptive Input Representations for Neural Langua... | 2018-09-28 | Code |
| 16 | Transformer-XL Large | 18.2 | No | Transformer-XL: Attentive Language Models Beyond... | 2019-01-09 | Code |
| 17 | T2R + Pretrain | 19 | No | Finetuning Pretrained Transformers into RNNs | 2021-03-24 | Code |
| 18 | Transformer (Adaptive inputs) | 19.5 | No | On the adequacy of untuned warmup for adaptive o... | 2019-10-09 | Code |
| 19 | BERT-Large-CAS | 19.6 | No | Language Models with Transformers | 2019-04-20 | Code |
| 20 | All-attention network (36 layers) | 19.7 | No | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Code |
| 21 | Feedback Transformer (4 layers) | 21.4 | No | Addressing Some Limitations of Transformers with... | 2020-02-21 | Code |
| 22 | Skip Cross-Head Transformer-XL | 21.87 | No | Memory-efficient Stochastic methods for Memory-b... | 2023-11-14 | Code |
| 23 | Rfa-Gate-Gaussian-Stateful (Big) | 22 | No | Random Feature Attention | 2021-03-03 | - |
| 24 | Transformer-XL Standard | 23.1 | No | Transformer-XL: Attentive Language Models Beyond... | 2019-01-09 | Code |
| 25 | Transformer-N | 24.1 | No | Revisiting Simple Neural Probabilistic Language ... | 2021-04-08 | Code |
| 26 | AdvSoft (+ 4 layer QRNN + dynamic eval) | 27.2 | No | Improving Neural Language Modeling via Adversari... | 2019-06-10 | Code |
| 27 | LSTM (Hebbian, Cache, MbPA) | 29 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | - |
| 28 | Rfa-Gate-Gaussian-Stateful (Small) | 29.4 | No | Random Feature Attention | 2021-03-03 | - |
| 29 | LSTM (Hebbian, Cache) | 29.9 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | - |
| 30 | LSTM (RMC) | 30.8 | No | Relational recurrent neural networks | 2018-06-05 | Code |
| 31 | AWD-LSTM-MoS + ATOI | 31.92 | No | Alleviating Sequence Information Loss with Data ... | 2019-09-18 | Code |
| 32 | 4 layer QRNN | 32 | No | An Analysis of Neural Language Modeling at Multi... | 2018-03-22 | Code |
| 33 | LSTM (Hebbian) | 34.1 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | - |
| 34 | LSTM | 36 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | - |
| 35 | LSTM | 52.73 | No | How much complexity does an RNN architecture nee... | 2020-05-17 | Code |
| 36 | GRU | 53.78 | No | How much complexity does an RNN architecture nee... | 2020-05-17 | Code |
| 37 | Decay RNN | 76.67 | No | How much complexity does an RNN architecture nee... | 2020-05-17 | Code |