Language Modelling on WikiText-2
Metric: Test perplexity (lower is better)
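For reference, perplexity is conventionally the exponential of the average negative log-likelihood per token, which is why lower values are better. The following is a minimal sketch in Python, assuming per-token natural-log probabilities are already available. The function name `perplexity` and the sample values are illustrative and not taken from any paper listed here. Reported numbers may also depend on tokenization and normalization choices (for example word-level versus subword-level), which this sketch does not address.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: a model that assigns probability 1/8 to every token.
# Its perplexity should be close to 8, matching the intuition of choosing
# uniformly among 8 options at each step.
uniform_log_probs = [math.log(1 / 8)] * 5
print(perplexity(uniform_log_probs))
```

A model that is more confident about the correct tokens yields higher log-probabilities and therefore a lower perplexity.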
Results
| # | Model | Test perplexity | Extra data | SOTA badge | Paper | Date | Code |
|---|-------|-----------------|------------|------------|-------|------|------|
| 1 | SparseGPT (175B, 50% Sparsity) | 8.21 | Yes | SOTA | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |
| 2 | OPT-175B | 8.34 | Yes | SOTA | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |
| 3 | SparseGPT (175B, 4:8 Sparsity) | 8.45 | Yes | SOTA | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |
| 4 | SparseGPT (175B, 2:4 Sparsity) | 8.73 | Yes | SOTA | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |
| 5 | GPT-2 (fine-tuned) | 15.17 | Yes | SOTA | Hydra: A System for Large Multi-Model Deep Learning | 2021-10-16 | Code |
| 6 | GPT-2 | 18.34 | Yes | - | - | - | Code |
| 7 | GPT-2 (large) | 19.93 | Yes | - | - | - | Code |
| 8 | GPT-2 (medium) | 22.76 | Yes | - | - | - | Code |
| 9 | GPT-2 (small) | 29.41 | Yes | - | - | - | Code |
| 10 | BERT-Large-CAS | 34.1 | Yes | SOTA | Language Models with Transformers | 2019-04-20 | Code |
| 11 | Mogrifier LSTM + dynamic eval | 38.6 | No | - | Mogrifier LSTM | 2019-09-04 | Code |
| 12 | adversarial + AWD-LSTM-MoS + dynamic eval | 38.65 | No | - | Improving Neural Language Modeling via Adversarial Training | 2019-06-10 | Code |
| 13 | FRAGE + AWD-LSTM-MoS + dynamic eval | 39.14 | No | SOTA | FRAGE: Frequency-Agnostic Word Representation | 2018-09-18 | Code |
| 14 | Past Decode Reg. + AWD-LSTM-MoS + dyn. eval. | 40.3 | No | SOTA | Improved Language Modeling by Decoding the Past | 2018-08-14 | - |
| 15 | GL-LWGC + AWD-MoS-LSTM + dynamic eval | 40.46 | No | SOTA | Gradual Learning of Recurrent Neural Networks | 2017-08-29 | Code |
| 16 | AWD-LSTM-MoS + dynamic eval | 40.68 | No | - | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | 2017-11-10 | Code |
| 17 | AWD-LSTM-DRILL + dynamic eval | 42 | No | - | Deep Residual Output Layers for Neural Language Generation | 2019-05-14 | Code |
| 18 | AWD-LSTM + dynamic eval | 44.3 | No | - | Dynamic Evaluation of Neural Sequence Models | 2017-09-21 | Code |
| 19 | AWD-LSTM + continuous cache pointer | 52 | No | SOTA | Regularizing and Optimizing LSTM Language Models | 2017-08-07 | Code |
| 20 | AWD-LSTM-DOC x5 | 53.09 | No | - | Direct Output Connection for a High-Rank Language Model | 2018-08-30 | Code |
| 21 | Ensemble of All | 53.73 | No | - | Advancing State of the Art in Language Modeling | 2023-11-28 | Code |
| 22 | Mogrifier LSTM | 55.1 | No | - | Mogrifier LSTM | 2019-09-04 | Code |
| 23 | AWD-LSTM-DOC + Partial Shuffle | 57.85 | No | - | Partially Shuffling the Training Data to Improve Language Models | 2019-03-11 | Code |
| 24 | AWD-LSTM-DOC | 58.03 | No | - | Direct Output Connection for a High-Rank Language Model | 2018-08-30 | Code |
| 25 | AWD-LSTM-MoS + Partial Shuffle | 59.98 | No | - | Partially Shuffling the Training Data to Improve Language Models | 2019-03-11 | Code |
| 26 | AWD-LSTM-MoS | 61.45 | No | - | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | 2017-11-10 | Code |
| 27 | AWD-FWM Schlag et al. (2020) | 61.65 | No | - | Learning Associative Inference Using Fast Weight Memory | 2020-11-16 | Code |
| 28 | AWD-LSTM-DRILL | 61.9 | No | - | Deep Residual Output Layers for Neural Language Generation | 2019-05-14 | Code |
| 29 | AWD-LSTM 3-layer with Fraternal dropout | 64.1 | No | - | Fraternal Dropout | 2017-10-31 | Code |
| 30 | AWD-LSTM + ATOI | 64.73 | No | - | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | 2019-09-18 | Code |
| 31 | AWD-LSTM | 65.8 | No | SOTA | Regularizing and Optimizing LSTM Language Models | 2017-08-07 | Code |
| 32 | Melis et al. (2017) - 1-layer LSTM (tied) | 65.9 | No | SOTA | On the State of the Art of Evaluation in Neural Language Models | 2017-07-18 | Code |
| 33 | Grave et al. (2016) - LSTM + continuous cache pointer | 68.9 | No | SOTA | Improving Neural Language Models with a Continuous Cache | 2016-12-13 | Code |
| 34 | EGRU | 68.9 | No | - | Efficient recurrent architectures through activity sparsity and sparse back-propagation through time | 2022-06-13 | Code |
| 35 | Inan et al. (2016) - Variational LSTM (tied) (h=650) + augmented loss | 87 | No | SOTA | Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | 2016-11-04 | Code |
| 36 | Inan et al. (2016) - Variational LSTM (tied) (h=650) | 87.7 | No | SOTA | Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling | 2016-11-04 | Code |
| 37 | Grave et al. (2016) - LSTM | 99.3 | No | - | Improving Neural Language Models with a Continuous Cache | 2016-12-13 | Code |
| 38 | OPT-175B (50% Sparsity) | 234.77 | Yes | - | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |