Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Language Modelling on WikiText-2

Metric: Test perplexity (lower is better)
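As a quick reference for reading the numbers below: test perplexity is the exponential of the average negative log-likelihood the model assigns to each token of the test set, so a perplexity of 8.21 means the model is, on average, as uncertain as a uniform choice over about 8.2 tokens. A minimal sketch of the relationship (the `token_log_probs` values are hypothetical, not taken from any model on this leaderboard):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical example: a model that assigns probability 1/4
# to every test token has perplexity exactly 4.
token_log_probs = [math.log(0.25)] * 4
print(perplexity(token_log_probs))  # -> 4.0
```

Because perplexity is just an exponentiated cross-entropy, "lower is better": a model that concentrates probability mass on the correct next token drives the average negative log-likelihood, and hence the perplexity, down.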


Results

| # | Model | Test perplexity | Extra Data | Paper | Date | Code |
|---|-------|----------------:|------------|-------|------|------|
| 1 | SparseGPT (175B, 50% Sparsity) | 8.21 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |
| 2 | OPT-175B | 8.34 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |
| 3 | SparseGPT (175B, 4:8 Sparsity) | 8.45 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |
| 4 | SparseGPT (175B, 2:4 Sparsity) | 8.73 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |
| 5 | GPT-2 (fine-tuned) | 15.17 | Yes | Hydra: A System for Large Multi-Model Deep Learn... | 2021-10-16 | Code |
| 6 | GPT-2 | 18.34 | Yes | - | - | Code |
| 7 | GPT-2 (large) | 19.93 | Yes | - | - | Code |
| 8 | GPT-2 (medium) | 22.76 | Yes | - | - | Code |
| 9 | GPT-2 (small) | 29.41 | Yes | - | - | Code |
| 10 | BERT-Large-CAS | 34.1 | Yes | Language Models with Transformers | 2019-04-20 | Code |
| 11 | Mogrifier LSTM + dynamic eval | 38.6 | No | Mogrifier LSTM | 2019-09-04 | Code |
| 12 | adversarial + AWD-LSTM-MoS + dynamic eval | 38.65 | No | Improving Neural Language Modeling via Adversari... | 2019-06-10 | Code |
| 13 | FRAGE + AWD-LSTM-MoS + dynamic eval | 39.14 | No | FRAGE: Frequency-Agnostic Word Representation | 2018-09-18 | Code |
| 14 | Past Decode Reg. + AWD-LSTM-MoS + dyn. eval. | 40.3 | No | Improved Language Modeling by Decoding the Past | 2018-08-14 | - |
| 15 | GL-LWGC + AWD-MoS-LSTM + dynamic eval | 40.46 | No | Gradual Learning of Recurrent Neural Networks | 2017-08-29 | Code |
| 16 | AWD-LSTM-MoS + dynamic eval | 40.68 | No | Breaking the Softmax Bottleneck: A High-Rank RNN... | 2017-11-10 | Code |
| 17 | AWD-LSTM-DRILL + dynamic eval | 42 | No | Deep Residual Output Layers for Neural Language ... | 2019-05-14 | Code |
| 18 | AWD-LSTM + dynamic eval | 44.3 | No | Dynamic Evaluation of Neural Sequence Models | 2017-09-21 | Code |
| 19 | AWD-LSTM + continuous cache pointer | 52 | No | Regularizing and Optimizing LSTM Language Models | 2017-08-07 | Code |
| 20 | AWD-LSTM-DOC x5 | 53.09 | No | Direct Output Connection for a High-Rank Languag... | 2018-08-30 | Code |
| 21 | Ensemble of All | 53.73 | No | Advancing State of the Art in Language Modeling | 2023-11-28 | Code |
| 22 | Mogrifier LSTM | 55.1 | No | Mogrifier LSTM | 2019-09-04 | Code |
| 23 | AWD-LSTM-DOC + Partial Shuffle | 57.85 | No | Partially Shuffling the Training Data to Improve... | 2019-03-11 | Code |
| 24 | AWD-LSTM-DOC | 58.03 | No | Direct Output Connection for a High-Rank Languag... | 2018-08-30 | Code |
| 25 | AWD-LSTM-MoS + Partial Shuffle | 59.98 | No | Partially Shuffling the Training Data to Improve... | 2019-03-11 | Code |
| 26 | AWD-LSTM-MoS | 61.45 | No | Breaking the Softmax Bottleneck: A High-Rank RNN... | 2017-11-10 | Code |
| 27 | AWD-FWM Schlag et al. (2020) | 61.65 | No | Learning Associative Inference Using Fast Weight... | 2020-11-16 | Code |
| 28 | AWD-LSTM-DRILL | 61.9 | No | Deep Residual Output Layers for Neural Language ... | 2019-05-14 | Code |
| 29 | AWD-LSTM 3-layer with Fraternal dropout | 64.1 | No | Fraternal Dropout | 2017-10-31 | Code |
| 30 | AWD-LSTM + ATOI | 64.73 | No | Alleviating Sequence Information Loss with Data ... | 2019-09-18 | Code |
| 31 | AWD-LSTM | 65.8 | No | Regularizing and Optimizing LSTM Language Models | 2017-08-07 | Code |
| 32 | Melis et al. (2017) - 1-layer LSTM (tied) | 65.9 | No | On the State of the Art of Evaluation in Neural ... | 2017-07-18 | Code |
| 33 | Grave et al. (2016) - LSTM + continuous cache pointer | 68.9 | No | Improving Neural Language Models with a Continuo... | 2016-12-13 | Code |
| 34 | EGRU | 68.9 | No | Efficient recurrent architectures through activi... | 2022-06-13 | Code |
| 35 | Inan et al. (2016) - Variational LSTM (tied) (h=650) + augmented loss | 87 | No | Tying Word Vectors and Word Classifiers: A Loss ... | 2016-11-04 | Code |
| 36 | Inan et al. (2016) - Variational LSTM (tied) (h=650) | 87.7 | No | Tying Word Vectors and Word Classifiers: A Loss ... | 2016-11-04 | Code |
| 37 | Grave et al. (2016) - LSTM | 99.3 | No | Improving Neural Language Models with a Continuo... | 2016-12-13 | Code |
| 38 | OPT-175B (50% Sparsity) | 234.77 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |