
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Language Modelling on WikiText-103

Metric: Validation perplexity (lower is better)
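Perplexity is the exponentiated mean negative log-likelihood of the validation tokens, so a perplexity of 16 means the model is, on average, as uncertain as a uniform choice among 16 candidate tokens. A minimal sketch of the computation, assuming per-token natural-log probabilities are available (the function and example values below are illustrative, not taken from any listed paper):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood).

    token_log_probs: natural-log probabilities the model assigned
    to each ground-truth token in the validation set.
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Example: a model assigning probability 0.07 to every token scores
# roughly 1/0.07 ≈ 14.3, in the range of the leaderboard's top entries.
print(perplexity([math.log(0.07)] * 1000))  # ≈ 14.29
```

Note that perplexities are only comparable across models that share a tokenization; WikiText-103 results are conventionally reported at the word level.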


Results

# | Model | Validation perplexity | Extra Data | Paper | Date | Code
1 | Ensemble of All | 13.11 | No | Advancing State of the Art in Language Modeling | 2023-11-28 | Code
2 | kNN-LM w/ Adaptive Coefficient | 15.72 | No | You can't pick your neighbors, or can you? When ... | 2022-10-28 | Code
3 | Transformer-XL (RMS dynamic eval) | 15.8 | Yes | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code
4 | kNN-LM w/ Continuous Cache | 15.81 | No | Generalization through Memorization: Nearest Nei... | 2019-11-01 | Code
5 | Compressive Transformer (18L, M=1024) | 16 | No | Compressive Transformers for Long-Range Sequence... | 2019-11-13 | Code
6 | kNN-LM | 16.06 | No | Generalization through Memorization: Nearest Nei... | 2019-11-01 | Code
7 | Transformer-XL (SGD dynamic eval) | 16.3 | No | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code
8 | SRU++ Large | 16.4 | No | When Attention Meets Fast Recurrence: Training L... | 2021-02-24 | Code
9 | Transformer+SSA+Self-ensemble | 16.54 | No | The Information Pathways Hypothesis: Transformer... | 2023-06-02 | Code
10 | Staged Training | 16.89 | No | Shortformer: Better Language Modeling using Shor... | 2020-12-31 | Code
11 | Transformer+SSA | 16.91 | No | The Information Pathways Hypothesis: Transformer... | 2023-06-02 | Code
12 | Shortformer | 17.47 | No | Shortformer: Better Language Modeling using Shor... | 2020-12-31 | Code
13 | Feedback Transformer (8 layers) | 17.5 | No | Addressing Some Limitations of Transformers with... | 2020-02-21 | Code
14 | SRU++ Base | 17.5 | No | When Attention Meets Fast Recurrence: Training L... | 2021-02-24 | Code
15 | Transformer (Adaptive inputs) | 17.97 | No | Adaptive Input Representations for Neural Langua... | 2018-09-28 | Code
16 | Transformer-XL Large | 18.2 | No | Transformer-XL: Attentive Language Models Beyond... | 2019-01-09 | Code
17 | T2R + Pretrain | 19 | No | Finetuning Pretrained Transformers into RNNs | 2021-03-24 | Code
18 | Transformer (Adaptive inputs) | 19.5 | No | On the adequacy of untuned warmup for adaptive o... | 2019-10-09 | Code
19 | BERT-Large-CAS | 19.6 | No | Language Models with Transformers | 2019-04-20 | Code
20 | All-attention network (36 layers) | 19.7 | No | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Code
21 | Feedback Transformer (4 layers) | 21.4 | No | Addressing Some Limitations of Transformers with... | 2020-02-21 | Code
22 | Skip Cross-Head Transformer-XL | 21.87 | No | Memory-efficient Stochastic methods for Memory-b... | 2023-11-14 | Code
23 | Rfa-Gate-Gaussian-Stateful (Big) | 22 | No | Random Feature Attention | 2021-03-03 | -
24 | Transformer-XL Standard | 23.1 | No | Transformer-XL: Attentive Language Models Beyond... | 2019-01-09 | Code
25 | Transformer-N | 24.1 | No | Revisiting Simple Neural Probabilistic Language ... | 2021-04-08 | Code
26 | AdvSoft (+ 4 layer QRNN + dynamic eval) | 27.2 | No | Improving Neural Language Modeling via Adversari... | 2019-06-10 | Code
27 | LSTM (Hebbian, Cache, MbPA) | 29 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | -
28 | Rfa-Gate-Gaussian-Stateful (Small) | 29.4 | No | Random Feature Attention | 2021-03-03 | -
29 | LSTM (Hebbian, Cache) | 29.9 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | -
30 | LSTM (RMC) | 30.8 | No | Relational recurrent neural networks | 2018-06-05 | Code
31 | AWD-LSTM-MoS + ATOI | 31.92 | No | Alleviating Sequence Information Loss with Data ... | 2019-09-18 | Code
32 | 4 layer QRNN | 32 | No | An Analysis of Neural Language Modeling at Multi... | 2018-03-22 | Code
33 | LSTM (Hebbian) | 34.1 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | -
34 | LSTM | 36 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | -
35 | LSTM | 52.73 | No | How much complexity does an RNN architecture nee... | 2020-05-17 | Code
36 | GRU | 53.78 | No | How much complexity does an RNN architecture nee... | 2020-05-17 | Code
37 | Decay RNN | 76.67 | No | How much complexity does an RNN architecture nee... | 2020-05-17 | Code