Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Language Modelling on WikiText-103

Metric: Test perplexity (lower is better). Perplexity is the exponential of the model's average per-token negative log-likelihood on the test set, so lower values mean the model assigns higher probability to the reference text.
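For concreteness, here is a minimal sketch of the metric in plain Python (the `perplexity` helper and the toy inputs are illustrative, not part of any evaluation pipeline used by the entries below):

```python
import math

def perplexity(token_nlls):
    """Exponential of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every token probability 0.1 scores perplexity 10;
# a perfect model (probability 1.0 for every token) scores the minimum, 1.
print(perplexity([-math.log(0.1)] * 4))  # ~10.0
print(perplexity([-math.log(1.0)] * 4))  # 1.0
```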


Results

| # | Model | Test perplexity | Extra Data | Paper | Date | Code |
|---|-------|-----------------|------------|-------|------|------|
| 1 | RETRO (7.5B) | 2.4 | Yes | Improving language models by retrieving from trillions of tokens | 2021-12-08 | Code |
| 2 | Hybrid H3 (2.7B) | 10.6 | Yes | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 3 | Megatron-LM | 10.81 | Yes | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 2019-09-17 | Code |
| 4 | GLM-XXLarge (bidirectional) | 11.33 | Yes | GLM: General Language Model Pretraining with Autoregressive Blank Infilling | 2021-03-18 | Code |
| 5 | GLM-XXLarge (unidirectional) | 12.22 | Yes | GLM: General Language Model Pretraining with Autoregressive Blank Infilling | 2021-03-18 | Code |
| 6 | Hybrid H3 (1.3B) | 12.5 | Yes | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 7 | Ensemble of All | 13.29 | No | Advancing State of the Art in Language Modeling | 2023-11-28 | Code |
| 8 | GateLoop (125M) | 13.4 | No | GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling | 2023-11-03 | Code |
| 9 | kNN-LM w/ Adaptive Coefficient | 15.5 | No | You can't pick your neighbors, or can you? When and how to rely on retrieval in the kNN-LM | 2022-10-28 | Code |
| 10 | kNN-LM w/ Continuous Cache | 15.79 | No | Generalization through Memorization: Nearest Neighbor Language Models | 2019-11-01 | Code |
| 11 | Routing Transformer | 15.8 | No | Efficient Content-Based Sparse Attention with Routing Transformers | 2020-03-12 | Code |
| 12 | kNN-LM | 16.12 | No | Generalization through Memorization: Nearest Neighbor Language Models | 2019-11-01 | Code |
| 13 | Transformer-XL (RMS dynamic eval) | 16.4 | Yes | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 14 | ∞-former (SM) | 16.61 | No | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 15 | ∞-former (Sticky memories + initialized GPT-2 Small) | 16.61 | Yes | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 16 | ∞-former (initialized GPT-2 Small) | 16.64 | Yes | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 17 | Hybrid H3 (355M) | 16.9 | Yes | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 18 | Transformer-XL (SGD dynamic eval) | 17 | No | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 19 | Compressive Transformer (18L, M=1024) | 17.1 | No | Compressive Transformers for Long-Range Sequence Modelling | 2019-11-13 | Code |
| 20 | SRU++ Large | 17.1 | No | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Code |
| 21 | SegaTransformer-XL | 17.1 | No | Segatron: Segment-Aware Transformer for Language Modeling and Understanding | 2020-04-30 | Code |
| 22 | Transformer+SSA+Self-ensemble | 17.18 | No | The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | 2023-06-02 | Code |
| 23 | Transformer-XL Large + Phrase Induction | 17.4 | No | Improving Neural Language Models by Segmenting, Attending, and Predicting the Future | 2019-06-04 | Code |
| 24 | GPT-2 Full | 17.48 | Yes | - | - | Code |
| 25 | Staged Training | 17.56 | No | Shortformer: Better Language Modeling using Shorter Inputs | 2020-12-31 | Code |
| 26 | Transformer+SSA | 17.6 | No | The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | 2023-06-02 | Code |
| 27 | Sandwich Transformer | 17.96 | No | Improving Transformer Models by Reordering their Sublayers | 2019-11-10 | Code |
| 28 | DIFFQ (λ=1, g=16) | 18 | No | Differentiable Model Compression via Pseudo Quantization | 2021-04-20 | Code |
| 29 | Mega | 18.07 | No | Mega: Moving Average Equipped Gated Attention | 2022-09-21 | Code |
| 30 | Shortformer | 18.15 | No | Shortformer: Better Language Modeling using Shorter Inputs | 2020-12-31 | Code |
| 31 | Feedback Transformer (8 layers) | 18.2 | No | Addressing Some Limitations of Transformers with Feedback Memory | 2020-02-21 | Code |
| 32 | SRU++ Base | 18.3 | No | When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute | 2021-02-24 | Code |
| 33 | Transformer-XL Large | 18.3 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 34 | PAR Transformer Large | 18.4 | No | Pay Attention when Required | 2020-09-09 | Code |
| 35 | Perceiver AR 358M | 18.4 | No | General-purpose, long-context autoregressive modeling with Perceiver AR | 2022-02-15 | Code |
| 36 | Hyena-3-slim | 18.5 | No | Hyena Hierarchy: Towards Larger Convolutional Language Models | 2023-02-21 | Code |
| 37 | Hybrid H3 125M | 18.5 | No | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 38 | Hyena-3 | 18.6 | No | Hyena Hierarchy: Towards Larger Convolutional Language Models | 2023-02-21 | Code |
| 39 | Transformer (Adaptive inputs) | 18.7 | No | Adaptive Input Representations for Neural Language Modeling | 2018-09-28 | Code |
| 40 | T2R + Pretrain | 19.6 | No | Finetuning Pretrained Transformers into RNNs | 2021-03-24 | Code |
| 41 | Subformer | 20.39 | No | - | - | - |
| 42 | BERT-Large-CAS | 20.4 | No | Language Models with Transformers | 2019-04-20 | Code |
| 43 | All-attention network (36 layers) | 20.6 | No | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Code |
| 44 | S4 | 21.28 | No | Efficiently Modeling Long Sequences with Structured State Spaces | 2021-10-31 | Code |
| 45 | GPT-2 Large | 22.05 | Yes | - | - | Code |
| 46 | Feedback Transformer (4 layers) | 22.4 | No | Addressing Some Limitations of Transformers with Feedback Memory | 2020-02-21 | Code |
| 47 | PAR Transformer Base | 22.7 | No | Pay Attention when Required | 2020-09-09 | Code |
| 48 | Skip Cross-Head Transformer-XL | 22.91 | No | Memory-efficient Stochastic methods for Memory-based Transformers | 2023-11-14 | Code |
| 49 | DEQ-Transformer (medium, adaptive embed) | 23.2 | No | Deep Equilibrium Models | 2019-09-03 | Code |
| 50 | TaLK Convolutions | 23.3 | No | Time-aware Large Kernel Convolutions | 2020-02-08 | Code |
| 51 | Rfa-Gate-Gaussian-Stateful (Big) | 23.5 | No | Random Feature Attention | 2021-03-03 | - |
| 52 | Hybrid H3 (125M) | 23.7 | Yes | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Code |
| 53 | Transformer-XL Standard | 24 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 54 | DeLighT | 24.14 | No | DeLighT: Deep and Light-weight Transformer | 2020-08-03 | Code |
| 55 | ∞-former (Sticky memories) | 24.22 | No | ∞-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 56 | Transformer-N | 25.2 | No | Revisiting Simple Neural Probabilistic Language Models | 2021-04-08 | Code |
| 57 | Linear Attention 125M | 25.6 | No | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | 2020-06-29 | Code |
| 58 | FNetAR Medium | 25.81 | No | FNetAR: Mixing Tokens with Autoregressive Fourier Transforms | 2021-07-22 | Code |
| 59 | Reformer 125M | 26 | No | Reformer: The Efficient Transformer | 2020-01-13 | Code |
| 60 | GPT-2 Medium | 26.37 | Yes | - | - | Code |
| 61 | Performer 125M | 26.8 | No | Rethinking Attention with Performers | 2020-09-30 | Code |
| 62 | AdvSoft (+ 4 layer QRNN + dynamic eval) | 28 | No | Improving Neural Language Modeling via Adversarial Training | 2019-06-10 | Code |
| 63 | DEQ-TrellisNet | 29 | No | Deep Equilibrium Models | 2019-09-03 | Code |
| 64 | Trellis Network | 29.19 | No | Trellis Networks for Sequence Modeling | 2018-10-15 | Code |
| 65 | LSTM (Hebbian, Cache, MbPA) | 29.2 | No | Fast Parametric Learning with Activation Memorization | 2018-03-27 | - |
| 66 | LSTM (Hebbian, Cache) | 29.7 | No | Fast Parametric Learning with Activation Memorization | 2018-03-27 | - |
| 67 | Rfa-Gate-Gaussian-Stateful (Small) | 30.5 | No | Random Feature Attention | 2021-03-03 | - |
| 68 | Primal.+Trans. | 31 | No | Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation | 2023-05-31 | Code |
| 69 | LSTM (RMC) | 31.6 | No | Relational recurrent neural networks | 2018-06-05 | Code |
| 70 | DEQ-Transformer (small) | 32.4 | No | Deep Equilibrium Models | 2019-09-03 | Code |
| 71 | AWD-LSTM-MoS + ATOI | 32.85 | No | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | 2019-09-18 | Code |
| 72 | 4 layer QRNN | 33 | No | An Analysis of Neural Language Modeling at Multiple Scales | 2018-03-22 | Code |
| 73 | LSTM (Hebbian) | 34.3 | No | Fast Parametric Learning with Activation Memorization | 2018-03-27 | - |
| 74 | LSTM | 36.4 | No | Fast Parametric Learning with Activation Memorization | 2018-03-27 | - |
| 75 | GCNN-8 | 37.2 | No | Language Modeling with Gated Convolutional Networks | 2016-12-23 | Code |
| 76 | GPT-2 Small | 37.5 | Yes | - | - | Code |
| 77 | Neural cache model (size = 2,000) | 40.8 | No | Improving Neural Language Models with a Continuous Cache | 2016-12-13 | Code |
| 78 | Neural cache model (size = 100) | 44.8 | No | Improving Neural Language Models with a Continuous Cache | 2016-12-13 | Code |
| 79 | GCNN-8 | 44.9 | No | Language Modeling with Gated Convolutional Networks | 2016-12-23 | Code |
| 80 | TCN | 45.19 | No | An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling | 2018-03-04 | Code |
| 81 | Temporal CNN | 45.2 | No | - | - | - |
| 82 | LSTM | 48.7 | No | Improving Neural Language Models with a Continuous Cache | 2016-12-13 | Code |
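For reference, below is a hedged sketch of how one of the GPT-2 baselines above could be re-scored on the WikiText-103 test split, using the Hugging Face `transformers` and `datasets` libraries with a standard sliding-window evaluation. The stride value is an illustrative assumption, the loss re-scaling is approximate (the model shifts labels internally), and the resulting subword-level perplexity is not directly comparable to the word-level numbers in the table without correcting for tokenization.

```python
# Sliding-window perplexity of GPT-2 on the WikiText-103 test set (sketch).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

window, stride = model.config.n_positions, 512  # stride is an assumption
nlls, prev_end = [], 0
for begin in range(0, ids.size(1), stride):
    end = min(begin + window, ids.size(1))
    target_len = end - prev_end              # tokens newly scored this step
    input_ids = ids[:, begin:end].to(device)
    targets = input_ids.clone()
    targets[:, :-target_len] = -100          # ignore already-scored context
    with torch.no_grad():
        loss = model(input_ids, labels=targets).loss
    nlls.append(loss * target_len)           # undo the internal averaging
    prev_end = end
    if end == ids.size(1):
        break

print(f"perplexity: {torch.exp(torch.stack(nlls).sum() / prev_end).item():.2f}")
```

Published results additionally differ in detokenization rules and context-carryover schemes (e.g., the dynamic-evaluation and cache entries above), so exact reproduction requires following each paper's own protocol.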