| Rank | Model | Test perplexity | Uses extra training data | Paper | Date | Code |
|------|-------|-----------------|---------------------------|-------|------|------|
| 1 | RETRO (7.5B) | 2.4 | Yes | Improving language models by retrieving from tri... | 2021-12-08 | Code |
| 2 | Hybrid H3 (2.7B) | 10.6 | Yes | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 3 | Megatron-LM | 10.81 | Yes | Megatron-LM: Training Multi-Billion Parameter La... | 2019-09-17 | Code |
| 4 | GLM-XXLarge (bidirectional) | 11.33 | Yes | GLM: General Language Model Pretraining with Aut... | 2021-03-18 | Code |
| 5 | GLM-XXLarge (unidirectional) | 12.22 | Yes | GLM: General Language Model Pretraining with Aut... | 2021-03-18 | Code |
| 6 | Hybrid H3 (1.3B) | 12.5 | Yes | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 7 | Ensemble of All | 13.29 | No | Advancing State of the Art in Language Modeling | 2023-11-28 | Code |
| 8 | GateLoop (125M) | 13.4 | No | GateLoop: Fully Data-Controlled Linear Recurrenc... | 2023-11-03 | Code |
| 9 | kNN-LM w/ Adaptive Coefficient | 15.5 | No | You can't pick your neighbors, or can you? When ... | 2022-10-28 | Code |
| 10 | kNN-LM w/ Continuous Cache | 15.79 | No | Generalization through Memorization: Nearest Nei... | 2019-11-01 | Code |
| 11 | Routing Transformer | 15.8 | No | Efficient Content-Based Sparse Attention with Ro... | 2020-03-12 | Code |
| 12 | kNN-LM | 16.12 | No | Generalization through Memorization: Nearest Nei... | 2019-11-01 | Code |
| 13 | Transformer-XL (RMS dynamic eval) | 16.4 | Yes | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 14 | ∞-former (SM) | 16.61 | No | $\infty$-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 15 | ∞-former (Sticky memories + initialized GPT-2 Small) | 16.61 | Yes | $\infty$-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 16 | ∞-former (initialized GPT-2 Small) | 16.64 | Yes | $\infty$-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 17 | Hybrid H3 (355M) | 16.9 | Yes | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 18 | Transformer-XL (SGD dynamic eval) | 17 | No | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 19 | Compressive Transformer (18L, M=1024) | 17.1 | No | Compressive Transformers for Long-Range Sequence... | 2019-11-13 | Code |
| 20 | SRU++ Large | 17.1 | No | When Attention Meets Fast Recurrence: Training L... | 2021-02-24 | Code |
| 21 | SegaTransformer-XL | 17.1 | No | Segatron: Segment-Aware Transformer for Language... | 2020-04-30 | Code |
| 22 | Transformer+SSA+Self-ensemble | 17.18 | No | The Information Pathways Hypothesis: Transformer... | 2023-06-02 | Code |
| 23 | Transformer-XL Large + Phrase Induction | 17.4 | No | Improving Neural Language Models by Segmenting, ... | 2019-06-04 | Code |
| 24 | GPT-2 Full | 17.48 | Yes | - | - | Code |
| 25 | Staged Training | 17.56 | No | Shortformer: Better Language Modeling using Shor... | 2020-12-31 | Code |
| 26 | Transformer+SSA | 17.6 | No | The Information Pathways Hypothesis: Transformer... | 2023-06-02 | Code |
| 27 | Sandwich Transformer | 17.96 | No | Improving Transformer Models by Reordering their... | 2019-11-10 | Code |
| 28 | DIFFQ (λ=1, g=16) | 18 | No | Differentiable Model Compression via Pseudo Quan... | 2021-04-20 | Code |
| 29 | Mega | 18.07 | No | Mega: Moving Average Equipped Gated Attention | 2022-09-21 | Code |
| 30 | Shortformer | 18.15 | No | Shortformer: Better Language Modeling using Shor... | 2020-12-31 | Code |
| 31 | Feedback Transformer (8 layers) | 18.2 | No | Addressing Some Limitations of Transformers with... | 2020-02-21 | Code |
| 32 | SRU++ Base | 18.3 | No | When Attention Meets Fast Recurrence: Training L... | 2021-02-24 | Code |
| 33 | Transformer-XL Large | 18.3 | No | Transformer-XL: Attentive Language Models Beyond... | 2019-01-09 | Code |
| 34 | PAR Transformer Large | 18.4 | No | Pay Attention when Required | 2020-09-09 | Code |
| 35 | Perceiver AR 358M | 18.4 | No | General-purpose, long-context autoregressive mod... | 2022-02-15 | Code |
| 36 | Hyena-3-slim | 18.5 | No | Hyena Hierarchy: Towards Larger Convolutional La... | 2023-02-21 | Code |
| 37 | Hybrid H3 (125M) | 18.5 | No | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 38 | Hyena-3 | 18.6 | No | Hyena Hierarchy: Towards Larger Convolutional La... | 2023-02-21 | Code |
| 39 | Transformer (Adaptive inputs) | 18.7 | No | Adaptive Input Representations for Neural Langua... | 2018-09-28 | Code |
| 40 | T2R + Pretrain | 19.6 | No | Finetuning Pretrained Transformers into RNNs | 2021-03-24 | Code |
| 41 | Subformer | 20.39 | No | - | - | - |
| 42 | BERT-Large-CAS | 20.4 | No | Language Models with Transformers | 2019-04-20 | Code |
| 43 | All-attention network (36 layers) | 20.6 | No | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Code |
| 44 | S4 | 21.28 | No | Efficiently Modeling Long Sequences with Structu... | 2021-10-31 | Code |
| 45 | GPT-2 Large | 22.05 | Yes | - | - | Code |
| 46 | Feedback Transformer (4 layers) | 22.4 | No | Addressing Some Limitations of Transformers with... | 2020-02-21 | Code |
| 47 | PAR Transformer Base | 22.7 | No | Pay Attention when Required | 2020-09-09 | Code |
| 48 | Skip Cross-Head Transformer-XL | 22.91 | No | Memory-efficient Stochastic methods for Memory-b... | 2023-11-14 | Code |
| 49 | DEQ-Transformer (medium, adaptive embed) | 23.2 | No | Deep Equilibrium Models | 2019-09-03 | Code |
| 50 | TaLK Convolutions | 23.3 | No | Time-aware Large Kernel Convolutions | 2020-02-08 | Code |
| 51 | Rfa-Gate-Gaussian-Stateful (Big) | 23.5 | No | Random Feature Attention | 2021-03-03 | - |
| 52 | Hybrid H3 (125M) | 23.7 | Yes | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 53 | Transformer-XL Standard | 24 | No | Transformer-XL: Attentive Language Models Beyond... | 2019-01-09 | Code |
| 54 | DeLighT | 24.14 | No | DeLighT: Deep and Light-weight Transformer | 2020-08-03 | Code |
| 55 | ∞-former (Sticky memories) | 24.22 | No | $\infty$-former: Infinite Memory Transformer | 2021-09-01 | Code |
| 56 | Transformer-N | 25.2 | No | Revisiting Simple Neural Probabilistic Language ... | 2021-04-08 | Code |
| 57 | Linear Attention 125M | 25.6 | No | Transformers are RNNs: Fast Autoregressive Trans... | 2020-06-29 | Code |
| 58 | FNetAR Medium | 25.81 | No | FNetAR: Mixing Tokens with Autoregressive Fourie... | 2021-07-22 | Code |
| 59 | Reformer 125M | 26 | No | Reformer: The Efficient Transformer | 2020-01-13 | Code |
| 60 | GPT-2 Medium | 26.37 | Yes | - | - | Code |
| 61 | Performer 125M | 26.8 | No | Rethinking Attention with Performers | 2020-09-30 | Code |
| 62 | AdvSoft (+ 4 layer QRNN + dynamic eval) | 28 | No | Improving Neural Language Modeling via Adversari... | 2019-06-10 | Code |
| 63 | DEQ-TrellisNet | 29 | No | Deep Equilibrium Models | 2019-09-03 | Code |
| 64 | Trellis Network | 29.19 | No | Trellis Networks for Sequence Modeling | 2018-10-15 | Code |
| 65 | LSTM (Hebbian, Cache, MbPA) | 29.2 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | - |
| 66 | LSTM (Hebbian, Cache) | 29.7 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | - |
| 67 | Rfa-Gate-Gaussian-Stateful (Small) | 30.5 | No | Random Feature Attention | 2021-03-03 | - |
| 68 | Primal.+Trans. | 31 | No | Primal-Attention: Self-attention through Asymmet... | 2023-05-31 | Code |
| 69 | LSTM (RMC) | 31.6 | No | Relational recurrent neural networks | 2018-06-05 | Code |
| 70 | DEQ-Transformer (small) | 32.4 | No | Deep Equilibrium Models | 2019-09-03 | Code |
| 71 | AWD-LSTM-MoS + ATOI | 32.85 | No | Alleviating Sequence Information Loss with Data ... | 2019-09-18 | Code |
| 72 | 4 layer QRNN | 33 | No | An Analysis of Neural Language Modeling at Multi... | 2018-03-22 | Code |
| 73 | LSTM (Hebbian) | 34.3 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | - |
| 74 | LSTM | 36.4 | No | Fast Parametric Learning with Activation Memoriz... | 2018-03-27 | - |
| 75 | GCNN-14 | 37.2 | No | Language Modeling with Gated Convolutional Netwo... | 2016-12-23 | Code |
| 76 | GPT-2 Small | 37.5 | Yes | - | - | Code |
| 77 | Neural cache model (size = 2,000) | 40.8 | No | Improving Neural Language Models with a Continuo... | 2016-12-13 | Code |
| 78 | Neural cache model (size = 100) | 44.8 | No | Improving Neural Language Models with a Continuo... | 2016-12-13 | Code |
| 79 | GCNN-8 | 44.9 | No | Language Modeling with Gated Convolutional Netwo... | 2016-12-23 | Code |
| 80 | TCN | 45.19 | No | An Empirical Evaluation of Generic Convolutional... | 2018-03-04 | Code |
| 81 | Temporal CNN | 45.2 | No | - | - | - |
| 82 | LSTM | 48.7 | No | Improving Neural Language Models with a Continuo... | 2016-12-13 | Code |
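The ranking metric above is test perplexity (lower is better): the exponential of the mean negative log-likelihood that a model assigns to each token of the held-out set. Below is a minimal sketch of how the column's values are computed; the `token_logprobs` input and the toy constant-probability model are illustrative assumptions, not the evaluation code of any listed paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities a language model assigns
    to each ground-truth token of the test set (hypothetical input).
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: a model that assigns probability 0.05 to every token
# has perplexity 1 / 0.05 = 20, in the range of the mid-table entries.
print(perplexity([math.log(0.05)] * 1000))  # ~20.0
```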