| 1 | GPT-3 (Zero-Shot) | 20.5 | Yes | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 2 | BERT-Large-CAS | 31.3 | Yes | Language Models with Transformers | 2019-04-20 | Code |
| 3 | GPT-2 | 35.76 | Yes | - | - | Code |
| 4 | Mogrifier LSTM + dynamic eval | 44.9 | No | Mogrifier LSTM | 2019-09-04 | Code |
| 5 | adversarial + AWD-LSTM-MoS + dynamic eval | 46.01 | No | Improving Neural Language Modeling via Adversari... | 2019-06-10 | Code |
| 6 | GL-LWGC + AWD-MoS-LSTM + dynamic eval | 46.34 | No | Gradual Learning of Recurrent Neural Networks | 2017-08-29 | Code |
| 7 | FRAGE + AWD-LSTM-MoS + dynamic eval | 46.54 | No | FRAGE: Frequency-Agnostic Word Representation | 2018-09-18 | Code |
| 8 | AWD-LSTM-DOC x5 | 47.17 | No | Direct Output Connection for a High-Rank Languag... | 2018-08-30 | Code |
| 9 | Past Decode Reg. + AWD-LSTM-MoS + dynamic eval | 47.3 | No | Improved Language Modeling by Decoding the Past | 2018-08-14 | - |
| 10 | Ensemble of All | 47.31 | No | Advancing State of the Art in Language Modeling | 2023-11-28 | Code |
| 11 | AWD-LSTM-MoS + dynamic eval | 47.69 | No | Breaking the Softmax Bottleneck: A High-Rank RNN... | 2017-11-10 | Code |
| 12 | AWD-LSTM-DRILL + dynamic eval | 49.4 | No | Deep Residual Output Layers for Neural Language ... | 2019-05-14 | Code |
| 13 | Dense IndRNN + dynamic eval | 50.97 | No | Deep Independently Recurrent Neural Network (Ind... | 2019-10-11 | Code |
| 14 | AWD-LSTM + dynamic eval | 51.1 | No | Dynamic Evaluation of Neural Sequence Models | 2017-09-21 | Code |
| 15 | AWD-LSTM-DOC + Partial Shuffle | 52 | No | Partially Shuffling the Training Data to Improve... | 2019-03-11 | Code |
| 16 | AWD-LSTM-DOC | 52.38 | No | Direct Output Connection for a High-Rank Languag... | 2018-08-30 | Code |
| 17 | AWD-LSTM + continuous cache pointer | 52.8 | No | Regularizing and Optimizing LSTM Language Models | 2017-08-07 | Code |
| 18 | AWD-LSTM-MoS + Partial Shuffle | 53.92 | No | Partially Shuffling the Training Data to Improve... | 2019-03-11 | Code |
| 19 | Trellis Network | 54.19 | No | Trellis Networks for Sequence Modeling | 2018-10-15 | Code |
| 20 | AWD-LSTM-MoS | 54.44 | No | Breaking the Softmax Bottleneck: A High-Rank RNN... | 2017-11-10 | Code |
| 21 | AWD-FWM Schlag et al. (2020) | 54.48 | No | Learning Associative Inference Using Fast Weight... | 2020-11-16 | Code |
| 22 | Transformer-XL | 54.55 | No | Transformer-XL: Attentive Language Models Beyond... | 2019-01-09 | Code |
| 23 | Transformer-XL + AutoDropout | 54.9 | No | AutoDropout: Learning Dropout Patterns to Regula... | 2021-01-05 | Code |
| 24 | 2-layer skip-LSTM + dropout tuning | 55.3 | No | Pushing the bounds of dropout | 2018-05-23 | Code |
| 25 | AWD-LSTM-DRILL | 55.7 | No | Deep Residual Output Layers for Neural Language ... | 2019-05-14 | Code |
| 26 | Differentiable NAS | 56.1 | No | DARTS: Differentiable Architecture Search | 2018-06-24 | Code |
| 27 | Dense IndRNN | 56.37 | No | Deep Independently Recurrent Neural Network (Ind... | 2019-10-11 | Code |
| 28 | AWD-LSTM 3-layer with Fraternal dropout | 56.8 | No | Fraternal Dropout | 2017-10-31 | Code |
| 29 | DEQ-TrellisNet | 57.1 | No | Deep Equilibrium Models | 2019-09-03 | Code |
| 30 | AWD-LSTM | 57.3 | No | Regularizing and Optimizing LSTM Language Models | 2017-08-07 | Code |
| 31 | Efficient NAS | 58.6 | No | Efficient Neural Architecture Search via Paramet... | 2018-02-09 | Code |
| 32 | NAS-RL | 64 | No | Neural Architecture Search with Reinforcement Le... | 2016-11-05 | Code |
| 33 | Recurrent highway networks | 65.4 | No | Recurrent Highway Networks | 2016-07-12 | Code |
| 34 | Inan et al. (2016) - Variational RHN | 66 | No | Tying Word Vectors and Word Classifiers: A Loss ... | 2016-11-04 | Code |
| 35 | Gal & Ghahramani (2016) - Variational LSTM (large) | 75.2 | No | A Theoretically Grounded Application of Dropout ... | 2015-12-16 | Code |
| 36 | Zaremba et al. (2014) - LSTM (large) | 78.4 | No | Recurrent Neural Network Regularization | 2014-09-08 | Code |
| 37 | LSTM (Bai et al., 2018) | 78.93 | No | An Empirical Evaluation of Generic Convolutional... | 2018-03-04 | Code |
| 38 | Gal & Ghahramani (2016) - Variational LSTM (medium) | 79.7 | No | A Theoretically Grounded Application of Dropout ... | 2015-12-16 | Code |
| 39 | Zaremba et al. (2014) - LSTM (medium) | 82.7 | No | Recurrent Neural Network Regularization | 2014-09-08 | Code |
| 40 | R-Transformer | 84.38 | No | R-Transformer: Recurrent Neural Network Enhanced... | 2019-07-12 | Code |
| 41 | GRU (Bai et al., 2018) | 92.48 | No | An Empirical Evaluation of Generic Convolutional... | 2018-03-04 | Code |
| 42 | Seq-U-Net | 107.95 | No | Seq-U-Net: A One-Dimensional Causal U-Net for Ef... | 2019-11-14 | Code |
| 43 | TCN | 108.47 | No | Seq-U-Net: A One-Dimensional Causal U-Net for Ef... | 2019-11-14 | Code |