| Rank | Model | Test perplexity | Uses extra training data | Paper | Date | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | SparseGPT (175B, 50% Sparsity) | 8.21 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |
| 2 | OPT-175B | 8.34 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |
| 3 | SparseGPT (175B, 4:8 Sparsity) | 8.45 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |
| 4 | SparseGPT (175B, 2:4 Sparsity) | 8.73 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |
| 5 | GPT-2 (fine-tuned) | 15.17 | Yes | Hydra: A System for Large Multi-Model Deep Learn... | 2021-10-16 | Code |
| 6 | GPT-2 | 18.34 | Yes | - | - | Code |
| 7 | GPT-2 (large) | 19.93 | Yes | - | - | Code |
| 8 | GPT-2 (medium) | 22.76 | Yes | - | - | Code |
| 9 | GPT-2 (small) | 29.41 | Yes | - | - | Code |
| 10 | BERT-Large-CAS | 34.1 | Yes | Language Models with Transformers | 2019-04-20 | Code |
| 11 | Mogrifier LSTM + dynamic eval | 38.6 | No | Mogrifier LSTM | 2019-09-04 | Code |
| 12 | adversarial + AWD-LSTM-MoS + dynamic eval | 38.65 | No | Improving Neural Language Modeling via Adversari... | 2019-06-10 | Code |
| 13 | FRAGE + AWD-LSTM-MoS + dynamic eval | 39.14 | No | FRAGE: Frequency-Agnostic Word Representation | 2018-09-18 | Code |
| 14 | Past Decode Reg. + AWD-LSTM-MoS + dyn. eval. | 40.3 | No | Improved Language Modeling by Decoding the Past | 2018-08-14 | - |
| 15 | GL-LWGC + AWD-MoS-LSTM + dynamic eval | 40.46 | No | Gradual Learning of Recurrent Neural Networks | 2017-08-29 | Code |
| 16 | AWD-LSTM-MoS + dynamic eval | 40.68 | No | Breaking the Softmax Bottleneck: A High-Rank RNN... | 2017-11-10 | Code |
| 17 | AWD-LSTM-DRILL + dynamic eval | 42 | No | Deep Residual Output Layers for Neural Language ... | 2019-05-14 | Code |
| 18 | AWD-LSTM + dynamic eval | 44.3 | No | Dynamic Evaluation of Neural Sequence Models | 2017-09-21 | Code |
| 19 | AWD-LSTM + continuous cache pointer | 52 | No | Regularizing and Optimizing LSTM Language Models | 2017-08-07 | Code |
| 20 | AWD-LSTM-DOC x5 | 53.09 | No | Direct Output Connection for a High-Rank Languag... | 2018-08-30 | Code |
| 21 | Ensemble of All | 53.73 | No | Advancing State of the Art in Language Modeling | 2023-11-28 | Code |
| 22 | Mogrifier LSTM | 55.1 | No | Mogrifier LSTM | 2019-09-04 | Code |
| 23 | AWD-LSTM-DOC + Partial Shuffle | 57.85 | No | Partially Shuffling the Training Data to Improve... | 2019-03-11 | Code |
| 24 | AWD-LSTM-DOC | 58.03 | No | Direct Output Connection for a High-Rank Languag... | 2018-08-30 | Code |
| 25 | AWD-LSTM-MoS + Partial Shuffle | 59.98 | No | Partially Shuffling the Training Data to Improve... | 2019-03-11 | Code |
| 26 | AWD-LSTM-MoS | 61.45 | No | Breaking the Softmax Bottleneck: A High-Rank RNN... | 2017-11-10 | Code |
| 27 | AWD-FWM Schlag et al. (2020) | 61.65 | No | Learning Associative Inference Using Fast Weight... | 2020-11-16 | Code |
| 28 | AWD-LSTM-DRILL | 61.9 | No | Deep Residual Output Layers for Neural Language ... | 2019-05-14 | Code |
| 29 | AWD-LSTM 3-layer with Fraternal dropout | 64.1 | No | Fraternal Dropout | 2017-10-31 | Code |
| 30 | AWD-LSTM + ATOI | 64.73 | No | Alleviating Sequence Information Loss with Data ... | 2019-09-18 | Code |
| 31 | AWD-LSTM | 65.8 | No | Regularizing and Optimizing LSTM Language Models | 2017-08-07 | Code |
| 32 | Melis et al. (2017) - 1-layer LSTM (tied) | 65.9 | No | On the State of the Art of Evaluation in Neural ... | 2017-07-18 | Code |
| 33 | Grave et al. (2016) - LSTM + continuous cache pointer | 68.9 | No | Improving Neural Language Models with a Continuo... | 2016-12-13 | Code |
| 34 | EGRU | 68.9 | No | Efficient recurrent architectures through activi... | 2022-06-13 | Code |
| 35 | Inan et al. (2016) - Variational LSTM (tied) (h=650) + augmented loss | 87 | No | Tying Word Vectors and Word Classifiers: A Loss ... | 2016-11-04 | Code |
| 36 | Inan et al. (2016) - Variational LSTM (tied) (h=650) | 87.7 | No | Tying Word Vectors and Word Classifiers: A Loss ... | 2016-11-04 | Code |
| 37 | Grave et al. (2016) - LSTM | 99.3 | No | Improving Neural Language Models with a Continuo... | 2016-12-13 | Code |
| 38 | OPT-175B (50% Sparsity) | 234.77 | Yes | SparseGPT: Massive Language Models Can Be Accura... | 2023-01-02 | Code |