| Rank | Model | Bits per Character (BPC) | Extra Training Data | Paper | Date | Code |
|------|-------|--------------------------|---------------------|-------|------|------|
| 1 | LSTM (7 layers) | 1.67 | No | Generating Sequences With Recurrent Neural Networks | 2013-08-04 | Code |
| 2 | Hypernetworks | 1.34 | No | HyperNetworks | 2016-09-27 | Code |
| 3 | SHA-LSTM (4 layers, h=1024, no attention head) | 1.33 | No | Single Headed Attention RNN: Stop Thinking With Your Head | 2019-11-26 | Code |
| 4 | LN HM-LSTM | 1.32 | No | Hierarchical Multiscale Recurrent Neural Networks | 2016-09-06 | Code |
| 5 | ByteNet | 1.31 | No | Neural Machine Translation in Linear Time | 2016-10-31 | Code |
| 6 | Recurrent Highway Networks | 1.27 | No | Recurrent Highway Networks | 2016-07-12 | Code |
| 7 | Large FS-LSTM-4 | 1.25 | No | Fast-Slow Recurrent Neural Networks | 2017-05-24 | Code |
| 8 | Large mLSTM | 1.24 | No | Multiplicative LSTM for sequence modelling | 2016-09-26 | Code |
| 9 | AWD-LSTM (3 layers) | 1.232 | No | An Analysis of Neural Language Modeling at Multiple Scales | 2018-03-22 | Code |
| 10 | Cluster-Former (#C=512) | 1.22 | No | Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding | 2020-09-13 | - |
| 11 | LSTM | 1.195 | No | Mogrifier LSTM | 2019-09-04 | Code |
| 12 | Mogrifier LSTM | 1.146 | No | Mogrifier LSTM | 2019-09-04 | Code |
| 13 | 64-layer Character Transformer Model | 1.11 | No | Character-Level Language Modeling with Deeper Self-Attention | 2018-08-09 | Code |
| 14 | SHA-RNN (4 layers, h=1024, single attention head) | 1.076 | No | Single Headed Attention RNN: Stop Thinking With Your Head | 2019-11-26 | Code |
| 15 | SHA-RNN (4 layers, h=1024, attention head per layer) | 1.068 | No | Single Headed Attention RNN: Stop Thinking With Your Head | 2019-11-26 | Code |
| 16 | Transformer (64 layers) | 1.06 | No | Character-Level Language Modeling with Deeper Self-Attention | 2018-08-09 | Code |
| 17 | Transformer-XL (12 layers) | 1.06 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 18 | Skip Cross-Head Transformer-XL | 1.033 | No | Memory-efficient Stochastic methods for Memory-based Transformers | 2023-11-14 | Code |
| 19 | Transformer-XL (18 layers) | 1.03 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 20 | Transformer+SSA | 1.024 | No | The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles | 2023-06-02 | Code |
| 21 | Transformer (12 layers, 8k adaptive span) | 1.02 | No | Adaptive Attention Span in Transformers | 2019-05-19 | Code |
| 22 | BP-Transformer (12 layers) | 1.02 | No | BP-Transformer: Modelling Long-Range Context via Binary Partitioning | 2019-11-11 | Code |
| 23 | All-attention network (18 layers) | 1.01 | No | Augmenting Self-attention with Persistent Memory | 2019-07-02 | Code |
| 24 | Longformer (12 layers, h=512) | 1.00 | No | Longformer: The Long-Document Transformer | 2020-04-10 | Code |
| 25 | Hourglass | 0.997 | No | Hierarchical Transformers Are More Efficient Language Models | 2021-10-26 | Code |
| 26 | Transformer-XL (24 layers) | 0.99 | No | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 2019-01-09 | Code |
| 27 | Longformer (30 layers, h=512) | 0.99 | No | Longformer: The Long-Document Transformer | 2020-04-10 | Code |
| 28 | Sparse Transformer (30 layers, fixed attn) | 0.99 | No | Generating Long Sequences with Sparse Transformers | 2019-04-23 | Code |
| 29 | Routing Transformer (12 layers) | 0.99 | No | Efficient Content-Based Sparse Attention with Routing Transformers | 2020-03-12 | Code |
| 30 | Transformer-LS (small) | 0.99 | No | Long-Short Transformer: Efficient Transformers for Language and Vision | 2021-07-05 | Code |
| 31 | Transformer (24 layers, 8k adaptive span) | 0.98 | No | Adaptive Attention Span in Transformers | 2019-05-19 | Code |
| 32 | Compressive Transformer (24 layers) | 0.97 | No | Compressive Transformers for Long-Range Sequence Modelling | 2019-11-13 | Code |
| 33 | Transformer-LS (large) | 0.97 | No | Long-Short Transformer: Efficient Transformers for Language and Vision | 2021-07-05 | Code |
| 34 | SRU++ Base | 0.97 | No | When Attention Meets Fast Recurrence: Training Language Models with Reduced GPU Memory | 2021-02-24 | Code |
| 35 | Sandwich Transformer (adaptive span) | 0.968 | No | Improving Transformer Models by Reordering their Sublayers | 2019-11-10 | Code |
| 36 | Feedback Transformer | 0.96 | No | Addressing Some Limitations of Transformers with Feedback Memory | 2020-02-21 | Code |
| 37 | Expire-Span (24 layers) | 0.95 | No | Not All Memories are Created Equal: Learning to Forget by Expiring | 2021-05-13 | Code |
| 38 | SRU++ Large | 0.95 | No | When Attention Meets Fast Recurrence: Training Language Models with Reduced GPU Memory | 2021-02-24 | Code |
| 39 | Transformer-XL (24 layers, RMS dynamic eval, decay) | 0.94 | Yes | Dynamic Evaluation of Transformer Language Models | 2019-04-17 | Code |
| 40 | Focus | 0.94 | No | Focus Your Attention (with Adaptive IIR Filters) | 2023-05-24 | - |
| 41 | GPT-2 (48 layers, h=1600) | 0.93 | Yes | - | - | Code |
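All results above are reported in bits per character (BPC): the model's average negative log-likelihood per character of the test set, expressed in base 2, so lower is better. As a minimal sketch of the conversion (the function and variable names are illustrative, not drawn from any listed paper), a cross-entropy measured in nats is divided by ln 2:

```python
import math

def bits_per_character(total_nll_nats: float, num_chars: int) -> float:
    """Convert a test-set negative log-likelihood summed in nats
    (natural log) into bits per character: BPC = NLL / (N * ln 2)."""
    return total_nll_nats / (num_chars * math.log(2))

# A mean cross-entropy of ~0.65 nats/char is ~0.94 BPC, i.e. roughly
# the strongest entries in the table above.
print(bits_per_character(total_nll_nats=0.65 * 5_000_000,
                         num_chars=5_000_000))  # ~0.938
```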