| Rank | Model | BLEU | Extra Training Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Transformer+BT (ADMIN init) | 46.4 | Yes | Very Deep Transformers for Neural Machine Translation | 2020-08-18 | Code |
| 2 | Noisy back-translation | 45.6 | Yes | Understanding Back-Translation at Scale | 2018-08-28 | Code |
| 3 | mRASP+Fine-Tune | 44.3 | Yes | Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information | 2020-10-07 | Code |
| 4 | Transformer + R-Drop | 43.95 | No | R-Drop: Regularized Dropout for Neural Networks | 2021-06-28 | Code |
| 5 | Transformer (ADMIN init) | 43.8 | No | Very Deep Transformers for Neural Machine Translation | 2020-08-18 | Code |
| 6 | Admin | 43.8 | No | Understanding the Difficulty of Training Transformers | 2020-04-17 | Code |
| 7 | BERT-fused NMT | 43.78 | Yes | Incorporating BERT into Neural Machine Translation | 2020-02-17 | Code |
| 8 | MUSE (Parallel Multi-scale Attention) | 43.5 | No | MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning | 2019-11-17 | Code |
| 9 | T5 | 43.4 | Yes | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | 2019-10-23 | Code |
| 10 | Local Joint Self-attention | 43.3 | No | Joint Source-Target Self Attention with Locality Constraints | 2019-05-16 | Code |
| 11 | Depth Growing | 43.27 | No | Depth Growing for Neural Machine Translation | 2019-07-03 | Code |
| 12 | Transformer Big | 43.2 | No | Scaling Neural Machine Translation | 2018-06-01 | Code |
| 13 | DynamicConv | 43.2 | No | Pay Less Attention with Lightweight and Dynamic Convolutions | 2019-01-29 | Code |
| 14 | TaLK Convolutions | 43.2 | No | Time-aware Large Kernel Convolutions | 2020-02-08 | Code |
| 15 | LightConv | 43.1 | No | Pay Less Attention with Lightweight and Dynamic Convolutions | 2019-01-29 | Code |
| 16 | FLOATER-large | 42.7 | No | Learning to Encode Position for Transformer with Continuous Dynamical Model | 2020-03-13 | Code |
| 17 | OmniNetP | 42.6 | No | OmniNet: Omnidirectional Representations from Transformers | 2021-03-01 | Code |
| 18 | Transformer Big + MoS | 42.1 | No | Fast and Simple Mixture of Softmaxes with BPE and Hybrid-LightRNN for Language Generation | 2018-09-25 | Code |
| 19 | T2R + Pretrain | 42.1 | No | Finetuning Pretrained Transformers into RNNs | 2021-03-24 | Code |
| 20 | Synthesizer (Random + Vanilla) | 41.85 | No | Synthesizer: Rethinking Self-Attention in Transformer Models | 2020-05-02 | Code |
| 21 | Hardware Aware Transformer | 41.8 | No | HAT: Hardware-Aware Transformers for Efficient Natural Language Processing | 2020-05-28 | Code |
| 22 | Transformer (big) + Relative Position Representations | 41.5 | No | Self-Attention with Relative Position Representations | 2018-03-06 | Code |
| 23 | Stack 4-layer RNNSearch + Dual Learning + Deliberation Network | 41.5 | No | - | - | - |
| 24 | Weighted Transformer (large) | 41.4 | No | Weighted Transformer Network for Machine Translation | 2017-11-06 | Code |
| 25 | ConvS2S (ensemble) | 41.3 | No | Convolutional Sequence to Sequence Learning | 2017-05-08 | Code |
| 26 | Evolved Transformer Big | 41.3 | No | The Evolved Transformer | 2019-01-30 | Code |
| 27 | RNMT+ | 41.0 | No | The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation | 2018-04-26 | Code |
| 28 | Transformer Big | 41.0 | Yes | Attention Is All You Need | 2017-06-12 | Code |
| 29 | Evolved Transformer Base | 40.6 | No | The Evolved Transformer | 2019-01-30 | Code |
| 30 | ResMLP-12 | 40.6 | No | ResMLP: Feedforward networks for image classification with data-efficient training | 2021-05-07 | Code |
| 31 | MoE | 40.56 | No | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | 2017-01-23 | Code |
| 32 | Transformer | 40.5 | No | Memory-Efficient Adaptive Optimization | 2019-01-30 | Code |
| 33 | ConvS2S | 40.46 | No | Convolutional Sequence to Sequence Learning | 2017-05-08 | Code |
| 34 | ResMLP-6 | 40.3 | No | ResMLP: Feedforward networks for image classification with data-efficient training | 2021-05-07 | Code |
| 35 | TransformerBase + AutoDropout | 40.0 | No | AutoDropout: Learning Dropout Patterns to Regularize Deep Networks | 2021-01-05 | Code |
| 36 | GNMT+RL | 39.9 | No | Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation | 2016-09-26 | Code |
| 37 | Lite Transformer | 39.6 | No | Lite Transformer with Long-Short Range Attention | 2020-04-24 | Code |
| 38 | Deep-Att + PosUnk | 39.2 | No | Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation | 2016-06-14 | Code |
| 39 | Rfa-Gate-arccos | 39.2 | No | Random Feature Attention | 2021-03-03 | - |
| 40 | Transformer Base | 38.1 | No | Attention Is All You Need | 2017-06-12 | Code |
| 41 | LSTM6 + PosUnk | 37.5 | No | Addressing the Rare Word Problem in Neural Machine Translation | 2014-10-30 | Code |
| 42 | PBMT | 37.0 | No | - | - | - |
| 43 | SMT+LSTM5 | 36.5 | No | Sequence to Sequence Learning with Neural Networks | 2014-09-10 | Code |
| 44 | RNN-search50* | 36.2 | No | Neural Machine Translation by Jointly Learning to Align and Translate | 2014-09-01 | Code |
| 45 | Deep-Att | 35.9 | No | Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation | 2016-06-14 | Code |
| 46 | Deep Convolutional Encoder; single-layer decoder | 35.7 | No | A Convolutional Encoder Model for Neural Machine Translation | 2016-11-07 | Code |
| 47 | LSTM | 34.8 | No | Sequence to Sequence Learning with Neural Networks | 2014-09-10 | Code |
| 48 | CSLM + RNN + WP | 34.54 | No | Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation | 2014-06-03 | Code |
| 49 | FLAN 137B (zero-shot) | 33.9 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 50 | FLAN 137B (few-shot, k=9) | 33.8 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 51 | Regularized LSTM | 29.03 | No | Recurrent Neural Network Regularization | 2014-09-08 | Code |
| 52 | Unsupervised PBSMT | 28.11 | No | Phrase-Based & Neural Unsupervised Machine Translation | 2018-04-20 | Code |
| 53 | PBSMT + NMT | 27.6 | No | Phrase-Based & Neural Unsupervised Machine Translation | 2018-04-20 | Code |
| 54 | GRU+Attention | 26.4 | No | Can Active Memory Replace Attention? | 2016-10-27 | Code |
| 55 | SMT + iterative backtranslation (unsupervised) | 26.22 | No | Unsupervised Statistical Machine Translation | 2018-09-04 | Code |
| 56 | Unsupervised NMT + Transformer | 25.14 | No | Phrase-Based & Neural Unsupervised Machine Translation | 2018-04-20 | Code |
| 57 | Unsupervised attentional encoder-decoder + BPE | 14.36 | No | Unsupervised Neural Machine Translation | 2017-10-30 | Code |
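
All entries above are ranked by corpus-level BLEU as reported in each paper; tokenization and evaluation scripts vary between papers, so small score differences are not strictly comparable. As a minimal sketch of how such a score is typically computed today, the snippet below uses the sacreBLEU library on a toy hypothesis/reference pair; the sentences are illustrative placeholders, not outputs of any system in the table.

```python
# Minimal BLEU-scoring sketch with sacreBLEU (pip install sacrebleu).
# The hypotheses/references below are illustrative placeholders, not
# outputs of any system listed in the table above.
import sacrebleu

# One system output (hypothesis) per source sentence.
hypotheses = [
    "the cat sat on the mat",
    "he read the book because he was interested in world history",
]

# `references` is a list of reference streams: each inner list is
# parallel to `hypotheses` (add more streams for multi-reference BLEU).
references = [
    [
        "the cat sat on the mat",
        "he read the book because he was interested in world history",
    ]
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")  # corpus-level score on a 0-100 scale
```

Reporting the sacreBLEU signature (`bleu.get_signature()`) alongside the score is the usual way to make numbers comparable across papers, since it pins down the tokenizer, casing, and smoothing used.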