| Rank | Model | BLEU | Uses Extra Training Data | Paper | Date | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Transformer Cycle (Rev) | 35.14 | No | Lessons on Parameter Sharing across Layers in Tr... | 2021-04-13 | Code |
| 2 | Noisy back-translation | 35.0 | Yes | Understanding Back-Translation at Scale | 2018-08-28 | Code |
| 3 | Transformer+Rep(Uni) | 33.89 | No | Rethinking Perturbations in Encoder-Decoders for... | 2021-04-05 | Code |
| 4 | T5-11B | 32.1 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 5 | BiBERT | 31.26 | No | BERT, mBERT, or BiBERT? A Study on Contextualize... | 2021-09-09 | Code |
| 6 | Transformer + R-Drop | 30.91 | No | R-Drop: Regularized Dropout for Neural Networks | 2021-06-28 | Code |
| 7 | Bi-SimCut | 30.78 | No | Bi-SimCut: A Simple Strategy for Boosting Neural... | 2022-06-06 | Code |
| 8 | BERT-fused NMT | 30.75 | No | Incorporating BERT into Neural Machine Translation | 2020-02-17 | Code |
| 9 | Data Diversification - Transformer | 30.7 | No | Data Diversification: A Simple Strategy For Neur... | 2019-11-05 | Code |
| 10 | SimCut | 30.56 | No | Bi-SimCut: A Simple Strategy for Boosting Neural... | 2022-06-06 | Code |
| 11 | Mask Attention Network (big) | 30.4 | No | Mask Attention Networks: Rethinking and Strength... | 2021-03-25 | Code |
| 12 | Transformer (ADMIN init) | 30.1 | No | Very Deep Transformers for Neural Machine Transl... | 2020-08-18 | Code |
| 13 | PowerNorm (Transformer) | 30.1 | No | PowerNorm: Rethinking Batch Normalization in Tra... | 2020-03-17 | Code |
| 14 | Depth Growing | 30.07 | No | Depth Growing for Neural Machine Translation | 2019-07-03 | Code |
| 15 | MUSE(Parallel Multi-scale Attention) | 29.9 | No | MUSE: Parallel Multi-Scale Attention for Sequenc... | 2019-11-17 | Code |
| 16 | Evolved Transformer Big | 29.8 | No | The Evolved Transformer | 2019-01-30 | Code |
| 17 | OmniNetP | 29.8 | No | OmniNet: Omnidirectional Representations from Tr... | 2021-03-01 | Code |
| 18 | DynamicConv | 29.7 | No | Pay Less Attention with Lightweight and Dynamic ... | 2019-01-29 | Code |
| 19 | Local Joint Self-attention | 29.7 | No | Joint Source-Target Self Attention with Locality... | 2019-05-16 | Code |
| 20 | TaLK Convolutions | 29.6 | No | Time-aware Large Kernel Convolutions | 2020-02-08 | Code |
| 21 | Transformer Big + MoS | 29.6 | No | Fast and Simple Mixture of Softmaxes with BPE an... | 2018-09-25 | Code |
| 22 | AdvAug (aut+adv) | 29.57 | No | AdvAug: Robust Adversarial Augmentation for Neur... | 2020-06-21 | - |
| 23 | PartialFormer | 29.56 | No | PartialFormer: Modeling Part Instead of Whole fo... | 2023-10-23 | Code |
| 24 | Transformer Big + adversarial MLE | 29.52 | No | Improving Neural Language Modeling via Adversari... | 2019-06-10 | Code |
| 25 | Transformer Big | 29.3 | No | Scaling Neural Machine Translation | 2018-06-01 | Code |
| 26 | Subformer-xlarge | 29.3 | No | - | - | - |
| 27 | SB-NMT | 29.21 | No | Synchronous Bidirectional Neural Machine Transla... | 2019-05-13 | Code |
| 28 | Transformer (big) + Relative Position Representations | 29.2 | No | Self-Attention with Relative Position Representa... | 2018-03-06 | Code |
| 29 | FLOATER-large | 29.2 | No | Learning to Encode Position for Transformer with... | 2020-03-13 | Code |
| 30 | Local Transformer | 29.2 | No | Modeling Localness for Self-Attention Networks | 2018-10-24 | - |
| 31 | Transformer Big with FRAGE | 29.11 | No | FRAGE: Frequency-Agnostic Word Representation | 2018-09-18 | Code |
| 32 | Mask Attention Network (base) | 29.1 | No | Mask Attention Networks: Rethinking and Strength... | 2021-03-25 | Code |
| 33 | Mega | 29.01 | No | Mega: Moving Average Equipped Gated Attention | 2022-09-21 | Code |
| 34 | adequacy-oriented NMT | 28.99 | No | Neural Machine Translation with Adequacy-Oriente... | 2018-11-21 | - |
| 35 | LightConv | 28.9 | No | Pay Less Attention with Lightweight and Dynamic ... | 2019-01-29 | Code |
| 36 | Weighted Transformer (large) | 28.9 | No | Weighted Transformer Network for Machine Transla... | 2017-11-06 | Code |
| 37 | universal transformer base | 28.9 | No | Universal Transformers | 2018-07-10 | Code |
| 38 | KERMIT | 28.7 | No | KERMIT: Generative Insertion-Based Modeling for ... | 2019-06-04 | - |
| 39 | T2R + Pretrain | 28.7 | No | Finetuning Pretrained Transformers into RNNs | 2021-03-24 | Code |
| 40 | AdvAug (aut) | 28.58 | No | AdvAug: Robust Adversarial Augmentation for Neur... | 2020-06-21 | - |
| 41 | RNMT+ | 28.5 | No | The Best of Both Worlds: Combining Recent Advanc... | 2018-04-26 | Code |
| 42 | Synthesizer (Random + Vanilla) | 28.47 | No | Synthesizer: Rethinking Self-Attention in Transf... | 2020-05-02 | Code |
| 43 | Hardware Aware Transformer | 28.4 | No | HAT: Hardware-Aware Transformers for Efficient N... | 2020-05-28 | Code |
| 44 | Transformer Big | 28.4 | No | Attention Is All You Need | 2017-06-12 | Code |
| 45 | Transformer + SRU | 28.4 | No | Simple Recurrent Units for Highly Parallelizable... | 2017-09-08 | Code |
| 46 | Evolved Transformer Base | 28.4 | No | The Evolved Transformer | 2019-01-30 | Code |
| 47 | Rfa-Gate-arccos | 28.2 | No | Random Feature Attention | 2021-03-03 | - |
| 48 | Transformer-DRILL Base | 28.1 | No | Deep Residual Output Layers for Neural Language ... | 2019-05-14 | Code |
| 49 | AdvAug (mixup) | 28.08 | No | AdvAug: Robust Adversarial Augmentation for Neur... | 2020-06-21 | - |
| 50 | CMLM+LAT+4 iterations | 27.35 | No | Incorporating a Local Translation Mechanism into... | 2020-11-12 | Code |
| 51 | Transformer Base | 27.3 | No | Attention Is All You Need | 2017-06-12 | Code |
| 52 | Levenshtein Transformer (distillation) | 27.27 | No | Levenshtein Transformer | 2019-05-27 | Code |
| 53 | DisCo + Mask-Predict (non-autoregressive) | 27.06 | No | - | - | Code |
| 54 | Adaptively Sparse Transformer (alpha-entmax) | 26.93 | No | Adaptively Sparse Transformers | 2019-08-30 | Code |
| 55 | ResMLP-12 | 26.8 | No | ResMLP: Feedforward networks for image classific... | 2021-05-07 | Code |
| 56 | CNAT | 26.6 | No | Non-Autoregressive Translation by Learning Targe... | 2021-03-21 | Code |
| 57 | Lite Transformer | 26.5 | No | Lite Transformer with Long-Short Range Attention | 2020-04-24 | Code |
| 58 | ConvS2S (ensemble) | 26.4 | No | Convolutional Sequence to Sequence Learning | 2017-05-08 | Code |
| 59 | ResMLP-6 | 26.4 | No | ResMLP: Feedforward networks for image classific... | 2021-05-07 | Code |
| 60 | Average Attention Network | 26.31 | No | Accelerating Neural Transformer via an Average A... | 2018-05-02 | Code |
| 61 | GNMT+RL | 26.3 | No | Google's Neural Machine Translation System: Brid... | 2016-09-26 | Code |
| 62 | SliceNet | 26.1 | No | Depthwise Separable Convolutions for Neural Mach... | 2017-06-09 | Code |
| 63 | Average Attention Network (w/o FFN) | 26.05 | No | Accelerating Neural Transformer via an Average A... | 2018-05-02 | Code |
| 64 | MoE | 26.03 | No | Outrageously Large Neural Networks: The Sparsely... | 2017-01-23 | Code |
| 65 | Average Attention Network (w/o gate) | 25.91 | No | Accelerating Neural Transformer via an Average A... | 2018-05-02 | Code |
| 66 | Adaptively Sparse Transformer (1.5-entmax) | 25.89 | No | Adaptively Sparse Transformers | 2019-08-30 | Code |
| 67 | DenseNMT | 25.52 | No | Dense Information Flow for Neural Machine Transl... | 2018-06-03 | Code |
| 68 | GLAT | 25.21 | No | Glancing Transformer for Non-Autoregressive Neur... | 2020-08-18 | Code |
| 69 | CMLM+LAT+1 iterations | 25.2 | No | Incorporating a Local Translation Mechanism into... | 2020-11-12 | Code |
| 70 | ConvS2S | 25.16 | No | Convolutional Sequence to Sequence Learning | 2017-05-08 | Code |
| 71 | ByteNet | 23.75 | No | Neural Machine Translation in Linear Time | 2016-10-31 | Code |
| 72 | FlowSeq-large (NPD n = 30) | 23.64 | No | FlowSeq: Non-Autoregressive Conditional Sequence... | 2019-09-05 | Code |
| 73 | FlowSeq-large (NPD n = 15) | 23.14 | No | FlowSeq: Non-Autoregressive Conditional Sequence... | 2019-09-05 | Code |
| 74 | FlowSeq-large (IWD n = 15) | 22.94 | No | FlowSeq: Non-Autoregressive Conditional Sequence... | 2019-09-05 | Code |
| 75 | Denoising autoencoders (non-autoregressive) | 21.54 | No | Deterministic Non-Autoregressive Neural Sequence... | 2018-02-19 | Code |
| 76 | RNN Enc-Dec Att | 20.9 | No | Effective Approaches to Attention-based Neural M... | 2015-08-17 | Code |
| 77 | FlowSeq-large | 20.85 | No | FlowSeq: Non-Autoregressive Conditional Sequence... | 2019-09-05 | Code |
| 78 | PBMT | 20.7 | No | - | - | - |
| 79 | Deep-Att | 20.7 | No | Deep Recurrent Models with Fast-Forward Connecti... | 2016-06-14 | Code |
| 80 | Phrase Based MT | 20.7 | No | - | - | - |
| 81 | PBSMT + NMT | 20.23 | No | Phrase-Based & Neural Unsupervised Machine Trans... | 2018-04-20 | Code |
| 82 | NAT +FT + NPD | 19.17 | No | Non-Autoregressive Neural Machine Translation | 2017-11-07 | Code |
| 83 | FlowSeq-base | 18.55 | No | FlowSeq: Non-Autoregressive Conditional Sequence... | 2019-09-05 | Code |
| 84 | Seq-KD + Seq-Inter + Word-KD | 18.5 | No | Sequence-Level Knowledge Distillation | 2016-06-25 | Code |
| 85 | Unsupervised PBSMT | 17.94 | No | Phrase-Based & Neural Unsupervised Machine Trans... | 2018-04-20 | Code |
| 86 | NSE-NSE | 17.9 | No | Neural Semantic Encoders | 2016-07-14 | Code |
| 87 | Unsupervised NMT + Transformer | 17.16 | No | Phrase-Based & Neural Unsupervised Machine Trans... | 2018-04-20 | Code |
| 88 | SMT + iterative backtranslation (unsupervised) | 14.08 | No | Unsupervised Statistical Machine Translation | 2018-09-04 | Code |
| 89 | Reverse RNN Enc-Dec | 14.0 | No | Effective Approaches to Attention-based Neural M... | 2015-08-17 | Code |
| 90 | RNN Enc-Dec | 11.3 | No | Effective Approaches to Attention-based Neural M... | 2015-08-17 | Code |
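The scores above are BLEU values. As background, here is a minimal self-contained sketch of corpus-level BLEU (single reference per sentence, uniform n-gram weights up to 4, no smoothing); the function and variable names are illustrative, and published results are typically computed with a standardized tool such as sacreBLEU rather than a hand-rolled implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU: geometric mean of modified (clipped) n-gram
    precisions for n = 1..max_n, multiplied by a brevity penalty.
    `references` and `hypotheses` are parallel lists of token lists."""
    clipped = [0] * max_n   # clipped n-gram matches, per n
    totals = [0] * max_n    # total hypothesis n-grams, per n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            totals[n - 1] += max(len(hyp) - n + 1, 0)
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in hyp_counts.items())
    if min(clipped) == 0:
        return 0.0  # any zero precision collapses the geometric mean
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

A perfect hypothesis scores 100; a hypothesis sharing no 4-gram with its reference scores 0 under this unsmoothed variant.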