| Rank | Model | Accuracy (%) | Extra Training Data | Paper | Date | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Turing NLR v5 XXL 5.4B (fine-tuned) | 92.6 | No | - | - | - |
| 2 | UnitedSynT5 (3B) | 92.6 | Yes | First Train to Generate, then Generate to Train:... | 2024-12-12 | - |
| 3 | T5 | 92 | No | SMART: Robust and Efficient Fine-Tuning for Pre-... | 2019-11-08 | Code |
| 4 | T5-XXL 11B (fine-tuned) | 92 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 5 | T5-3B | 91.4 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 6 | ALBERT | 91.3 | No | ALBERT: A Lite BERT for Self-supervised Learning... | 2019-09-26 | Code |
| 7 | DeBERTa (large) | 91.1 | No | DeBERTa: Decoding-enhanced BERT with Disentangle... | 2020-06-05 | Code |
| 8 | Adv-RoBERTa ensemble | 91.1 | No | StructBERT: Incorporating Language Structures in... | 2019-08-13 | - |
| 9 | RoBERTa | 90.8 | No | RoBERTa: A Robustly Optimized BERT Pretraining A... | 2019-07-26 | Code |
| 10 | XLNet (single model) | 90.8 | No | XLNet: Generalized Autoregressive Pretraining fo... | 2019-06-19 | Code |
| 11 | RoBERTa-large 355M (MLP quantized vector-wise, fine-tuned) | 90.2 | No | LLM.int8(): 8-bit Matrix Multiplication for Tran... | 2022-08-15 | Code |
| 12 | T5-Large | 89.9 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 13 | PSQ (Chen et al., 2020) | 89.9 | No | A Statistical Framework for Low-bitwidth Trainin... | 2020-10-27 | Code |
| 14 | UnitedSynT5 (335M) | 89.8 | Yes | First Train to Generate, then Generate to Train:... | 2024-12-12 | - |
| 15 | ERNIE 2.0 Large | 88.7 | No | ERNIE 2.0: A Continual Pre-training Framework fo... | 2019-07-29 | Code |
| 16 | SpanBERT | 88.1 | No | SpanBERT: Improving Pre-training by Representing... | 2019-07-24 | Code |
| 17 | BERT-Large | 88 | No | FNet: Mixing Tokens with Fourier Transforms | 2021-05-09 | Code |
| 18 | ASA + RoBERTa | 88 | No | Adversarial Self-Attention for Language Understa... | 2022-06-25 | Code |
| 19 | MT-DNN-ensemble | 87.9 | No | Improving Multi-Task Deep Neural Networks via Kn... | 2019-04-20 | Code |
| 20 | Q-BERT (Shen et al., 2020) | 87.8 | No | Q-BERT: Hessian Based Ultra Low Precision Quanti... | 2019-09-12 | - |
| 21 | Snorkel MeTaL (ensemble) | 87.6 | No | Training Complex Models with Multi-Task Weak Sup... | 2018-10-05 | Code |
| 22 | BigBird | 87.5 | No | Big Bird: Transformers for Longer Sequences | 2020-07-28 | Code |
| 23 | T5-Base | 87.1 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 24 | MT-DNN | 86.7 | No | Multi-Task Deep Neural Networks for Natural Lang... | 2019-01-31 | Code |
| 25 | BERT-Large | 86.7 | No | BERT: Pre-training of Deep Bidirectional Transfo... | 2018-10-11 | Code |
| 26 | RealFormer | 86.28 | No | RealFormer: Transformer Likes Residual Attention | 2020-12-21 | Code |
| 27 | gMLP-large | 86.2 | No | Pay Attention to MLPs | 2021-05-17 | Code |
| 28 | ERNIE 2.0 Base | 86.1 | No | ERNIE 2.0: A Continual Pre-training Framework fo... | 2019-07-29 | Code |
| 29 | Q8BERT (Zafrir et al., 2019) | 85.6 | No | Q8BERT: Quantized 8Bit BERT | 2019-10-14 | Code |
| 30 | ASA + BERT-base | 85 | No | Adversarial Self-Attention for Language Understa... | 2022-06-25 | Code |
| 31 | TinyBERT-6 67M | 84.6 | No | TinyBERT: Distilling BERT for Natural Language U... | 2019-09-23 | Code |
| 32 | ELC-BERT-base 98M (zero init) | 84.4 | No | Not all layers are equally as important: Every L... | 2023-11-03 | - |
| 33 | 24hBERT | 84.4 | No | How to Train BERT with an Academic Budget | 2021-04-15 | Code |
| 34 | ERNIE | 84 | No | ERNIE: Enhanced Language Representation with Inf... | 2019-05-17 | Code |
| 35 | Charformer-Tall | 83.7 | No | Charformer: Fast Character Transformers via Grad... | 2021-06-23 | Code |
| 36 | LTG-BERT-base 98M | 83 | No | Not all layers are equally as important: Every L... | 2023-11-03 | - |
| 37 | TinyBERT-4 14.5M | 82.5 | No | TinyBERT: Distilling BERT for Natural Language U... | 2019-09-23 | Code |
| 38 | T5-Small | 82.4 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 39 | MFAE | 82.31 | No | - | - | Code |
| 40 | Finetuned Transformer LM | 82.1 | No | - | - | Code |
| 41 | SqueezeBERT | 82 | No | SqueezeBERT: What can computer vision teach NLP ... | 2020-06-19 | Code |
| 42 | GPST (unsupervised generative syntactic LM) | 81.8 | No | Generative Pretrained Structured Transformers: U... | 2024-03-13 | Code |
| 43 | ELC-BERT-small 24M | 79.2 | No | Not all layers are equally as important: Every L... | 2023-11-03 | - |
| 44 | LTG-BERT-small 24M | 78 | No | Not all layers are equally as important: Every L... | 2023-11-03 | - |
| 45 | FNet-Large | 78 | No | FNet: Mixing Tokens with Fourier Transforms | 2021-05-09 | Code |
| 46 | aESIM | 73.9 | No | Attention Boosted Sequential Inference Model | 2018-12-05 | - |
| 47 | T5-Large 738M | 72.4 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 48 | Multi-task BiLSTM + Attn | 72.2 | No | GLUE: A Multi-Task Benchmark and Analysis Platfo... | 2018-04-20 | Code |
| 49 | Stacked Bi-LSTMs (shortcut connections, max-pooling) | 71.4 | No | Combining Similarity Features and Deep Represent... | 2018-11-02 | Code |
| 50 | GenSen | 71.4 | No | Learning General Purpose Distributed Sentence Re... | 2018-03-30 | Code |
| 51 | Bi-LSTM sentence encoder (max-pooling) | 70.7 | No | Combining Similarity Features and Deep Represent... | 2018-11-02 | Code |
| 52 | Stacked Bi-LSTMs (shortcut connections, max-pooling, attention) | 70.7 | No | Combining Similarity Features and Deep Represent... | 2018-11-02 | Code |
| 53 | SWEM-max | 68.2 | No | Baseline Needs More Love: On Simple Word-Embeddi... | 2018-05-24 | Code |
| 54 | LaMini-GPT 1.5B | 67.5 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 55 | LaMini-F-T5 783M | 61.4 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 56 | LaMini-T5 738M | 54.7 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 57 | GPT-2-XL 1.5B | 36.5 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
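
To sanity-check a row, the sketch below runs a leaderboard-style evaluation. It assumes the metric is matched-validation accuracy on MultiNLI (consistent with the scores above, e.g. BERT-Large at 86.7) and uses the public `roberta-large-mnli` checkpoint as a stand-in for the rank-9 RoBERTa entry; the checkpoint name, batch size, and label remapping are assumptions of this sketch, not details taken from the leaderboard.

```python
# Minimal sketch: reproduce one leaderboard-style number, assuming the metric is
# matched-validation accuracy on MultiNLI. Checkpoint, batch size, and label
# remapping are assumptions of this sketch, not leaderboard details.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # public checkpoint; stands in for the rank-9 RoBERTa row
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

ds = load_dataset("multi_nli", split="validation_matched")

# The checkpoint's label order (e.g. 2 -> "ENTAILMENT") differs from the
# dataset's (0 -> "entailment"), so build an id-to-id mapping once.
names = ds.features["label"].names
to_ds = {i: names.index(n.lower()) for i, n in model.config.id2label.items()}

correct = 0
for start in range(0, len(ds), 32):
    rows = ds[start:start + 32]  # slicing a Dataset yields a dict of column lists
    enc = tok(rows["premise"], rows["hypothesis"],
              padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        preds = model(**enc).logits.argmax(dim=-1).tolist()
    correct += sum(to_ds[p] == y for p, y in zip(preds, rows["label"]))

print(f"accuracy: {correct / len(ds):.4f}")
```

On this split the checkpoint should land within about a point of the 90.8 reported in the RoBERTa row; exact agreement is not expected, since leaderboard numbers are typically reported on the test split.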
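Several entries (ranks 11, 13, 20, and 29) report accuracy for low-bit variants of existing models rather than new architectures. As a rough illustration of the vector-wise int8 idea named in the LLM.int8() row, the toy sketch below quantizes a weight matrix with one scale per row and measures the rounding error; it is not the paper's implementation, which additionally keeps outlier feature dimensions in 16-bit.

```python
# Toy sketch of vector-wise int8 quantization: one scale per weight row, symmetric
# rounding, dequantize to float for the matmul. The real LLM.int8() scheme also
# carries outlier feature dimensions in 16-bit, which this sketch omits.
import torch

torch.manual_seed(0)
W = torch.randn(1024, 4096)                        # stand-in for a fine-tuned MLP weight
scale = W.abs().amax(dim=1, keepdim=True) / 127.0  # per-row ("vector-wise") scales
W_int8 = (W / scale).round().clamp(-127, 127).to(torch.int8)
W_hat = W_int8.to(torch.float32) * scale           # dequantized approximation of W
print("max abs rounding error:", (W - W_hat).abs().max().item())
```

The per-row scale keeps the quantization error proportional to each row's magnitude, which is why such 8-bit variants (e.g. the rank-11 RoBERTa entry at 90.2) track their full-precision baselines so closely.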