| Rank | Model | Score | Extra Training Data | Paper | Date | Code |
|------|-------|-------|---------------------|-------|------|------|
| 1 | PaLM 540B (finetuned) | 100 | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Code |
| 2 | Vega v2 6B (KD-based prompt transfer) | 99.4 | No | Toward Efficient Language Model Pretraining and ... | 2022-12-04 | - |
| 3 | ST-MoE-32B 269B (fine-tuned) | 99.2 | No | ST-MoE: Designing Stable and Transferable Sparse... | 2022-02-17 | Code |
| 4 | UL2 20B (fine-tuned) | 99 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 5 | DeBERTa-Ensemble | 98.4 | No | DeBERTa: Decoding-enhanced BERT with Disentangle... | 2020-06-05 | Code |
| 6 | Turing NLR v5 XXL 5.4B (fine-tuned) | 98.2 | No | Toward Efficient Language Model Pretraining and ... | 2022-12-04 | - |
| 7 | DeBERTa-1.5B | 96.8 | No | DeBERTa: Decoding-enhanced BERT with Disentangle... | 2020-06-05 | Code |
| 8 | PaLM 2-L (1-shot) | 96 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 9 | T5-XXL 11B (fine-tuned) | 94.8 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 10 | FLAN 137B (prompt-tuned) | 94 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 11 | GPT-3 175B (few-shot, k=32) | 92 | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 12 | T5-XL 3B (fine-tuned) | 92 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 13 | FLAN 137B (zero-shot) | 91 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 14 | ST-MoE-L 4.1B (fine-tuned) | 91 | No | ST-MoE: Designing Stable and Transferable Sparse... | 2022-02-17 | Code |
| 15 | GPT-3 175B (0-shot) | 91 | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 16 | T0-3B (CoT fine-tuned) | 90.9 | No | The CoT Collection: Improving Zero-shot and Few-... | 2023-05-23 | Code |
| 17 | RoBERTa-Winogrande-ft 355M (fine-tuned) | 90.6 | No | WinoGrande: An Adversarial Winograd Schema Chall... | 2019-07-24 | Code |
| 18 | PaLM 2-M (1-shot) | 90 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 19 | Flipped-3B | 89.88 | No | Guess the Instruction! Flipped Learning Makes La... | 2022-10-06 | Code |
| 20 | PaLM 2-S (1-shot) | 89 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 21 | GPT-NeoX (one-shot) | 88 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 22 | FLAN 137B (few-shot, k=16) | 87 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 23 | GPT-3 175B (1-shot) | 87 | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 24 | RoBERTa-ft 355M (fine-tuned) | 86.4 | No | WinoGrande: An Adversarial Winograd Schema Chall... | 2019-07-24 | Code |
| 25 | BloombergGPT (one-shot) | 86 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 26 | OPT 66B (one-shot) | 86 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 27 | GPT-3 13B (few-shot, k=32) | 86 | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 28 | KiC-770M | 85.3 | No | Knowledge-in-Context: Towards Knowledgeable Semi... | 2022-10-28 | - |
| 29 | UL2 20B (0-shot) | 85 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 30 | RoBERTa-Winogrande 355M (fine-tuned) | 84.4 | No | WinoGrande: An Adversarial Winograd Schema Chall... | 2019-07-24 | Code |
| 31 | Neo-6B (QA + WS) | 84 | No | Ask Me Anything: A simple strategy for prompting... | 2022-10-05 | Code |
| 32 | BLOOM 176B (one-shot) | 84 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 33 | T5-Large 770M (fine-tuned) | 83.4 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 34 | BERT-SocialIQA 340M | 83.4 | No | SocialIQA: Commonsense Reasoning about Social In... | 2019-04-22 | Code |
| 35 | Hybrid H3 2.7B (0-shot, logit scoring) | 81 | No | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 36 | BERT-large 340M | 80.8 | No | SocialIQA: Commonsense Reasoning about Social In... | 2019-04-22 | Code |
| 37 | RoE-3B | 79.25 | No | Exploring the Benefits of Training Expert Langua... | 2023-02-07 | Code |
| 38 | sMLP – deterministic 9.4B (0-shot) | 79 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 39 | KELM (finetuning BERT-large based single model) | 78 | No | KELM: Knowledge Enhanced Pre-Trained Language Re... | 2021-09-09 | Code |
| 40 | AlexaTM 20B | 78 | No | AlexaTM 20B: Few-Shot Learning Using a Large-Sca... | 2022-08-02 | Code |
| 41 | Neo-6B (few-shot) | 77 | No | Ask Me Anything: A simple strategy for prompting... | 2022-10-05 | Code |
| 42 | Hybrid H3 2.7B (3-shot, logit scoring) | 77 | No | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 43 | Causal Strength w/multi-word predicates (presumably on WinoGrande?) | 76.4 | No | WinoGrande: An Adversarial Winograd Schema Chall... | 2019-07-24 | Code |
| 44 | Gshard 9B | 76 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 45 | Switch Transformer 9B | 75 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 46 | GPT-3 Large 760M (0-shot) | 73 | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 47 | Causal Strength Computation w/multi-word predicates (on ClueWeb12) | 71.2 | No | - | - | - |
| 48 | T5-Base 220M (fine-tuned) | 71.2 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 49 | Causal Strength Computation (on Causal Net) | 70.2 | No | - | - | - |
| 50 | Causal Strength Computation (on ClueWeb12) | 69.9 | No | - | - | - |
| 51 | Hybrid H3 125M (0-shot, logit scoring) | 67 | No | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 52 | Hybrid H3 125M (0-shot, rank classification) | 67 | No | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 53 | Pointwise Mutual Information (on 10M stories) | 65.4 | No | WinoGrande: An Adversarial Winograd Schema Chall... | 2019-07-24 | Code |
| 54 | HASH Layers 10B (0-shot) | 64 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 55 | Base Layers 10B (0-shot) | 63 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 56 | N-Grammer 343M | 60 | No | N-Grammer: Augmenting Transformers with latent n... | 2022-07-13 | Code |
| 57 | Pointwise Mutual Information (on Project Gutenberg) | 58.8 | No | - | - | - |
| 58 | Neo-6B (QA) | 58.2 | No | Ask Me Anything: A simple strategy for prompting... | 2022-10-05 | Code |
| 59 | H3 125M (0-shot, rank classification) | 51 | No | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 60 | Random chance baseline | 50 | No | Back to Square One: Artifact Detection, Training... | 2021-04-16 | - |
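The Pointwise Mutual Information entries (ranks 53 and 57) score each candidate answer by how strongly its words co-occur with the premise's words in a large corpus, then pick the higher-scoring alternative. A minimal sketch of that idea on a toy corpus (the corpus, helper names, and averaging scheme here are illustrative assumptions, not the paper's exact setup):

```python
import math
from collections import Counter
from itertools import combinations

# Toy stand-in for the baseline's large story corpus; each entry is one
# tokenized "document" used to count co-occurrences.
corpus = [
    "the rain fell so the ground got wet".split(),
    "the sun shone so the ground got dry".split(),
    "the rain fell and people opened umbrellas".split(),
]

unigrams = Counter()   # token frequencies
pairs = Counter()      # unordered co-occurrence counts within a document
for doc in corpus:
    unigrams.update(doc)
    pairs.update(frozenset(p) for p in combinations(set(doc), 2))

n_uni = sum(unigrams.values())
n_pair = sum(pairs.values())

def pmi(x, y):
    """PMI(x, y) = log[ p(x, y) / (p(x) p(y)) ]; 0 if the pair never co-occurs."""
    joint = pairs[frozenset((x, y))]
    if joint == 0:
        return 0.0
    return math.log((joint / n_pair) /
                    ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))

def score(premise, choice):
    """Average PMI over all premise-word / choice-word pairs."""
    ps, cs = premise.split(), choice.split()
    return sum(pmi(p, c) for p in ps for c in cs) / (len(ps) * len(cs))

# Pick the alternative with the stronger corpus association to the premise.
premise = "rain fell"
choices = ["ground got wet", "ground got dry"]
best = max(choices, key=lambda c: score(premise, c))
```

With the toy corpus above, the rain-related alternative wins because "rain"/"fell" co-occur with "wet" but never with "dry". The large gap between this baseline (58.8-65.4) and the random-chance floor of 50 shows how much signal raw co-occurrence statistics capture, even without any learned model.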