| Rank | Model | Accuracy (%) | Extra Training Data | Paper | Date | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | PaLM 540B (fine-tuned) | 100 | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Code |
| 2 | Vega v2 6B (KD-based prompt transfer) | 98.6 | No | Toward Efficient Language Model Pretraining and ... | 2022-12-04 | - |
| 3 | UL2 20B (fine-tuned) | 98.1 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 4 | Turing NLR v5 XXL 5.4B (fine-tuned) | 97.3 | No | Toward Efficient Language Model Pretraining and ... | 2022-12-04 | - |
| 5 | ST-MoE-32B 269B (fine-tuned) | 96.6 | No | ST-MoE: Designing Stable and Transferable Sparse... | 2022-02-17 | Code |
| 6 | DeBERTa-1.5B | 95.9 | No | DeBERTa: Decoding-enhanced BERT with Disentangle... | 2020-06-05 | Code |
| 7 | T5-XXL 11B (fine-tuned) | 93.8 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 8 | ST-MoE-L 4.1B (fine-tuned) | 93.3 | No | ST-MoE: Designing Stable and Transferable Sparse... | 2022-02-17 | Code |
| 9 | RoBERTa-WinoGrande 355M | 90.1 | No | WinoGrande: An Adversarial Winograd Schema Chall... | 2019-07-24 | Code |
| 10 | Flan-T5 XXL (zero-shot) | 89.82 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 11 | PaLM 540B (5-shot) | 89.5 | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Code |
| 12 | PaLM 540B (0-shot) | 89.1 | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Code |
| 13 | PaLM 2-M (1-shot) | 88.1 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 14 | PaLM 2-L (1-shot) | 86.9 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 15 | FLAN 137B (prompt-tuned) | 86.5 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 16 | PaLM 540B (1-shot) | 86.3 | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Code |
| 17 | TTTTT 3B (fine-tuned) | 84.6 | No | TTTTTackling WinoGrande Schemas | 2020-03-18 | - |
| 18 | PaLM 2-S (1-shot) | 84.6 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 19 | RoBERTa-DPR 355M | 83.1 | No | WinoGrande: An Adversarial Winograd Schema Chall... | 2019-07-24 | Code |
| 20 | FLAN 137B (zero-shot) | 80.8 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 21 | GPT-3 175B (few-shot) | 80.1 | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 22 | RoBERTa-large + G-DAug-Inf | 80 | No | Generative Data Augmentation for Commonsense Rea... | 2020-04-24 | Code |
| 23 | UL2 20B (0-shot) | 79.9 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 24 | ALBERT-xxlarge 235M | 78.8 | No | Back to Square One: Artifact Detection, Training... | 2021-04-16 | - |
| 25 | Neo-6B (QA + WS) | 77.9 | No | Ask Me Anything: A simple strategy for prompting... | 2022-10-05 | Code |
| 26 | HNN | 75.1 | No | A Hybrid Neural Network Model for Commonsense Re... | 2019-07-27 | Code |
| 27 | Neo-6B (QA) | 74.7 | No | Ask Me Anything: A simple strategy for prompting... | 2022-10-05 | Code |
| 28 | RoBERTa-large 354M | 73.9 | No | Back to Square One: Artifact Detection, Training... | 2021-04-16 | - |
| 29 | GPT-2-XL 1.5B | 73.3 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 30 | BERTwiki 340M (fine-tuned on WSCR) | 72.5 | No | A Surprisingly Robust Trick for Winograd Schema ... | 2019-05-15 | Code |
| 31 | BERT-SocialIQA 340M | 72.5 | No | SocialIQA: Commonsense Reasoning about Social In... | 2019-04-22 | Code |
| 32 | BERT-large 340M (fine-tuned on WSCR) | 71.4 | No | A Surprisingly Robust Trick for Winograd Schema ... | 2019-05-15 | Code |
| 33 | GPT-2-XL 1.5B | 70.7 | No | - | - | Code |
| 34 | BERTwiki 340M (fine-tuned on half of WSCR) | 70.3 | No | A Surprisingly Robust Trick for Winograd Schema ... | 2019-05-15 | Code |
| 35 | LaMini-GPT 1.5B | 69.6 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 36 | GPT-2 Medium 774M (partial scoring) | 69.2 | No | How Reasonable are Common-Sense Reasoning Tasks:... | 2018-11-05 | Code |
| 37 | N-Grammer 343M | 68.3 | No | N-Grammer: Augmenting Transformers with latent n... | 2022-07-13 | Code |
| 38 | AlexaTM 20B | 68.3 | No | AlexaTM 20B: Few-Shot Learning Using a Large-Sca... | 2022-08-02 | Code |
| 39 | BERT-large 340M | 67 | No | SocialIQA: Commonsense Reasoning about Social In... | 2019-04-22 | Code |
| 40 | T5-Large 738M | 66.7 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 41 | T0-3B (CoT fine-tuned) | 66 | No | The CoT Collection: Improving Zero-shot and Few-... | 2023-05-23 | Code |
| 42 | KiC-770M | 65.4 | No | Knowledge-in-Context: Towards Knowledgeable Semi... | 2022-10-28 | - |
| 43 | GPT-2 Medium 774M (full scoring) | 64.5 | No | How Reasonable are Common-Sense Reasoning Tasks:... | 2018-11-05 | Code |
| 44 | LaMini-F-T5 783M | 64.1 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 45 | Ensemble of 14 LMs | 63.7 | No | A Simple Method for Commonsense Reasoning | 2018-06-07 | Code |
| 46 | H3 125M (3-shot, rank classification) | 63.5 | No | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 47 | DSSM | 63 | No | Unsupervised Deep Structured Semantic Models for... | 2019-04-03 | - |
| 48 | RoBERTa-base 125M | 63 | No | Back to Square One: Artifact Detection, Training... | 2021-04-16 | - |
| 49 | Word-level CNN+LSTM (partial scoring) | 62.6 | No | A Simple Method for Commonsense Reasoning | 2018-06-07 | Code |
| 50 | UDSSM-II (ensemble) | 62.4 | No | Unsupervised Deep Structured Semantic Models for... | 2019-04-03 | - |
| 51 | BERT-base 110M (fine-tuned on WSCR) | 62.3 | No | A Surprisingly Robust Trick for Winograd Schema ... | 2019-05-15 | Code |
| 52 | RoE-3B | 62.21 | No | Exploring the Benefits of Training Expert Langua... | 2023-02-07 | Code |
| 53 | BERT-large 340M | 62 | No | BERT: Pre-training of Deep Bidirectional Transfo... | 2018-10-11 | Code |
| 54 | GPT-2 Small 117M (partial scoring) | 61.5 | No | How Reasonable are Common-Sense Reasoning Tasks:... | 2018-11-05 | Code |
| 55 | H3 125M (0-shot, rank classification) | 61.5 | No | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 56 | BERT-large 340M | 61.4 | No | Back to Square One: Artifact Detection, Training... | 2021-04-16 | - |
| 57 | BERT-base 110M + MAS | 60.3 | No | Attention Is (not) All You Need for Commonsense ... | 2019-05-31 | Code |
| 58 | longdoc S (OntoNotes + PreCo + LitBank) | 60.1 | No | On Generalization in Coreference Resolution | 2021-09-20 | Code |
| 59 | longdoc S (ON + PreCo + LitBank + 30k pseudo-singletons) | 59.4 | No | On Generalization in Coreference Resolution | 2021-09-20 | Code |
| 60 | UDSSM-II | 59.2 | No | Unsupervised Deep Structured Semantic Models for... | 2019-04-03 | - |
| 61 | LaMini-T5 738M | 59 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 62 | Flipped-3B | 58.37 | No | Guess the Instruction! Flipped Learning Makes La... | 2022-10-06 | Code |
| 63 | KEE+NKAM (WSC2016 winner) | 58.3 | No | Commonsense Knowledge Enhanced Embeddings for So... | 2016-11-13 | - |
| 64 | Char-level CNN+LSTM (partial scoring) | 57.9 | No | A Simple Method for Commonsense Reasoning | 2018-06-07 | Code |
| 65 | UDSSM-I (ensemble) | 57.1 | No | Unsupervised Deep Structured Semantic Models for... | 2019-04-03 | - |
| 66 | Knowledge Hunter | 57.1 | No | A Knowledge Hunting Framework for Common Sense R... | 2018-10-02 | - |
| 67 | WKH | 57.1 | No | WinoGrande: An Adversarial Winograd Schema Chall... | 2019-07-24 | Code |
| 68 | BERT-base 110M | 56.5 | No | Back to Square One: Artifact Detection, Training... | 2021-04-16 | - |
| 69 | GPT-2 Small 117M (full scoring) | 55.7 | No | How Reasonable are Common-Sense Reasoning Tasks:... | 2018-11-05 | Code |
| 70 | ALBERT-base 11M | 55.4 | No | Back to Square One: Artifact Detection, Training... | 2021-04-16 | - |
| 71 | Pythia 12B (0-shot) | 54.8 | No | Pythia: A Suite for Analyzing Large Language Mod... | 2023-04-03 | Code |
| 72 | UDSSM-I | 54.5 | No | Unsupervised Deep Structured Semantic Models for... | 2019-04-03 | - |
| 73 | Subword-level Transformer LM | 54.1 | No | Attention Is All You Need | 2017-06-12 | Code |
| 74 | USSM + Supervised DeepNet + KB | 52.8 | No | Attention Is (not) All You Need for Commonsense ... | 2019-05-31 | Code |
| 75 | KEE+NKAM on WinoGrande | 52.8 | No | WinoGrande: An Adversarial Winograd Schema Chall... | 2019-07-24 | Code |
| 76 | USSM + KB | 52 | No | Attention Is (not) All You Need for Commonsense ... | 2019-05-31 | Code |
| 77 | Random chance baseline | 50 | No | Back to Square One: Artifact Detection, Training... | 2021-04-16 | - |
| 78 | Hybrid H3 125M (3-shot, logit scoring) | 43.3 | No | Hungry Hungry Hippos: Towards Language Modeling ... | 2022-12-28 | Code |
| 79 | Pythia 2.8B (0-shot) | 38.5 | No | Pythia: A Suite for Analyzing Large Language Mod... | 2023-04-03 | Code |
| 80 | Neo-6B (few-shot) | 36.5 | No | Ask Me Anything: A simple strategy for prompting... | 2022-10-05 | Code |
| 81 | Pythia 6.9B (0-shot) | 36.5 | No | Pythia: A Suite for Analyzing Large Language Mod... | 2023-04-03 | Code |
| 82 | Pythia 12B (5-shot) | 36.5 | No | Pythia: A Suite for Analyzing Large Language Mod... | 2023-04-03 | Code |