Coreference Resolution on Winograd Schema Challenge
Metric: Accuracy (higher is better)
Results
| # | Model | Accuracy | SOTA | Extra Data | Paper | Date | Code |
|---|-------|----------|------|------------|-------|------|------|
| 1 | PaLM 540B (fine-tuned) | 100 | Yes | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Yes |
| 2 | Vega v2 6B (KD-based prompt transfer) | 98.6 | | No | Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | 2022-12-04 | - |
| 3 | UL2 20B (fine-tuned) | 98.1 | | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Yes |
| 4 | Turing NLR v5 XXL 5.4B (fine-tuned) | 97.3 | | No | Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE | 2022-12-04 | - |
| 5 | ST-MoE-32B 269B (fine-tuned) | 96.6 | Yes | No | ST-MoE: Designing Stable and Transferable Sparse Expert Models | 2022-02-17 | Yes |
| 6 | DeBERTa-1.5B | 95.9 | Yes | No | DeBERTa: Decoding-enhanced BERT with Disentangled Attention | 2020-06-05 | Yes |
| 7 | T5-XXL 11B (fine-tuned) | 93.8 | Yes | No | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | 2019-10-23 | Yes |
| 8 | ST-MoE-L 4.1B (fine-tuned) | 93.3 | | No | ST-MoE: Designing Stable and Transferable Sparse Expert Models | 2022-02-17 | Yes |
| 9 | RoBERTa-WinoGrande 355M | 90.1 | Yes | No | WinoGrande: An Adversarial Winograd Schema Challenge at Scale | 2019-07-24 | Yes |
| 10 | Flan-T5 XXL (zero-shot) | 89.82 | | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Yes |
| 11 | PaLM 540B (5-shot) | 89.5 | | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Yes |
| 12 | PaLM 540B (0-shot) | 89.1 | | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Yes |
| 13 | PaLM 2-M (1-shot) | 88.1 | | No | PaLM 2 Technical Report | 2023-05-17 | Yes |
| 14 | PaLM 2-L (1-shot) | 86.9 | | No | PaLM 2 Technical Report | 2023-05-17 | Yes |
| 15 | FLAN 137B (prompt-tuned) | 86.5 | | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Yes |
| 16 | PaLM 540B (1-shot) | 86.3 | | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Yes |
| 17 | TTTTT 3B (fine-tuned) | 84.6 | | No | TTTTTackling WinoGrande Schemas | 2020-03-18 | - |
| 18 | PaLM 2-S (1-shot) | 84.6 | | No | PaLM 2 Technical Report | 2023-05-17 | Yes |
| 19 | RoBERTa-DPR 355M | 83.1 | | No | WinoGrande: An Adversarial Winograd Schema Challenge at Scale | 2019-07-24 | Yes |
| 20 | FLAN 137B (zero-shot) | 80.8 | | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Yes |
| 21 | GPT-3 175B (few-shot) | 80.1 | | No | Language Models are Few-Shot Learners | 2020-05-28 | Yes |
| 22 | RoBERTa-large + G-DAug-Inf | 80 | | No | Generative Data Augmentation for Commonsense Reasoning | 2020-04-24 | Yes |
| 23 | UL2 20B (0-shot) | 79.9 | | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Yes |
| 24 | ALBERT-xxlarge 235M | 78.8 | | No | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | 2021-04-16 | - |
| 25 | Neo-6B (QA + WS) | 77.9 | | No | Ask Me Anything: A simple strategy for prompting language models | 2022-10-05 | Yes |
| 26 | HNN | 75.1 | | No | A Hybrid Neural Network Model for Commonsense Reasoning | 2019-07-27 | Yes |
| 27 | Neo-6B (QA) | 74.7 | | No | Ask Me Anything: A simple strategy for prompting language models | 2022-10-05 | Yes |
| 28 | RoBERTa-large 354M | 73.9 | | No | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | 2021-04-16 | - |
| 29 | GPT-2-XL 1.5B | 73.3 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Yes |
| 30 | BERTwiki 340M (fine-tuned on WSCR) | 72.5 | | No | A Surprisingly Robust Trick for Winograd Schema Challenge | 2019-05-15 | Yes |
| 31 | BERT-SocialIQA 340M | 72.5 | Yes | No | SocialIQA: Commonsense Reasoning about Social Interactions | 2019-04-22 | Yes |
| 32 | BERT-large 340M (fine-tuned on WSCR) | 71.4 | | No | A Surprisingly Robust Trick for Winograd Schema Challenge | 2019-05-15 | Yes |
| 33 | GPT-2-XL 1.5B | 70.7 | | No | - | - | Yes |
| 34 | BERTwiki 340M (fine-tuned on half of WSCR) | 70.3 | | No | A Surprisingly Robust Trick for Winograd Schema Challenge | 2019-05-15 | Yes |
| 35 | LaMini-GPT 1.5B | 69.6 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Yes |
| 36 | GPT-2 Medium 774M (partial scoring) | 69.2 | Yes | No | How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG | 2018-11-05 | Yes |
| 37 | N-Grammer 343M | 68.3 | | No | N-Grammer: Augmenting Transformers with latent n-grams | 2022-07-13 | Yes |
| 38 | AlexaTM 20B | 68.3 | | No | AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model | 2022-08-02 | Yes |
| 39 | BERT-large 340M | 67 | | No | SocialIQA: Commonsense Reasoning about Social Interactions | 2019-04-22 | Yes |
| 40 | T5-Large 738M | 66.7 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Yes |
| 41 | T0-3B (CoT fine-tuned) | 66 | | No | The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning | 2023-05-23 | Yes |
| 42 | KiC-770M | 65.4 | | No | Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models | 2022-10-28 | - |
| 43 | GPT-2 Medium 774M (full scoring) | 64.5 | | No | How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG | 2018-11-05 | Yes |
| 44 | LaMini-F-T5 783M | 64.1 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Yes |
| 45 | Ensemble of 14 LMs | 63.7 | Yes | No | A Simple Method for Commonsense Reasoning | 2018-06-07 | Yes |
| 46 | H3 125M (3-shot, rank classification) | 63.5 | | No | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Yes |
| 47 | DSSM | 63 | | No | Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | 2019-04-03 | - |
| 48 | RoBERTa-base 125M | 63 | | No | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | 2021-04-16 | - |
| 49 | Word-level CNN+LSTM (partial scoring) | 62.6 | | No | A Simple Method for Commonsense Reasoning | 2018-06-07 | Yes |
| 50 | UDSSM-II (ensemble) | 62.4 | | No | Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | 2019-04-03 | - |
| 51 | BERT-base 110M (fine-tuned on WSCR) | 62.3 | | No | A Surprisingly Robust Trick for Winograd Schema Challenge | 2019-05-15 | Yes |
| 52 | RoE-3B | 62.21 | | No | Exploring the Benefits of Training Expert Language Models over Instruction Tuning | 2023-02-07 | Yes |
| 53 | BERT-large 340M | 62 | | No | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 2018-10-11 | Yes |
| 54 | GPT-2 Small 117M (partial scoring) | 61.5 | | No | How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG | 2018-11-05 | Yes |
| 55 | H3 125M (0-shot, rank classification) | 61.5 | | No | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Yes |
| 56 | BERT-large 340M | 61.4 | | No | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | 2021-04-16 | - |
| 57 | BERT-base 110M + MAS | 60.3 | | No | Attention Is (not) All You Need for Commonsense Reasoning | 2019-05-31 | Yes |
| 58 | longdoc S (OntoNotes + PreCo + LitBank) | 60.1 | | No | On Generalization in Coreference Resolution | 2021-09-20 | Yes |
| 59 | longdoc S (ON + PreCo + LitBank + 30k pseudo-singletons) | 59.4 | | No | On Generalization in Coreference Resolution | 2021-09-20 | Yes |
| 60 | UDSSM-II | 59.2 | | No | Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | 2019-04-03 | - |
| 61 | LaMini-T5 738M | 59 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Yes |
| 62 | Flipped-3B | 58.37 | | No | Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners | 2022-10-06 | Yes |
| 63 | KEE+NKAM winner of the WSC2016 | 58.3 | Yes | No | Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge | 2016-11-13 | - |
| 64 | Char-level CNN+LSTM (partial scoring) | 57.9 | | No | A Simple Method for Commonsense Reasoning | 2018-06-07 | Yes |
| 65 | UDSSM-I (ensemble) | 57.1 | | No | Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | 2019-04-03 | - |
| 66 | Knowledge Hunter | 57.1 | | No | A Knowledge Hunting Framework for Common Sense Reasoning | 2018-10-02 | - |
| 67 | WKH | 57.1 | | No | WinoGrande: An Adversarial Winograd Schema Challenge at Scale | 2019-07-24 | Yes |
| 68 | BERT-base 110M | 56.5 | | No | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | 2021-04-16 | - |
| 69 | GPT-2 Small 117M (full scoring) | 55.7 | | No | How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG | 2018-11-05 | Yes |
| 70 | ALBERT-base 11M | 55.4 | | No | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | 2021-04-16 | - |
| 71 | Pythia 12B (0-shot) | 54.8 | | No | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | 2023-04-03 | Yes |
| 72 | UDSSM-I | 54.5 | | No | Unsupervised Deep Structured Semantic Models for Commonsense Reasoning | 2019-04-03 | - |
| 73 | Subword-level Transformer LM | 54.1 | | No | Attention Is All You Need | 2017-06-12 | Yes |
| 74 | USSM + Supervised DeepNet + KB | 52.8 | | No | Attention Is (not) All You Need for Commonsense Reasoning | 2019-05-31 | Yes |
| 75 | KEE+NKAM on WinoGrande | 52.8 | | No | WinoGrande: An Adversarial Winograd Schema Challenge at Scale | 2019-07-24 | Yes |
| 76 | USSM + KB | 52 | | No | Attention Is (not) All You Need for Commonsense Reasoning | 2019-05-31 | Yes |
| 77 | Random chance baseline | 50 | | No | Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema | 2021-04-16 | - |
| 78 | Hybrid H3 125M (3-shot, logit scoring) | 43.3 | | No | Hungry Hungry Hippos: Towards Language Modeling with State Space Models | 2022-12-28 | Yes |
| 79 | Pythia 2.8B (0-shot) | 38.5 | | No | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | 2023-04-03 | Yes |
| 80 | Neo-6B (few-shot) | 36.5 | | No | Ask Me Anything: A simple strategy for prompting language models | 2022-10-05 | Yes |
| 81 | Pythia 6.9B (0-shot) | 36.5 | | No | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | 2023-04-03 | Yes |
| 82 | Pythia 12B (5-shot) | 36.5 | | No | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | 2023-04-03 | Yes |