Question Answering on OpenBookQA
Metric: Accuracy (higher is better)
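As a minimal sketch of the metric, assuming accuracy here means the percentage of multiple-choice questions answered correctly: the function name `accuracy_percent` and the sample answer keys below are hypothetical and only for illustration. OpenBookQA questions have four answer choices, so uniform random guessing is expected to score about 25, which matches the random chance baseline listed in the results.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the share of correctly answered questions as a percentage."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    # A prediction counts as correct only if it matches the gold label exactly.
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Hypothetical answer keys for four questions with choices A-D.
gold = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]
print(accuracy_percent(predictions, gold))  # 75.0
```

Scores reported in different papers may still differ in prompting setup (for example 1-shot versus few-shot), so equal metric definitions do not guarantee directly comparable rows.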
Results
| # | Model | Accuracy (%) | SOTA | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | GPT-4 + knowledge base | 95.9 | | No | - | - | - |
| 2 | MVP-Tuning (ensemble) | 95.2 | | No | - | - | - |
| 3 | PaLM 540B (Self Improvement, Self Consistency) | 94.4 | SOTA | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 4 | X-Reasoner | 94.2 | | No | - | - | - |
| 5 | PaLM 540B (Self Improvement, CoT Prompting) | 93 | | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 6 | PaLM 540B (Self Improvement, Standard-Prompting) | 92 | | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 7 | DeBERTa-xxlarge 1.5B + MVP-Tuning | 91.3 | | No | - | - | - |
| 8 | PaLM 540B (Self Consistency) | 90 | | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 9 | GrapeQA: PEGA+CANP | 90 | | No | GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering | 2023-03-22 | - |
| 10 | GenMC 11B | 89.8 | SOTA | No | Clues Before Answers: Generation-Enhanced Multiple-Choice QA | 2022-04-30 | Code |
| 11 | AristoRoBERTa + MVP-Tuning | 87.6 | | No | - | - | - |
| 12 | AristoRoBERTa + Graph Soft Counter | 87.4 | SOTA | No | GNN is a Counter? Revisiting GNN for Question Answering | 2021-10-07 | - |
| 13 | UnifiedQA 11B | 87.2 | SOTA | No | UnifiedQA: Crossing Format Boundaries With a Single QA System | 2020-05-02 | Code |
| 14 | LLaMA-3 8B + MoSLoRA | 86.8 | | No | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Code |
| 15 | PaLM 540B (CoT Prompting) | 86.4 | | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 16 | LLaMA-3 8B + MixLoRA | 84.8 | | No | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 2024-04-22 | Code |
| 17 | PaLM 540B (Standard-Prompting) | 84.4 | | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 18 | TTTTT 3B | 83.2 | | No | Fusing Context Into Knowledge Graph for Commonsense Question Answering | 2020-12-09 | Code |
| 19 | LLaMA-2 13B + MixLoRA | 83 | | No | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 2024-04-22 | Code |
| 20 | AristoRoBERTa + QA-GNN | 82.8 | | No | QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering | 2021-04-13 | Code |
| 21 | QA-GNN | 82.8 | | No | QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering | 2021-04-13 | Code |
| 22 | DEKCOR | 82.4 | | No | Fusing Context Into Knowledge Graph for Commonsense Question Answering | 2020-12-09 | Code |
| 23 | GrapeQA: PEGA | 82 | | No | GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering | 2023-03-22 | - |
| 24 | LLaMA-2 7B + MixLoRA | 81.6 | | No | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 2024-04-22 | Code |
| 25 | AristoRoBERTa | 77.8 | | No | QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering | 2021-04-13 | Code |
| 26 | BiLSTM max-out question-match (science fact + common knowledge fact) | 76.9 | SOTA | No | Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering | 2018-09-08 | Code |
| 27 | Careful Selection | 72 | | No | Careful Selection of Knowledge to solve Open Book Question Answering | 2019-07-24 | - |
| 28 | GrapeQA: CANP | 66.2 | | No | GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering | 2023-03-22 | - |
| 29 | GPT-3 175B (few-shot, k=32) | 65.4 | | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 30 | PaLM 2-L (1-shot) | 58.5 | | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 31 | OPT 66B (one-shot) | 58 | | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 32 | PaLM 2-S (1-shot) | 57.4 | | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 33 | BiLSTM max-out question-match (WordNet + science fact) | 56.3 | | No | Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering | 2018-09-08 | Code |
| 34 | PaLM 2-M (1-shot) | 56.2 | | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 35 | BiLSTM max-out question-match (with a science fact) | 55.8 | | No | Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering | 2018-09-08 | Code |
| 36 | Bloomberg GPT 50B (1-shot) | 51.6 | | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 37 | BLOOM 176B (2-shot) | 47.2 | | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 38 | GPT-NeoX 50B (2-shot) | 44.2 | | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 39 | LaMini-GPT 1.5B | 39.8 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Code |
| 40 | LaMini-T5 738M | 36 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Code |
| 41 | LaMini-F-T5 783M | 34 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Code |
| 42 | T5-Large 738M | 32.8 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Code |
| 43 | GPT-2-XL 1.5B | 32 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Code |
| 44 | FLAN-T5-Large 783M | 31.2 | | No | LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions | 2023-04-27 | Code |
| 45 | Random chance baseline | 25 | | No | HellaSwag: Can a Machine Really Finish Your Sentence? | 2019-05-19 | Code |