Question Answering on OpenBookQA

Metric: Accuracy (higher is better)
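
As a minimal sketch of how an accuracy figure like those below could be computed: the function name `accuracy_percent` and the toy labels are hypothetical, and the leaderboard does not state which evaluation script each entry used. The sketch assumes each question has a single gold option label and that accuracy is reported as a percentage.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of questions whose predicted label equals the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact matches between predicted and gold option labels.
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Hypothetical toy data: option labels for five questions (not real model output).
gold_labels = ["A", "C", "B", "D", "A"]
model_predictions = ["A", "C", "D", "D", "B"]
print(accuracy_percent(model_predictions, gold_labels))  # 60.0
```

With four answer options per question, as in OpenBookQA, uniform guessing has an expected accuracy of 100 / 4 = 25, which matches the random chance baseline row in the results.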

Results

#	Model	Accuracy	Extra Data	Paper	Date	Code	SOTA
1	GPT-4 + knowledge base	95.9	No	-	-	-	-
2	MVP-Tuning (ensemble)	95.2	No	-	-	-	-
3	PaLM 540B (Self Improvement, Self Consistency)	94.4	No	Large Language Models Can Self-Improve	2022-10-20	-	Yes
4	X-Reasoner	94.2	No	-	-	-	-
5	PaLM 540B (Self Improvement, CoT Prompting)	93	No	Large Language Models Can Self-Improve	2022-10-20	-	-
6	PaLM 540B (Self Improvement, Standard-Prompting)	92	No	Large Language Models Can Self-Improve	2022-10-20	-	-
7	DeBERTa-xxlarge 1.5B + MVP-Tuning	91.3	No	-	-	-	-
8	PaLM 540B (Self Consistency)	90	No	Large Language Models Can Self-Improve	2022-10-20	-	-
9	GrapeQA: PEGA+CANP	90	No	GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering	2023-03-22	-	-
10	GenMC 11B	89.8	No	Clues Before Answers: Generation-Enhanced Multiple-Choice QA	2022-04-30	Code	Yes
11	AristoRoBERTa + MVP-Tuning	87.6	No	-	-	-	-
12	AristoRoBERTa + Graph Soft Counter	87.4	No	GNN is a Counter? Revisiting GNN for Question Answering	2021-10-07	-	Yes
13	UnifiedQA 11B	87.2	No	UnifiedQA: Crossing Format Boundaries With a Single QA System	2020-05-02	Code	Yes
14	LLaMA-3 8B+MoSLoRA	86.8	No	Mixture-of-Subspaces in Low-Rank Adaptation	2024-06-16	Code	-
15	PaLM 540B (CoT Prompting)	86.4	No	Large Language Models Can Self-Improve	2022-10-20	-	-
16	LLaMA-3 8B + MixLoRA	84.8	No	MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	2024-04-22	Code	-
17	PaLM 540B (Standard-Prompting)	84.4	No	Large Language Models Can Self-Improve	2022-10-20	-	-
18	TTTTT 3B	83.2	No	Fusing Context Into Knowledge Graph for Commonsense Question Answering	2020-12-09	Code	-
19	LLaMA-2 13B + MixLoRA	83	No	MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	2024-04-22	Code	-
20	AristoRoBERTa + QA-GNN	82.8	No	QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering	2021-04-13	Code	-
21	QA-GNN	82.8	No	QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering	2021-04-13	Code	-
22	DEKCOR	82.4	No	Fusing Context Into Knowledge Graph for Commonsense Question Answering	2020-12-09	Code	-
23	GrapeQA: PEGA	82	No	GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering	2023-03-22	-	-
24	LLaMA-2 7B + MixLoRA	81.6	No	MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts	2024-04-22	Code	-
25	AristoRoBERTa	77.8	No	QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering	2021-04-13	Code	-
26	BiLSTM max-out question-match (science fact + common knowledge fact)	76.9	No	Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering	2018-09-08	Code	Yes
27	Careful Selection	72	No	Careful Selection of Knowledge to solve Open Book Question Answering	2019-07-24	-	-
28	GrapeQA: CANP	66.2	No	GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering	2023-03-22	-	-
29	GPT-3 175B (few-shot, k=32)	65.4	No	Language Models are Few-Shot Learners	2020-05-28	Code	-
30	PaLM 2-L (1-shot)	58.5	No	PaLM 2 Technical Report	2023-05-17	Code	-
31	OPT 66B (one-shot)	58	No	BloombergGPT: A Large Language Model for Finance	2023-03-30	Code	-
32	PaLM 2-S (1-shot)	57.4	No	PaLM 2 Technical Report	2023-05-17	Code	-
33	BiLSTM max-out question-match (WordNet + science fact)	56.3	No	Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering	2018-09-08	Code	-
34	PaLM 2-M (1-shot)	56.2	No	PaLM 2 Technical Report	2023-05-17	Code	-
35	BiLSTM max-out question-match (with a science fact)	55.8	No	Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering	2018-09-08	Code	-
36	Bloomberg GPT 50B (1-shot)	51.6	No	BloombergGPT: A Large Language Model for Finance	2023-03-30	Code	-
37	BLOOM 176B (2-shot)	47.2	No	BloombergGPT: A Large Language Model for Finance	2023-03-30	Code	-
38	GPT-NeoX 50B (2-shot)	44.2	No	BloombergGPT: A Large Language Model for Finance	2023-03-30	Code	-
39	LaMini-GPT 1.5B	39.8	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
40	LaMini-T5 738M	36	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
41	LaMini-F-T5 783M	34	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
42	T5-Large 738M	32.8	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
43	GPT-2-XL 1.5B	32	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
44	FLAN-T5-Large 783M	31.2	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
45	Random chance baseline	25	No	HellaSwag: Can a Machine Really Finish Your Sentence?	2019-05-19	Code	-