Question Answering on TruthfulQA

Metric: MC1 (higher is better)
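In the TruthfulQA multiple-choice setting, each question has several answer choices and exactly one of them is true. A model "picks" the choice it assigns the highest log-probability as a continuation of the question. MC1 is the fraction of questions where that pick is the true answer. The minimal sketch below computes that accuracy from per-choice scores that have already been computed. The function name `mc1_score` and its input layout are hypothetical, and getting the log-probabilities out of a real model (prompt format, tokenization, any length normalization) is outside this sketch.

```python
from typing import Sequence


def mc1_score(choice_logprobs: Sequence[Sequence[float]],
              correct_idx: Sequence[int]) -> float:
    """Fraction of questions where the top-scoring choice is the single true answer.

    Hypothetical input layout (not a real library API):
      choice_logprobs[q][c] = model log-probability of answer choice c for question q
      correct_idx[q]        = index of the one true answer for question q
    """
    if len(choice_logprobs) != len(correct_idx):
        raise ValueError("need exactly one correct index per question")
    if not choice_logprobs:
        raise ValueError("no questions to score")

    hits = 0
    for scores, gold in zip(choice_logprobs, correct_idx):
        # The model's selection is the choice with the highest score.
        # On ties, max() keeps the first (lowest-index) maximum.
        predicted = max(range(len(scores)), key=lambda c: scores[c])
        hits += int(predicted == gold)
    return hits / len(choice_logprobs)


if __name__ == "__main__":
    # Toy scores for three questions; these are NOT real model outputs.
    logprobs = [
        [-4.2, -1.3, -6.0, -2.8],  # top choice: index 1
        [-0.9, -2.5, -3.1],        # top choice: index 0
        [-5.0, -4.1, -3.7, -7.2],  # top choice: index 2
    ]
    gold = [1, 2, 2]
    print(mc1_score(logprobs, gold))  # → 0.6666666666666666
```

Two of the three toy questions are answered correctly, so the score is 2/3. The leaderboard values below are this kind of accuracy, reported as a fraction between 0 and 1.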

Results

#	Model	MC1	Extra Data	Paper	Date	Code
1	GPT-4 (RLHF)	0.59	No	GPT-4 Technical Report	2023-03-15	Yes
2	Mistral-7B-Instruct-v0.2 + TruthX	0.56	No	TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space	2024-02-27	Yes
3	LLaMA-2-7B-Chat + TruthX	0.54	No	TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space	2024-02-27	Yes
4	LLaMA-2-Chat-13B + Representation Control (Contrast Vector)	0.54	No	Representation Engineering: A Top-Down Approach to AI Transparency	2023-10-02	Yes
5	LLaMA-2-Chat-7B + Representation Control (Contrast Vector)	0.48	No	Representation Engineering: A Top-Down Approach to AI Transparency	2023-10-02	Yes
6	Vicuna 7B + Inference Time Intervention (ITI)	0.389	No	-	-	-
7	Alpaca 7B + Inference Time Intervention (ITI)	0.319	No	-	-	-
8	Gopher 280B (zero-shot, Our Prompt + Choices)	0.295	No	Scaling Language Models: Methods, Analysis & Insights from Training Gopher	2021-12-08	Yes
9	LLaMA 7B + Inference Time Intervention (ITI)	0.288	No	-	-	-
10	GAL 120B	0.26	No	Galactica: A Large Language Model for Science	2022-11-16	Yes
11	Gopher 7.1B (zero-shot, QA prompts)	0.25	No	Scaling Language Models: Methods, Analysis & Insights from Training Gopher	2021-12-08	Yes
12	GAL 30B	0.24	No	Galactica: A Large Language Model for Science	2022-11-16	Yes
13	Gopher 7.1B (zero-shot, Our Prompt + Choices)	0.23	No	Scaling Language Models: Methods, Analysis & Insights from Training Gopher	2021-12-08	Yes
14	Gopher 1.4B (zero-shot, QA prompts)	0.23	No	Scaling Language Models: Methods, Analysis & Insights from Training Gopher	2021-12-08	Yes
15	GPT-2 1.5B	0.22	No	TruthfulQA: Measuring How Models Mimic Human Falsehoods	2021-09-08	Yes
16	Gopher 1.4B (zero-shot, Our Prompt + Choices)	0.217	No	Scaling Language Models: Methods, Analysis & Insights from Training Gopher	2021-12-08	Yes
17	GPT-3 175B	0.21	No	TruthfulQA: Measuring How Models Mimic Human Falsehoods	2021-09-08	Yes
18	OPT 175B	0.21	No	Galactica: A Large Language Model for Science	2022-11-16	Yes
19	GPT-J 6B	0.2	No	TruthfulQA: Measuring How Models Mimic Human Falsehoods	2021-09-08	Yes
20	UnifiedQA 3B	0.19	No	TruthfulQA: Measuring How Models Mimic Human Falsehoods	2021-09-08	Yes
21	GAL 125M	0.19	No	Galactica: A Large Language Model for Science	2022-11-16	Yes
22	GAL 1.3B	0.19	No	Galactica: A Large Language Model for Science	2022-11-16	Yes
23	GAL 6.7B	0.19	No	Galactica: A Large Language Model for Science	2022-11-16	Yes
