Question Answering on Natural Questions

Metric: Exact Match (EM, %; higher is better)
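
For readers unfamiliar with the metric, the sketch below shows one common way exact match is computed for open-domain QA. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow the widely used SQuAD-style convention and are an assumption here; individual papers on this leaderboard may normalize differently. The function names `normalize_answer`, `exact_match`, and `em_score` are illustrative, not taken from any listed codebase.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization (assumed convention, not confirmed by this page)."""
    text = text.lower()
    # Drop ASCII punctuation characters.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Replace standalone English articles with a space.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of questions whose prediction exactly matches a reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this convention a prediction such as "The Eiffel Tower." matches the reference "Eiffel Tower", while a near miss such as "1990" against "1991" scores zero; EM gives no partial credit.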

Results

#	Model	EM	Extra Data	SOTA badge	Paper	Date	Code
1	Atlas (full, Wiki-dec-2018 index)	64	No	Yes	Atlas: Few-shot Learning with Retrieval Augmented Language Models	2022-08-05	Code
2	Atlas (full, Wiki-dec-2021+CC index)	60.4	No	No	Atlas: Few-shot Learning with Retrieval Augmented Language Models	2022-08-05	Code
3	DPA-RAG	59.19	No	No	Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation	2024-06-26	Code
4	FiE	58.4	No	Yes	0.8% Nyquist computational ghost imaging via non-experimental deep learning	2021-08-17	-
5	R2-D2 (full)	55.9	No	No	R2-D2: A Modular Baseline for Open-Domain Question Answering	2021-09-08	Code
6	ReAtt	54.7	No	No	Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer	2022-12-05	Code
7	FiD-KD (full)	54.7	No	Yes	Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering	2020-07-02	Code
8	RankRAG-llama3-70b (Zero-Shot, KILT)	54.2	Yes	No	RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	2024-07-02	-
9	EMDR^2	52.5	No	No	End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering	2021-06-09	Code
10	FID (full)	51.4	No	No	Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering	2020-07-02	Code
11	RankRAG-llama3-8b (Zero-Shot, KILT)	50.6	Yes	No	RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	2024-07-02	-
12	RankRAG-llama3-70b (Zero-Shot, DPR)	50	Yes	No	RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	2024-07-02	-
13	ChatQA-1.5-llama3-70b (Zero-Shot, KILT)	47	Yes	No	ChatQA: Surpassing GPT-4 on Conversational QA and RAG	2024-01-18	-
14	RankRAG-llama3-8b (Zero-Shot, DPR)	46.1	Yes	No	RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs	2024-07-02	-
15	RETRO + DPR (full)	45.5	No	No	Improving language models by retrieving from trillions of tokens	2021-12-08	Code
16	code-davinci-002 175B + REPLUG LSR (few-shot)	45.5	No	No	REPLUG: Retrieval-Augmented Black-Box Language Models	2023-01-30	Code
17	Atlas (few-shot, k=64, Wiki-Dec-2018 index)	45.1	No	No	Atlas: Few-shot Learning with Retrieval Augmented Language Models	2022-08-05	Code
18	code-davinci-002 175B + REPLUG (few-shot)	44.7	No	No	REPLUG: Retrieval-Augmented Black-Box Language Models	2023-01-30	Code
19	RAG	44.5	No	Yes	Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks	2020-05-22	Code
20	ChatQA-1.5-llama3-8b (Zero-Shot, KILT)	42.7	Yes	No	ChatQA: Surpassing GPT-4 on Conversational QA and RAG	2024-01-18	-
21	Blended RAG	42.63	No	No	Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers	2024-03-22	Code
22	Atlas (few-shot, k=64, Wiki-dec-2021+CC index)	42.4	No	No	Atlas: Few-shot Learning with Retrieval Augmented Language Models	2022-08-05	Code
23	DPR	41.5	No	Yes	Dense Passage Retrieval for Open-Domain Question Answering	2020-04-10	Code
24	REALM	40.4	No	Yes	REALM: Retrieval-Augmented Language Model Pre-Training	2020-02-10	Code
25	LLaMA 65B (few-shot, k=64)	39.9	No	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
26	PaLM-540B (Few-Shot, k=64)	39.6	No	No	PaLM: Scaling Language Modeling with Pathways	2022-04-05	Code
27	PaLM 2-L (one-shot)	37.5	No	No	PaLM 2 Technical Report	2023-05-17	Code
28	Chinchilla (few-shot, k=64)	35.5	No	No	Training Compute-Optimal Large Language Models	2022-03-29	Code
29	LLaMA 65B (few-shot, k=5)	35	No	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
30	Search-o1	34	No	No	Search-o1: Agentic Search-Enhanced Large Reasoning Models	2025-01-09	Code
31	LLaMA 2 70B (one-shot)	33	No	No	Llama 2: Open Foundation and Fine-Tuned Chat Models	2023-07-18	Code
32	GLaM 62B/64E (Few-Shot)	32.5	No	No	GLaM: Efficient Scaling of Language Models with Mixture-of-Experts	2021-12-13	-
33	PaLM 2-M (one-shot)	32	No	No	PaLM 2 Technical Report	2023-05-17	Code
34	LLaMA 65B (one-shot)	31	No	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
35	GPT-3 175B (Few-Shot, k=64)	29.9	No	No	Language Models are Few-Shot Learners	2020-05-28	Code
36	PaLM-540B (One-Shot)	29.3	No	No	PaLM: Scaling Language Modeling with Pathways	2022-04-05	Code
37	Mistral 7B (5-shot)	28.8	No	No	Mistral 7B	2023-10-10	Code
38	Gopher (few-shot, k=64)	28.2	No	No	Scaling Language Models: Methods, Analysis & Insights from Training Gopher	2021-12-08	Code
39	GLaM 62B/64E (One-Shot)	26.3	No	No	GLaM: Efficient Scaling of Language Models with Mixture-of-Experts	2021-12-13	-
40	LLaMA 7B (Contriever)	26.07	No	No	-	-	-
41	PaLM 2-S (one-shot)	25.3	No	No	PaLM 2 Technical Report	2023-05-17	Code
42	LLaMA 33B (zero-shot)	24.9	No	No	LLaMA: Open and Efficient Foundation Language Models	2023-02-27	Code
43	GLaM 62B/64E (Zero-Shot)	24.7	No	No	GLaM: Efficient Scaling of Language Models with Mixture-of-Experts	2021-12-13	-
44	PaLM-540B (Zero-Shot)	21.2	No	No	PaLM: Scaling Language Modeling with Pathways	2022-04-05	Code
45	Neo-6B (QA)	19.7	No	No	Ask Me Anything: A simple strategy for prompting language models	2022-10-05	Code
46	Neo-6B (QA + WS)	19.6	No	No	Ask Me Anything: A simple strategy for prompting language models	2022-10-05	Code
47	Neo-6B (Few-Shot)	13.7	No	No	Ask Me Anything: A simple strategy for prompting language models	2022-10-05	Code