Semantic Parsing on BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation)

Metric: Execution Accuracy % (Dev) (higher is better)

Results

#	Model	Execution Accuracy % (Dev)	Extra Data	Paper	Date	Code	SOTA
1	DSAIR + GPT-4o	74.32	No	-	-	-	-
2	XiYan-SQL	73.34	No	A Preview of XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL	2024-11-13	Code	Yes
3	CHASE-SQL + Gemini	73.14	No	CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL	2024-10-02	-	Yes
4	ExSL + granite-34b-code	72.43	No	-	-	-	-
5	Insights AI	72.16	No	-	-	-	-
6	OpenSearch-SQL+ v2 + GPT-4o	69.3	No	-	-	-	-
7	MCTS-SQL	68.91	No	-	-	-	-
8	PURPLE + RED + GPT-4o	68.12	No	-	-	-	-
9	Arcwise + GPT-4o	67.99	No	-	-	-	-
10	Distillery + GPT-4o	67.21	No	The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models	2024-08-14	-	Yes
11	RECAP + Gemini	66.95	No	-	-	-	-
12	MSL-SQL + DeepSeek-V2.5	66.82	No	-	-	-	-
13	MSc-SQL	65.6	No	MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation	2024-10-16	Code	-
14	ByteBrain	65.45	No	-	-	-	-
15	ExSL + granite-20b-code	65.38	No	-	-	-	-
16	CHESS	65	No	CHESS: Contextual Harnessing for Efficient SQL Synthesis	2024-05-27	Code	Yes
17	SCL-SQL	64.73	No	-	-	-	-
18	SFT CodeS-15B + SQLFixAgent	64.62	No	-	-	-	-
19	MCS-SQL + GPT-4	63.36	No	-	-	-	-
20	PURPLE + GPT-4o	62.97	No	-	-	-	-
21	GRA-SQL	62.58	No	-	-	-	-
22	OpenSearch-SQL v1 + GPT-4	61.34	No	-	-	-	-
23	PB-SQL v1	60.5	No	-	-	-	-
24	Dubo-SQL, v1	59.71	No	-	-	-	-
25	SuperSQL	58.5	No	-	-	-	-
26	SFT CodeS-15B	58.47	No	-	-	-	-
27	MAC-SQL + GPT-4	57.56	No	MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL	2023-12-18	Code	Yes
28	SFT CodeS-7B	57.17	No	-	-	-	-
29	SENSE-13B	55.48	No	-	-	-	-
30	SENSE	55.48	No	-	-	-	-
31	DAIL-SQL + GPT-4	54.76	No	Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation	2023-08-29	Code	Yes
32	DIN-SQL + GPT-4	50.72	No	DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction	2023-04-21	Code	Yes
33	DELLM + MAC-SQL	48.92	No	Knowledge-to-SQL: Enhancing SQL Generation with Data Expert LLM	2024-02-18	Code	-
34	GPT-4 (Baseline)	46.35	No	Can LLMs Effectively Leverage Graph Structural Information through Prompts, and Why?	2023-09-28	Code	-
35	Claude-2 (Baseline)	42.7	No	Can LLMs Effectively Leverage Graph Structural Information through Prompts, and Why?	2023-09-28	Code	-
36	Open SQL-7B	37.68	No	-	-	-	-
37	ChatGPT (Baseline)	37.22	No	Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs	2023-05-04	Code	-
38	CoT + ChatGPT	36.64	No	Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs	2023-05-04	Code	-
39	Codex (Baseline)	34.35	No	Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs	2023-05-04	Code	-
40	Palm-2 (Baseline)	27.38	No	Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs	2023-05-04	Code	-