DataGpt-SQL-7B: An Open-Source Language Model for Text-to-SQL

Lixia Wu, Peng Li, Junhong Lou, Lei Fu

2024-09-24 | Text-To-SQL, Natural Language Queries, Language Modelling

Abstract

Translating natural language queries into SQL commands plays a pivotal role in data access. We propose a suite of compact, fine-tuned models and self-refine mechanisms that democratize data access and analysis for non-expert users while mitigating the risks associated with closed-source Large Language Models. Specifically, we constructed a Text-to-SQL dataset of over 20K samples, together with a preference dataset, to improve the efficiency of SQL generation. To further ensure code validity, a code corrector was integrated into the model. Our system, DataGpt-sql, achieved 87.2% execution accuracy on the Spider dev set, showcasing the effectiveness of our solution in text-to-SQL conversion tasks. Our code, data, and models are available at https://github.com/CainiaoTechAi/datagpt-sql-7b
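The abstract does not spell out how the code corrector works. As a rough illustration only, the sketch below shows one way an invalid-SQL feedback loop could be wired up: a candidate query is checked against the database, and any error message is passed back to the generator for another attempt. The names `validate_sql`, `generate_with_feedback`, and `stub_model` are hypothetical, SQLite stands in for the target database, and the stub replaces the real model; none of this is taken from the authors' code.

```python
import sqlite3
from typing import Callable, Optional


def validate_sql(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if SQLite can plan the query, otherwise the error message."""
    try:
        # EXPLAIN QUERY PLAN compiles the statement without running it.
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)


def generate_with_feedback(
    question: str,
    conn: sqlite3.Connection,
    generate: Callable[[str, Optional[str]], str],
    max_rounds: int = 2,
) -> str:
    """Ask the generator again, with the error text, while the SQL is invalid."""
    sql = generate(question, None)
    for _ in range(max_rounds):
        error = validate_sql(conn, sql)
        if error is None:
            break
        # Assumed feedback format: previous attempt plus the database error.
        feedback = f"Previous SQL: {sql}\nError: {error}"
        sql = generate(question, feedback)
    return sql


def stub_model(question: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for the model: wrong column first, fixed on retry."""
    if feedback is None:
        return "SELECT nme FROM singer"
    return "SELECT name FROM singer"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT)")
fixed_sql = generate_with_feedback("List all singer names.", conn, stub_model)
```

In this sketch the loop only catches queries the database rejects; a query that is valid but semantically wrong would pass through unchanged.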

Results

Task	Dataset	Metric	Value	Model
Semantic Parsing	spider	Exact Match Accuracy (Dev)	81.6	datagpt-sql-7B + InvalidSQL-Feedback
Semantic Parsing	spider	Execution Accuracy (Dev)	87.2	datagpt-sql-7B + InvalidSQL-Feedback
Semantic Parsing	spider	Exact Match Accuracy (Dev)	80.3	datagpt-sql-7B
Semantic Parsing	spider	Execution Accuracy (Dev)	84.8	datagpt-sql-7B
Text-To-SQL	spider	Exact Match Accuracy (Dev)	81.6	datagpt-sql-7B + InvalidSQL-Feedback
Text-To-SQL	spider	Execution Accuracy (Dev)	87.2	datagpt-sql-7B + InvalidSQL-Feedback
Text-To-SQL	spider	Exact Match Accuracy (Dev)	80.3	datagpt-sql-7B
Text-To-SQL	spider	Execution Accuracy (Dev)	84.8	datagpt-sql-7B
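
The table reports two metrics that can disagree: exact match compares the structure of the predicted query with the reference query, while execution accuracy compares the results the two queries return. The sketch below illustrates the difference on a toy SQLite table. It is a simplification under stated assumptions: `execution_match` and `normalized_string_match` are hypothetical helpers, and the string comparison is only a crude proxy for Spider's official exact-match metric, which compares parsed query components rather than raw text.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Row order is ignored here; an ORDER BY-aware check would compare lists.
    return Counter(predicted_rows) == Counter(gold_rows)


def normalized_string_match(predicted_sql: str, gold_sql: str) -> bool:
    """Crude proxy for exact match: compare case- and whitespace-normalized text."""
    def normalize(sql: str) -> str:
        return " ".join(sql.lower().rstrip(";").split())
    return normalize(predicted_sql) == normalize(gold_sql)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [(1, "Ann", 30), (2, "Bob", 25)],
)

gold = "SELECT name FROM singer WHERE age > 26"
predicted = "SELECT name FROM singer WHERE NOT age <= 26"

same_result = execution_match(conn, predicted, gold)
same_text = normalized_string_match(predicted, gold)
```

Here the two queries return the same rows but differ as text, which is one reason execution accuracy in the table is higher than exact match accuracy for the same model.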
