Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, Bin Cui

arXiv.org Artificial Intelligence 

Abstract: The data-centric paradigm has emerged as a pivotal direction in artificial intelligence (AI), relying on high-quality training data. This shift is especially critical in the Text-to-SQL task, where model performance is constrained by the scarcity, limited diversity, and structural simplicity of existing datasets. Our framework, Text2SQL-Flow, operates along six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language (NL) question generation, chain-of-thought (CoT) reasoning trace generation, and data classification. A modular Database Manager further ensures cross-database compatibility and scalability. This approach enables structure-aware example matching by modeling fine-grained alignments between NL questions and SQL queries. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and underscores the indispensable role of structured, high-fidelity data in modern AI development. Our code is available at https://github.com/T

In recent years, the data-centric artificial intelligence (AI) paradigm has garnered increasing attention [1], [2]. Traditional algorithm-centric approaches focus primarily on expanding model architectures and optimizing learning algorithms. However, in many cutting-edge fields, the main bottleneck has gradually shifted from algorithmic complexity to the availability of high-quality data. Continued optimization of algorithms faces diminishing marginal returns, while vast amounts of data remain underutilized despite their considerable potential value. Taking large language models (LLMs) as an example, their generalization ability and robustness depend heavily on the breadth and quality of the training data. Similarly, in downstream tasks such as domain adaptation, high-quality data can serve both as reference material for generating answers and as guidance for solving problems [4].
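
As an illustration of the SQL execution verification step named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes SQLite as the backend and keeps a candidate query only if it runs without error against a throwaway in-memory database. The function name `verify_sql`, the `singer` schema, and the sample queries are hypothetical and chosen only for this example.

```python
import sqlite3
from typing import Tuple


def verify_sql(schema_sql: str, query: str) -> Tuple[bool, str]:
    """Run `query` against a fresh in-memory SQLite database.

    Hypothetical helper: returns (True, row-count summary) if the query
    executes, or (False, error message) if SQLite rejects it.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)         # create tables and seed rows
        rows = conn.execute(query).fetchall()  # raises sqlite3.Error on failure
        return True, f"{len(rows)} row(s)"
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Assumed toy schema, for illustration only.
SCHEMA = """
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Ben', 45);
"""

candidates = [
    "SELECT name FROM singer WHERE age > 40",
    "SELECT title FROM singer",  # invalid: the table has no `title` column
]

# Keep only the augmented queries that actually execute.
kept = [q for q in candidates if verify_sql(SCHEMA, q)[0]]
```

Here `kept` retains only the first query. A filter of this kind checks executability only; it does not check whether a query matches the intent of its paired NL question.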