Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL
Cai, Qifeng, Liang, Hao, Xu, Chang, Xie, Tao, Zhang, Wentao, Cui, Bin
The data-centric paradigm has emerged as a pivotal direction in artificial intelligence (AI), relying on high-quality training data. This shift is especially critical in the Text-to-SQL task, where model performance is constrained by the scarcity, limited diversity, and structural simplicity of existing datasets. Our framework operates along six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language (NL) question generation, chain-of-thought (CoT) reasoning trace generation, and data classification. A modular Database Manager further ensures cross-database compatibility and scalability. This approach enables structure-aware example matching by modeling fine-grained alignments between NL questions and SQL queries. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and underscores the indispensable role of structured, high-fidelity data in modern AI development. Our code is available at https://github.com/T In recent years, the data-centric artificial intelligence (AI) paradigm has garnered increasing attention [1], [2]. Traditional algorithm-centric approaches primarily focus on expanding model architectures and optimizing learning algorithms. However, in many cutting-edge fields, the main bottleneck of development has gradually shifted from algorithmic complexity to the availability of high-quality data. Continuous optimization of algorithms is facing diminishing marginal returns, while vast amounts of data remain underutilized, containing immense potential value. Taking large language models (LLMs) as an example, their generalization ability and robustness highly depend on the breadth and quality of the training data. Similarly, in downstream tasks such as domain adaptation, high-quality data can serve both as reference material for generating answers and as guidance for solving problems [4].
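The abstract above names SQL execution verification as one pipeline stage but does not say how it is done. A minimal sketch of the general idea, assuming an in-memory SQLite database and a hypothetical helper `verify_sql` (neither is taken from the paper): a generated query is kept only if it actually executes against the schema.

```python
import sqlite3


def verify_sql(schema_ddl: str, sql: str) -> tuple[bool, str]:
    """Run `sql` against a throwaway in-memory SQLite database.

    Returns (True, "") if the query executes, otherwise (False, error message).
    Hypothetical helper for illustration; not the paper's implementation.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # build the schema and sample rows
        conn.execute(sql).fetchall()     # force full execution of the query
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Toy schema, assumed for the example only.
SCHEMA = """
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 41);
"""

candidates = [
    "SELECT name FROM singer WHERE age > 35",
    "SELECT nme FROM singer",  # misspelled column, fails at execution time
]

# Keep only the augmented queries that execute successfully.
kept = [q for q in candidates if verify_sql(SCHEMA, q)[0]]
```

Execution success is only a necessary condition: a query can run and still return the wrong rows, so a real pipeline would likely combine this filter with further checks.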
Reliable Curation of EHR Dataset via Large Language Models under Environmental Constraints
Xiong, Raymond M., Chen, Panyu, Dong, Tianze, Lu, Jian, Goldstein, Benjamin, Zhuo, Danyang, Zhang, Anru R.
Electronic health records (EHRs) are central to modern healthcare delivery and research; yet, many researchers lack the database expertise necessary to write complex SQL queries or generate effective visualizations, limiting efficient data use and scientific discovery. To address this barrier, we introduce CELEC, a large language model (LLM)-powered framework for automated EHR data extraction and analytics. CELEC translates natural language queries into SQL using a prompting strategy that integrates schema information, few-shot demonstrations, and chain-of-thought reasoning, which together improve accuracy and robustness. On a subset of the EHRSQL benchmark, CELEC achieves execution accuracy comparable to prior systems while maintaining low latency, cost efficiency, and strict privacy by exposing only database metadata to the LLM. CELEC also adheres to strict privacy protocols: the LLM accesses only database metadata (e.g., table and column names), while all query execution occurs securely within the institutional environment, ensuring that no patient-level data is ever transmitted to or shared with the LLM. Ablation studies confirm that each component of the SQL generation pipeline, particularly the few-shot demonstrations, plays a critical role in performance. By lowering technical barriers and enabling medical researchers to query EHR databases directly, CELEC streamlines research workflows and accelerates biomedical discovery.
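The privacy property described above (the LLM sees only table and column names, never patient rows) can be illustrated with a small sketch. The function names `schema_metadata` and `build_prompt`, the prompt wording, and the toy `patients` table are assumptions made for this example; they are not CELEC's actual code or prompt.

```python
import sqlite3


def schema_metadata(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Return {table: [column names]} using catalog queries only, no row data."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    return {t: [col[1] for col in conn.execute(f'PRAGMA table_info("{t}")')]
            for t in tables}


def build_prompt(metadata, few_shot, question: str) -> str:
    """Assemble schema text, few-shot demonstrations, and a CoT instruction."""
    schema = "\n".join(f"{t}({', '.join(cols)})" for t, cols in metadata.items())
    shots = "\n\n".join(f"Q: {q}\nSQL: {s}" for q, s in few_shot)
    return (f"Schema:\n{schema}\n\nExamples:\n{shots}\n\n"
            f"Q: {question}\nThink step by step, then give the SQL.")


# Toy database standing in for an institutional EHR store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (subject_id INTEGER, dob TEXT)")
conn.execute("INSERT INTO patients VALUES (10001, '1950-02-03')")

prompt = build_prompt(
    schema_metadata(conn),
    [("How many patients are there?", "SELECT COUNT(*) FROM patients")],
    "List all subject ids.",
)
```

The resulting prompt contains `patients(subject_id, dob)` but none of the stored values; the generated SQL would then be executed locally, inside the institutional environment.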
MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction
Abedini, Sepideh, Mohapatra, Shubhankar, Emerson, D. B., Shafieinejad, Masoumeh, Cresswell, Jesse C., He, Xi
Large language models (LLMs) have shown promising performance on tasks that require reasoning, such as text-to-SQL, code generation, and debugging. However, regulatory frameworks with strict privacy requirements constrain their integration into sensitive systems. State-of-the-art LLMs are also proprietary, costly, and resource-intensive, making local deployment impractical. Consequently, utilizing such LLMs often requires sharing data with third-party providers, raising privacy concerns and risking noncompliance with regulations. Although fine-tuned small language models (SLMs) can outperform LLMs on certain tasks and be deployed locally to mitigate privacy concerns, they underperform on more complex tasks such as text-to-SQL translation. In this work, we introduce MaskSQL, a text-to-SQL framework that utilizes abstraction as a privacy protection mechanism to mask sensitive information in LLM prompts. Unlike redaction, which removes content entirely, or generalization, which broadens tokens, abstraction retains essential information while discarding unnecessary details, striking an effective privacy-utility balance for the text-to-SQL task. Moreover, by providing mechanisms to control the privacy-utility tradeoff, MaskSQL facilitates adoption across a broader range of use cases. Our experimental results show that MaskSQL outperforms leading SLM-based text-to-SQL models and achieves performance approaching state-of-the-art LLM-based models, while preserving privacy.
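To make the idea of abstraction concrete, here is a deliberately simplified sketch: sensitive terms are replaced by typed placeholders before the prompt leaves the local environment, and the placeholders in the returned SQL are mapped back afterwards. The helpers `mask` and `unmask`, the placeholder format, and the example terms are all hypothetical; the abstract does not specify MaskSQL's actual mechanism, which is likely more involved.

```python
import re


def mask(text: str, sensitive: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace each sensitive term with a typed placeholder such as [TABLE_1].

    `sensitive` maps a term to its kind (e.g. TABLE, COLUMN, VALUE). The kind
    is kept so the remote model still knows what role the token plays.
    """
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}
    for term, kind in sensitive.items():
        counters[kind] = counters.get(kind, 0) + 1
        placeholder = f"[{kind}_{counters[kind]}]"
        mapping[placeholder] = term
        text = re.sub(rf"\b{re.escape(term)}\b", placeholder, text,
                      flags=re.IGNORECASE)
    return text, mapping


def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore the original terms in text returned by the remote model."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


question = "What is the salary of Alice Smith in the employees table?"
sensitive = {"Alice Smith": "VALUE", "employees": "TABLE", "salary": "COLUMN"}

masked, mapping = mask(question, sensitive)

# Stand-in for the SQL a remote LLM might return for the masked question.
remote_sql = "SELECT [COLUMN_1] FROM [TABLE_1] WHERE name = '[VALUE_1]'"
local_sql = unmask(remote_sql, mapping)
```

Unlike redaction, the placeholder keeps the role of each token (table, column, value), which is the information a model needs to produce structurally correct SQL.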
SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection
Gashkov, Aleksandr, Perevalov, Aleksandr, Eltsova, Maria, Both, Andreas
Nowadays, the importance of software with natural-language user interfaces cannot be overstated. In Question Answering (QA) systems working over Knowledge Graphs (KGQA), the central task is generating a SPARQL query for a given natural-language question (often called Query Building) from the information retrieved from that question. With the rise of Large Language Models (LLMs), they are considered a well-suited method for improving the quality and trustworthiness of question-answering functionality, where there is still considerable room for improvement. However, LLMs are trained on web data, and researchers have no control over whether the benchmark or the knowledge graph was already included in the training data. In this paper, we introduce a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question under various conditions: (1) zero-shot SPARQL generation, (2) with knowledge injection, and (3) with "anonymized" knowledge injection. This enables us, for the first time, to estimate the influence of the training data on the QA quality improved by LLMs. Ultimately, this will help to identify how portable a method is or whether good results are achieved mostly because a benchmark was already included in the training data (cf. LLM memorization). The developed method is portable, robust, and supports any knowledge graph; therefore, it can be easily applied to any KGQA system or LLM, making it possible to generate consistent insights into actual LLM capabilities.
Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation
Tian, Yuan, Lee, Daniel, Wu, Fei, Mai, Tung, Qian, Kun, Sahai, Siddhartha, Zhang, Tianyi, Li, Yunyao
Text-to-SQL models, which parse natural language (NL) questions to executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing text-to-SQL models experience significant performance drops when applied to new schemas, primarily due to the lack of domain-specific data for fine-tuning. This data scarcity also limits the ability to effectively evaluate model performance in new domains. Continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through human-LLM collaboration in a structured workflow. A within-subjects user study comparing SQLsynth with manual annotation and ChatGPT shows that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/adobe/nl_sql_analyzer.
Text-to-SQL based on Large Language Models and Database Keyword Search
Nascimento, Eduardo R., Avila, Caio Viktor S., Izquierdo, Yenier T., García, Grettel M., Andrade, Lucas Feijó L., Facina, Michelle S. P., Lemos, Melissa, Casanova, Marco A.
Text-to-SQL prompt strategies based on Large Language Models (LLMs) achieve remarkable performance on well-known benchmarks. However, when applied to real-world databases, their performance is significantly lower than on these benchmarks, especially for Natural Language (NL) questions that require complex filters and joins. This paper proposes a strategy to compile NL questions into SQL queries that incorporates dynamic few-shot example selection and leverages the services provided by a database keyword search (KwS) platform. The paper details how the precision and recall of the schema-linking process are improved with the help of the examples provided and the keyword-matching service that the KwS platform offers. Then, it shows how the KwS platform can be used to synthesize a view that captures the joins required to process an input NL question, thereby simplifying the SQL query compilation step. The paper includes experiments with a real-world relational database to assess the performance of the proposed strategy. The experiments suggest that the strategy achieves an accuracy on the real-world relational database that surpasses state-of-the-art approaches. The paper concludes by discussing the results obtained.
DataVisT5: A Pre-trained Language Model for Jointly Understanding Text and Data Visualization
Wan, Zhuoyue, Song, Yuanfeng, Li, Shuaimin, Zhang, Chen Jason, Wong, Raymond Chi-Wing
Data visualization (DV) is a fundamental tool for efficiently conveying the insights behind big data and is widely adopted in today's data-driven world. Task automation in DV, such as converting natural language queries to visualizations (i.e., text-to-vis), generating explanations from visualizations (i.e., vis-to-text), answering DV-related questions in free form (i.e., FeVisQA), and explicating tabular data (i.e., table-to-text), is vital for advancing the field. Despite their potential, the application of pre-trained language models (PLMs) like T5 and BERT in DV has been limited by high costs and challenges in handling cross-modal information, leading to few studies on PLMs for DV. We introduce DataVisT5, a novel PLM tailored for DV that enhances the T5 architecture through a hybrid objective pre-training and multi-task fine-tuning strategy, integrating text and DV datasets to effectively interpret cross-modal semantics. Extensive evaluations on public datasets show that DataVisT5 consistently outperforms current state-of-the-art models on various DV-related tasks. We anticipate that DataVisT5 will not only inspire further research on vertical PLMs but also expand the range of applications for PLMs.
Making LLMs Work for Enterprise Data Tasks
Demiralp, Çağatay, Wenz, Fabian, Chen, Peter Baile, Kayali, Moe, Tatbul, Nesime, Stonebraker, Michael
Large language models (LLMs) have shown strong performance on natural language (NL) comprehension tasks, from summarization to question answering. The power of these models comes from optimizing for simple self-supervised learning tasks such as next token prediction using massive public web texts as training data on a scalable and adaptive architecture. However, by construction, LLMs know little about enterprise database tables in the private data ecosystem, which differ substantially from web text in structure and content. Given that LLMs' performance is tied to their training data [1], a crucial question is how useful they can be in improving enterprise database management and analysis tasks. To help address this question, we contribute (1) preliminary experimental results on the performance of LLMs for text-to-SQL and semantic column-type detection tasks on enterprise datasets and (2) a discussion of challenges and potential solutions for effectively utilizing LLMs in enterprise settings.
TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring
Lee, Gyubok, Chay, Woosog, Cho, Seonhee, Choi, Edward
Text-to-SQL enables users to interact with databases using natural language, simplifying the retrieval and synthesis of information. Despite the remarkable success of large language models (LLMs) in translating natural language questions into SQL queries, widespread deployment remains limited due to two primary challenges. First, the effective use of text-to-SQL models depends on users' understanding of the model's capabilities, that is, the scope of questions the model can correctly answer. Second, the absence of abstention mechanisms can lead to incorrect SQL generation going unnoticed, thereby undermining trust in the model's output. To enable wider deployment, it is crucial to address these challenges in model design and enhance model evaluation to build trust in the model's output. To this end, we introduce TrustSQL, a novel comprehensive benchmark designed to evaluate text-to-SQL reliability, defined as a model's ability to correctly handle any type of input question by generating correct SQL queries for feasible questions and abstaining from answering infeasible ones (e.g., due to schema incompatibility or functionalities beyond SQL). We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches: (1) pipeline-based methods combining SQL generators with infeasible question detectors and SQL error detectors for abstention; and (2) unified methods using a single model for the entire task. Our experimental results reveal that achieving high scores under severe penalties requires significant effort and provide a new perspective on developing text-to-SQL models for safer deployment. TrustSQL is available at https://github.com/glee4810/TrustSQL.
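The abstract does not spell out the penalty-based metric, so the following is only a sketch of how such a score could work. The exact rule is an assumption made for illustration: +1 for correct SQL on a feasible question or for abstaining on an infeasible one, 0 for abstaining on a feasible question, and a penalty of -c for any wrong or unwarranted SQL. The names `penalty_score` and `mean_score` are hypothetical.

```python
def penalty_score(feasible: bool, abstained: bool, correct: bool, c: float) -> float:
    """Score one prediction under an assumed penalty rule (see lead-in).

    feasible:  the question can be answered with SQL over the schema
    abstained: the model declined to produce SQL
    correct:   the produced SQL is right (ignored when abstained)
    c:         penalty weight for harmful outputs
    """
    if feasible:
        if abstained:
            return 0.0                   # safe but unhelpful
        return 1.0 if correct else -c    # wrong SQL is penalized
    # Infeasible question: abstaining is the only correct behavior.
    return 1.0 if abstained else -c


def mean_score(records, c: float) -> float:
    """Average the per-example scores over (feasible, abstained, correct) tuples."""
    return sum(penalty_score(f, a, ok, c) for f, a, ok in records) / len(records)


# Five toy predictions covering every case of the assumed rule.
records = [
    (True, False, True),    # feasible, answered correctly    -> +1
    (True, False, False),   # feasible, answered incorrectly  -> -c
    (True, True, False),    # feasible, abstained             ->  0
    (False, True, False),   # infeasible, abstained           -> +1
    (False, False, False),  # infeasible, SQL generated       -> -c
]

lenient = mean_score(records, c=0)  # errors cost nothing
strict = mean_score(records, c=5)   # each error cancels five correct answers
```

With these toy records the lenient score is 0.4 and the strict score is -1.6, which shows why a model that rarely abstains can look acceptable without penalties and poor once mistakes are costly.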
Large Language Model for Table Processing: A Survey
Lu, Weizheng, Zhang, Jiaming, Zhang, Jing, Chen, Yueguo
Tables, typically two-dimensional and structured to store large amounts of data, are essential in daily activities like database queries, spreadsheet calculations, and generating reports from web tables. Automating these table-centric tasks with Large Language Models (LLMs) offers significant public benefits, garnering interest from academia and industry. This survey provides an extensive overview of table tasks, encompassing not only the traditional areas like table question answering (Table QA) and fact verification, but also newly emphasized aspects such as table manipulation and advanced table data analysis. Additionally, it goes beyond the early strategies of pre-training and fine-tuning small language models, to include recent paradigms in LLM usage. The focus here is particularly on instruction-tuning, prompting, and agent-based approaches within the realm of LLMs. Finally, we highlight several challenges, ranging from private deployment and efficient inference to the development of extensive benchmarks for table manipulation and advanced data analysis.