Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing
Manasi Patwardhan, Ayush Agarwal, Shabbirhussain Bhaisaheb, Aseem Arora, Lovekesh Vig, Sunita Sarawagi
arXiv.org Artificial Intelligence
Abstract: The performance of Large Language Models (LLMs) in translating Natural Language (NL) queries into SQL varies significantly across databases (DBs). NL queries are often expressed in a domain-specific vocabulary, and mapping them to the correct SQL requires an understanding of the embedded domain expressions and their relationship to the DB schema structure. Existing benchmarks rely on unrealistic, ad hoc, query-specific textual hints to express domain knowledge. In this paper, we propose a systematic framework for associating structured domain statements at the database level. We retrieve the structured domain statements relevant to a user query using sub-string level match. We evaluate on eleven realistic DB schemas covering diverse domains across five open-source and proprietary LLMs and demonstrate that (1) DB-level structured domain statements are more practical and accurate than existing ad hoc, query-specific textual domain statements, and (2) our sub-string match based retrieval of relevant domain statements provides significantly higher accuracy than other retrieval approaches.

The impressive natural language understanding and code generation capabilities of modern LLMs have led to significantly improved performance on NL-SQL semantic parsing [1], [2]. However, their accuracy varies widely with the database queried [3]. DBs in WikiSQL [4] or Spider [5] contain semantically meaningful table/column names and cell values, making it easier for LLMs to accurately link domain expressions in the NL query with the DB schema/cell elements.
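To make the idea of sub-string level retrieval concrete, the following is a minimal Python sketch, not the authors' implementation. The excerpt does not specify the statement format or the matching rule, so everything here is an assumption for illustration: the `DomainStatement` fields, the table and column names, the use of the longest common sub-string as the match signal, and the `min_coverage` threshold.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class DomainStatement:
    """A DB-level domain statement (hypothetical structure)."""
    expression: str  # domain phrase as it may appear in an NL query
    definition: str  # how the phrase maps onto the schema (illustrative)


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous block shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is formatting-agnostic."""
    return " ".join(text.lower().split())


def retrieve(query, statements, min_coverage=0.8):
    """Return statements whose expression is mostly covered by one
    contiguous sub-string of the query, best coverage first.

    min_coverage is an assumed threshold, not a value from the paper.
    """
    q = normalize(query)
    scored = []
    for stmt in statements:
        expr = normalize(stmt.expression)
        if not expr:
            continue
        coverage = longest_common_substring(q, expr) / len(expr)
        if coverage >= min_coverage:
            scored.append((coverage, stmt))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [stmt for _, stmt in scored]


# Hypothetical statements for an imagined hospital/orders schema.
STATEMENTS = [
    DomainStatement("senior citizen", "patients.age >= 65"),
    DomainStatement("readmission", "visits.prior_visit_id IS NOT NULL"),
    DomainStatement("high value order", "orders.total > 1000"),
]

relevant = retrieve(
    "How many senior citizens had a readmission last year?", STATEMENTS
)
```

The retrieved statements would then be added to the LLM prompt alongside the schema. Matching on contiguous sub-strings rather than whole tokens lets "senior citizens" in the query hit the statement for "senior citizen" without stemming; the trade-off is that short expressions can match spuriously, which is why a coverage threshold is assumed here.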
Oct-6-2025