CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset
Zhang, Hanchong; Li, Jieyu; Chen, Lu; Cao, Ruisheng; Zhang, Yunyan; Huang, Yu; Zheng, Yefeng; Yu, Kai
arXiv.org Artificial Intelligence
The cross-domain text-to-SQL task aims to build a system that can parse user questions into SQL on completely unseen databases, while the single-domain text-to-SQL task evaluates performance on identical databases. Both setups face unavoidable difficulties in real-world applications. To this end, we introduce the cross-schema text-to-SQL task, where the databases in the evaluation data differ from those in the training data but come from the same domain. Furthermore, we present CSS, a large-scale CrosS-Schema Chinese text-to-SQL dataset, to support corresponding studies. CSS originally consisted of 4,340 question/SQL pairs across 2 databases. To generalize models to different medical systems, we extend CSS and create 19 new databases along with 29,280 corresponding dataset examples. Moreover, CSS is also a large corpus for single-domain Chinese text-to-SQL studies. We present the data collection approach and a series of analyses of the data statistics. To show the potential and usefulness of CSS, we benchmark baseline models and report the results. Our dataset is publicly available at https://huggingface.co/datasets/zhanghanchong/css.
May-25-2023
- Country:
- Asia > China (0.47)
- Europe (0.93)
- North America > United States (0.68)
- Genre:
- Research Report (0.50)
- Industry:
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.68)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Representation & Reasoning (0.93)
- Databases (1.00)