CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset

Zhang, Hanchong, Li, Jieyu, Chen, Lu, Cao, Ruisheng, Zhang, Yunyan, Huang, Yu, Zheng, Yefeng, Yu, Kai

May-25-2023–arXiv.org Artificial Intelligence

The cross-domain text-to-SQL task aims to build a system that can parse user questions into SQL on completely unseen databases, whereas the single-domain text-to-SQL task evaluates performance on the same databases used for training. Both setups face unavoidable difficulties in real-world applications. To this end, we introduce the cross-schema text-to-SQL task, in which the databases in the evaluation data differ from those in the training data but come from the same domain. Furthermore, we present CSS, a large-scale CrosS-Schema Chinese text-to-SQL dataset, to support corresponding studies. CSS originally consisted of 4,340 question/SQL pairs across 2 databases. To generalize models to different medical systems, we extend CSS and create 19 new databases along with 29,280 corresponding dataset examples. Moreover, CSS is also a large corpus for single-domain Chinese text-to-SQL studies. We present the data collection approach and a series of analyses of the data statistics. To show the potential and usefulness of CSS, we conduct and report benchmark baselines. Our dataset is publicly available at https://huggingface.co/datasets/zhanghanchong/css.
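
The cross-schema setup described in the abstract can be illustrated with a minimal sketch: examples are grouped by database, and whole databases are assigned to either the training or the evaluation side, so no schema is shared between the two. This is not the authors' split procedure; the field name `db_id`, the function name `cross_schema_split`, and the `test_fraction` parameter are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def cross_schema_split(examples, test_fraction=0.2, seed=0):
    """Split examples so that no database appears in both train and test.

    `examples` is a list of dicts; each dict is assumed (hypothetically)
    to carry a "db_id" key naming the database its SQL query targets.
    """
    # Group examples by the database they belong to.
    by_db = defaultdict(list)
    for ex in examples:
        by_db[ex["db_id"]].append(ex)

    # Shuffle database IDs deterministically, then hold out a fraction of them.
    db_ids = sorted(by_db)
    random.Random(seed).shuffle(db_ids)
    n_test = max(1, round(len(db_ids) * test_fraction))

    test = [ex for db in db_ids[:n_test] for ex in by_db[db]]
    train = [ex for db in db_ids[n_test:] for ex in by_db[db]]
    return train, test


# Toy usage: 5 hypothetical databases with 3 examples each.
toy = [
    {"db_id": f"hospital_{i}", "question": f"q{i}_{j}", "query": "SELECT 1"}
    for i in range(5)
    for j in range(3)
]
train, test = cross_schema_split(toy)
```

Splitting at the database level rather than the example level is what separates this setup from a single-domain split, where the same databases appear on both sides.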

artificial intelligence, machine learning, natural language

Country:
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - United States
    - Pennsylvania (0.04)
    - New Jersey (0.04)
    - Michigan (0.04)
    - New York > New York County
      - New York City (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - Ireland (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.04)
  - India > Maharashtra
    - Mumbai (0.04)
  - China
    - Shanghai > Shanghai (0.04)
    - Hong Kong (0.04)
    - Guangdong Province > Shenzhen (0.04)

Genre:
- Research Report (0.50)

Industry:
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.68)

Technology:
- Information Technology
  - Databases (1.00)
  - Artificial Intelligence
    - Natural Language (1.00)
    - Machine Learning (1.00)
    - Representation & Reasoning (0.93)
