A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts
Bedrick, Steven, Doğruöz, A. Seza, Nisioi, Sergiu
arXiv.org Artificial Intelligence
Synthetic datasets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where significant and long-standing challenges (e.g., privacy, anonymization, and data governance) have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues, which are especially sensitive and difficult to collect and, as such, are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may best be used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the medical domain. Additionally, we propose a novel typology for classifying types and degrees of data synthesis, to facilitate comparison and evaluation.
Nov-20-2025
- Country:
- Africa > Kenya (0.04)
- Asia
- Europe
- Romania > București - Ilfov Development Region
- Municipality of Bucharest > Bucharest (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Finland > Uusimaa
- Helsinki (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.28)
- Oxfordshire > Oxford (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.05)
- Italy > Piedmont
- Turin Province > Turin (0.04)
- Greece > Central Macedonia
- Thessaloniki (0.04)
- North America
- Canada > Ontario
- Toronto (0.05)
- United States
- District of Columbia > Washington (0.04)
- Georgia > Fulton County
- Atlanta (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Massachusetts
- Middlesex County > Cambridge (0.04)
- Suffolk County > Boston (0.04)
- New York > New York County
- New York City (0.04)
- Oregon (0.04)
- Genre:
- Overview (1.00)
- Industry:
- Health & Medicine
- Consumer Health (0.67)
- Health Care Providers & Services (0.67)
- Health Care Technology > Medical Record (0.47)
- Therapeutic Area (1.00)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.68)
- Natural Language
- Chatbot (0.68)
- Large Language Model (0.96)
- Machine Translation (0.93)
- Text Processing (0.67)
- Representation & Reasoning (1.00)
- Data Science > Data Mining (1.00)