Data-Constrained Synthesis of Training Data for De-Identification

Vakili, Thomas, Henriksson, Aron, Dalianis, Hercules

Feb-21-2025–arXiv.org Artificial Intelligence

Many sensitive domains -- such as the clinical domain -- lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study -- using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.

corpora, experiment, synthetic corpora, (15 more...)

arXiv.org Artificial Intelligence

Feb-21-2025

arXiv.org PDF

Add feedback

Country:
- North America
  - Montserrat (0.04)
  - Cuba (0.04)
  - United States > Minnesota
    - Hennepin County > Minneapolis (0.14)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - Middle East > Malta (0.04)
  - Germany > Saarland (0.04)
  - Sweden > Stockholm
    - Stockholm (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found