Anonymization by Design of Language Modeling
Boutet, Antoine; Kazdam, Zakaria El; Magnana, Lucas; Zimmermann, Helain
arXiv.org Artificial Intelligence
However, these advances raise significant privacy concerns, especially when models specialized on sensitive data can memorize and then expose and regurgitate confidential information. This paper presents a privacy-by-design language modeling approach to address the problem of language model anonymization, and thus promote their sharing. Specifically, we propose both a Masking Language Modeling (MLM) methodology to specialize a BERT-like language model, and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model, each of which prevents the model from memorizing direct and indirect identifying information present in the training data.

Johnson et al. proposed to use a neural network based on a BERT architecture [15] to detect a number of identifying elements in medical documents. More recently, different hospitals have also explored the feasibility of using NLP models to automatically pseudonymize text documents (i.e., hide specific direct identifiers, named Personally Identifiable Information (PII)) from their clinical data warehouse [35, 45]. In these approaches, the BERT model is first fine-tuned with the medical reports from the hospital (in order to specialize it and let it better understand the reports generated by the local practitioners), and a Named Entity Recognition (NER) model is then trained on a set of Personally Identifiable Information that directly identifies patients.
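To make the pseudonymization step mentioned above concrete, here is a minimal sketch of hiding direct identifiers once they have been detected. It is not the method of the paper or of the cited hospital systems: the function name `pseudonymize`, the `(start, end, label)` span format, and the placeholder style `[LABEL]` are assumptions made for illustration, and the spans are supplied by hand rather than produced by a fine-tuned NER model.

```python
from typing import List, Tuple

# (start, end, label) character offsets. In a real pipeline these would come
# from an NER model; here they are built by hand (assumption).
Span = Tuple[int, int, str]


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Replace each identifier span with a placeholder such as [NAME]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])  # text before the identifier
        pieces.append(f"[{label}]")        # placeholder instead of the identifier
        cursor = end
    pieces.append(text[cursor:])           # remaining text after the last span
    return "".join(pieces)


# Hypothetical example sentence; offsets are computed rather than hard-coded.
report = "Patient John Doe was admitted on 2024-03-02."
name, date = "John Doe", "2024-03-02"
spans = [
    (report.index(name), report.index(name) + len(name), "NAME"),
    (report.index(date), report.index(date) + len(date), "DATE"),
]
masked = pseudonymize(report, spans)
print(masked)
```

Note that this kind of replacement only hides the direct identifiers that the detector finds; indirect identifiers and missed entities remain in the text, which is the kind of gap the privacy-by-design approach in the abstract is meant to address.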
Jan-4-2025
- Genre:
- Research Report > Experimental Study (0.34)
- Industry:
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Health Care Technology
- Medical Record (1.00)
- Technology: