AITopics | de-identification

Collaborating Authors

de-identification

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods

Prakash, Eva, Attias, Maayane, Chambon, Pierre, Xu, Justin, Truong, Steven, Delbrouck, Jean-Benoit, Cook, Tessa, Langlotz, Curtis

arXiv.org Artificial IntelligenceNov-24-2025

Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.

category, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.04079

Country:

Europe (0.46)
North America > United States > Pennsylvania (0.25)

Genre:

Research Report > New Finding (0.68)
Research Report > Promising Solution (0.50)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Not What the Doctor Ordered: Surveying LLM-based De-identification and Quantifying Clinical Information Loss

Aghakasiri, Kiana, Zambare, Noopur, Thai, JoAnn, Ye, Carrie, Mehta, Mayur, Mitchell, J. Ross, Abdalla, Mohamed

arXiv.org Artificial IntelligenceSep-19-2025

De-identification in the healthcare setting is an application of NLP where automated algorithms are used to remove personally identifying information of patients (and, sometimes, providers). With the recent rise of generative large language models (LLMs), there has been a corresponding rise in the number of papers that apply LLMs to de-identification. Although these approaches often report near-perfect results, significant challenges concerning reproducibility and utility of the research papers persist. This paper identifies three key limitations in the current literature: inconsistent reporting metrics hindering direct comparisons, the inadequacy of traditional classification metrics in capturing errors which LLMs may be more prone to (i.e., altering clinically relevant information), and lack of manual validation of automated metrics which aim to quantify these errors. To address these issues, we first present a survey of LLM-based de-identification research, highlighting the heterogeneity in reporting standards. Second, we evaluated a diverse set of models to quantify the extent of inappropriate removal of clinical information. Next, we conduct a manual validation of an existing evaluation metric to measure the removal of clinical information, employing clinical experts to assess their efficacy. We highlight poor performance and describe the inherent limitations of such metrics in identifying clinically significant changes. Lastly, we propose a novel methodology for the detection of clinically relevant information removal.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.14464

Country: North America > Canada > Alberta (0.14)

Genre:

Overview (1.00)
Research Report > New Finding (0.93)
Research Report > Experimental Study (0.67)

Industry:

Information Technology > Security & Privacy (0.93)
Health & Medicine > Therapeutic Area (0.67)
Health & Medicine > Health Care Technology > Medical Record (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)

Add feedback

Beyond De-Identification: A Structured Approach for Defining and Detecting Indirect Identifiers in Medical Texts

Baroud, Ibrahim, Raithel, Lisa, Möller, Sebastian, Roller, Roland

arXiv.org Artificial IntelligenceFeb-18-2025

Sharing sensitive texts for scientific purposes requires appropriate techniques to protect the privacy of patients and healthcare personnel. Anonymizing textual data is particularly challenging due to the presence of diverse unstructured direct and indirect identifiers. To mitigate the risk of re-identification, this work introduces a schema of nine categories of indirect identifiers designed to account for different potential adversaries, including acquaintances, family members and medical staff. Using this schema, we annotate 100 MIMIC-III discharge summaries and propose baseline models for identifying indirect identifiers. We will release the annotation guidelines, annotation spans (6,199 annotations in total) and the corresponding MIMIC-III document IDs to support further research in this area.

category, identifier, information, (15 more...)

arXiv.org Artificial Intelligence

2502.13342

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > Montserrat (0.04)
Europe > Spain (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.64)

Industry:

Law (1.00)
Health & Medicine > Health Care Providers & Services (0.94)
Information Technology > Security & Privacy (0.68)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

LLMs-in-the-Loop Part 2: Expert Small AI Models for Anonymization and De-identification of PHI Across Multiple Languages

Gunay, Murat, Keles, Bunyamin, Hizlan, Raife

arXiv.org Artificial IntelligenceDec-14-2024

The rise of chronic diseases and pandemics like COVID-19 has emphasized the need for effective patient data processing while ensuring privacy through anonymization and de-identification of protected health information (PHI). Anonymized data facilitates research without compromising patient confidentiality. This paper introduces expert small AI models developed using the LLM-in-the-loop methodology to meet the demand for domain-specific de-identification NER models. These models overcome the privacy risks associated with large language models (LLMs) used via APIs by eliminating the need to transmit or store sensitive data. More importantly, they consistently outperform LLMs in de-identification tasks, offering superior performance and reliability. Our de-identification NER models, developed in eight languages (English, German, Italian, French, Romanian, Turkish, Spanish, and Arabic) achieved f1-micro score averages of 0.966, 0.975, 0.976, 0.970, 0.964, 0.974, 0.978, and 0.953 respectively. These results establish them as the most accurate healthcare anonymization solutions, surpassing existing small models and even general-purpose LLMs such as GPT-4o. While Part-1 of this series introduced the LLM-in-the-loop methodology for bio-medical document translation, this second paper showcases its success in developing cost-effective expert small NER models in de-identification tasks. Our findings lay the groundwork for future healthcare AI innovations, including biomedical entity and relation extraction, demonstrating the value of specialized models for domain-specific challenges.

dataset, de-identification, information, (15 more...)

arXiv.org Artificial Intelligence

2412.10918

Country:

North America > United States (1.00)
Europe > United Kingdom > Wales (0.04)
Europe > United Kingdom > Scotland (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Government Relations & Public Policy (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.71)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Add feedback

DeIDClinic: A Multi-Layered Framework for De-identification of Clinical Free-text Data

Paul, Angel, Shaji, Dhivin, Han, Lifeng, Del-Pinto, Warren, Nenadic, Goran

arXiv.org Artificial IntelligenceOct-2-2024

De-identification is important in protecting patients' privacy for healthcare text analytics. The MASK framework is one of the best on the de-identification shared task organised by n2c2/i2b2 challenges. This work enhances the MASK framework by integrating ClinicalBERT, a deep learning model specifically fine-tuned on clinical texts, alongside traditional de-identification methods like dictionary lookup and rule-based approaches. The system effectively identifies and either redacts or replaces sensitive identifiable entities within clinical documents, while also allowing users to customise the masked documents according to their specific needs. The integration of ClinicalBERT significantly improves the performance of entity recognition, achieving 0.9732 F1-score, especially for common entities such as names, dates, and locations. A risk assessment feature has also been developed, which analyses the uniqueness of context within documents to classify them into risk levels, guiding further de-identification efforts. While the system demonstrates strong overall performance, this work highlights areas for future improvement, including handling more complex entity occurrences and enhancing the system's adaptability to different clinical settings.

clinical text, clinicalbert, dataset, (14 more...)

arXiv.org Artificial Intelligence

2410.01648

Country:

Europe > Netherlands > South Holland > Leiden (0.05)
North America > United States > Hawaii (0.04)
North America > United States > Massachusetts (0.04)
Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Health Care Technology > Medical Record (1.00)
Law (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

De-identification is not always enough

Sarkar, Atiquer Rahman, Chuang, Yao-Shun, Mohammed, Noman, Jiang, Xiaoqian

arXiv.org Artificial IntelligenceJan-31-2024

For sharing privacy-sensitive data, de-identification is commonly regarded as adequate for safeguarding privacy. Synthetic data is also being considered as a privacy-preserving alternative. Recent successes with numerical and tabular data generative models and the breakthroughs in large generative language models raise the question of whether synthetically generated clinical notes could be a viable alternative to real notes for research purposes. In this work, we demonstrated that (i) de-identification of real clinical notes does not protect records against a membership inference attack, (ii) proposed a novel approach to generate synthetic clinical notes using the current state-of-the-art large language models, (iii) evaluated the performance of the synthetically generated notes in a clinical domain task, and (iv) proposed a way to mount a membership inference attack where the target model is trained with synthetic data. We observed that when synthetically generated notes closely match the performance of real data, they also exhibit similar privacy concerns to the real data. Whether other approaches to synthetically generated clinical notes could offer better trade-offs and become a better alternative to sensitive real notes warrants further investigation.

classifier, clinical note, membership inference attack, (16 more...)

arXiv.org Artificial Intelligence

2402.00179

Country:

North America > Canada > Ontario (0.04)
North America > United States > Texas (0.04)
North America > Canada > Saskatchewan > Regina (0.04)
North America > Canada > Manitoba > Winnipeg Metropolitan Region > Winnipeg (0.04)

Genre:

Research Report > Promising Solution (0.88)
Research Report > New Finding (0.68)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Health Care Technology > Medical Record (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)

Add feedback

Beyond Accuracy: Automated De-Identification of Large Real-World Clinical Text Datasets

Kocaman, Veysel, Haq, Hasham Ul, Talby, David

arXiv.org Artificial IntelligenceDec-13-2023

Recent research advances achieve human-level accuracy for de-identifying free-text clinical notes on research datasets, but gaps remain in reproducing this in large real-world settings. This paper summarizes lessons learned from building a system used to de-identify over one billion real clinical notes, in a fully automated way, that was independently certified by multiple organizations for production use. A fully automated solution requires a very high level of accuracy that does not require manual review. A hybrid context-based model architecture is described, which outperforms a Named Entity Recogniton (NER) - only model by 10% on the i2b2-2014 benchmark. The proposed system makes 50%, 475%, and 575% fewer errors than the comparable AWS, Azure, and GCP services respectively while also outperforming ChatGPT by 33%. It exceeds 98% coverage of sensitive data across 7 European languages, without a need for fine tuning. A second set of described models enable data obfuscation -- replacing sensitive data with random surrogates -- while retaining name, date, gender, clinical, and format consistency. Both the practical need and the solution architecture that provides for reliable & linked anonymized documents are described.

clinical note, de-identification, pipeline, (15 more...)

arXiv.org Artificial Intelligence

2312.08495

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Europe > Spain > Galicia > Madrid (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Health Care Technology > Medical Record (1.00)
Information Technology (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

De-identification of clinical free text using natural language processing: A systematic review of current approaches

Kovačević, Aleksandar, Bašaragin, Bojana, Milošević, Nikola, Nenadić, Goran

arXiv.org Artificial IntelligenceNov-28-2023

Background: Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e. the process of removing PHI is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process. Objectives: Our study aims to provide systematic evidence on how the de-identification of clinical free text has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems. In addition, we aim to identify challenges and potential research opportunities in this field. Methods: A systematic search in PubMed, Web of Science and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance. Results: A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. Majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora.

de-identification, information, uthealth, (16 more...)

arXiv.org Artificial Intelligence

2312.03736

Country:

North America > United States > South Carolina > Charleston County > Charleston (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Missouri > Boone County > Columbia (0.14)
(44 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Health Care Technology > Medical Record (1.00)
Health & Medicine > Health Care Providers & Services (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)

Add feedback

rx-anon -- A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm

Singhofer, Fabian, Garifullina, Aygul, Kern, Mathias, Scherp, Ansgar

arXiv.org Artificial IntelligenceDec-6-2022

Traditional approaches for data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joined, privacy-preserved version of the heterogeneous data input. We introduce the concept of redundant sensitive information to consistently anonymize the heterogeneous data. To control the influence of anonymization over unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. The parameter $\lambda$ allows to give different weight on the relational and textual attributes during the anonymization process. We evaluate our approach with two real-world datasets using a Normalized Certainty Penalty score, adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach is capable of reducing information loss by using the tuning parameter to control the Mondrian partitioning while guaranteeing k-anonymity for relational attributes as well as for sensitive terms. As rx-anon is a framework approach, it can be reused and extended by other anonymization algorithms, privacy models, and textual similarity metrics.

information loss, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2105.08842

Country:

Europe > United Kingdom (0.14)
North America > Mexico (0.04)
North America > Canada > Ontario (0.04)
(4 more...)

Genre:

Research Report > New Finding (0.66)
Research Report > Promising Solution (0.50)
Overview > Innovation (0.41)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Government (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

An Easy-to-use and Robust Approach for the Differentially Private De-Identification of Clinical Textual Documents

Tchouka, Yakini, Couchot, Jean-François, Laiymani, David

arXiv.org Artificial IntelligenceNov-2-2022

Unstructured textual data is at the heart of healthcare systems. For obvious privacy reasons, these documents are not accessible to researchers as long as they contain personally identifiable information. One way to share this data while respecting the legislative framework (notably GDPR or HIPAA) is, within the medical structures, to de-identify it, i.e. to detect the personal information of a person through a Named Entity Recognition (NER) system and then replacing it to make it very difficult to associate the document with the person. The challenge is having reliable NER and substitution tools without compromising confidentiality and consistency in the document. Most of the conducted research focuses on English medical documents with coarse substitutions by not benefiting from advances in privacy. This paper shows how an efficient and differentially private de-identification approach can be achieved by strengthening the less robust de-identification method and by adapting state-of-the-art differentially private mechanisms for substitution purposes. The result is an approach for de-identifying clinical documents in French language, but also generalizable to other languages and whose robustness is mathematically proven.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2211.01147

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > France > Bourgogne-Franche-Comté > Doubs > Besançon (0.04)
North America > United States > Massachusetts (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback