
Collaborating Authors

 Do, Thanh Son


Efficient Standardization of Clinical Notes using Large Language Models

arXiv.org Artificial Intelligence

Clinician notes are a rich source of patient information but often contain inconsistencies due to varied writing styles, colloquialisms, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder the extraction of meaningful data from electronic health records (EHRs), posing challenges for quality improvement, population health, precision medicine, decision support, and research. We present a large language model approach to standardizing a corpus of 1,618 clinical notes. Per note, standardization corrected an average of 4.9 ± 1.8 grammatical errors and 3.3 ± 5.2 spelling errors, converted 3.1 ± 3.0 non-standard terms to standard terminology, and expanded 15.8 ± 9.1 abbreviations and acronyms. Additionally, notes were re-organized into canonical sections with standardized headings. This process prepared notes for key concept extraction, mapping to medical ontologies, and conversion to interoperable data formats such as FHIR. Expert review of randomly sampled notes found no significant data loss after standardization. This proof-of-concept study demonstrates that standardizing clinical notes can improve their readability, consistency, and usability, while also facilitating their conversion into interoperable data formats.
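
The abstract does not give the actual prompt or pipeline, but the described transformation maps naturally onto a single instructed LLM call. Below is a minimal sketch, assuming an OpenAI-style chat API; the model name, system prompt, and section headings are illustrative assumptions, not the authors' method.

```python
# Minimal sketch of LLM-based note standardization (illustrative, not the
# authors' exact pipeline). Assumes the OpenAI Python client and an API key
# in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Hypothetical instruction covering the four correction types reported in
# the abstract plus re-sectioning; the study's actual prompt is not shown.
SYSTEM_PROMPT = (
    "You are a clinical documentation assistant. Rewrite the note so that: "
    "(1) grammatical and spelling errors are corrected, "
    "(2) colloquial or non-standard terms are replaced with standard "
    "medical terminology, "
    "(3) all abbreviations and acronyms are expanded, and "
    "(4) content is re-organized under canonical section headings "
    "(e.g., History of Present Illness, Medications, Assessment, Plan). "
    "Do not add, remove, or alter any clinical facts."
)

def standardize_note(note_text: str, model: str = "gpt-4o") -> str:
    """Return a standardized version of a raw clinical note."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output aids reproducibility
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": note_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    raw = "pt c/o SOB x3 days, hx of CHF, meds unchagned."
    print(standardize_note(raw))
```

The "do not alter clinical facts" constraint matters because, as the study notes, standardized output must be checked against the source for data loss.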


When Less Is Not More: Large Language Models Normalize Less-Frequent Terms with Lower Accuracy

arXiv.org Artificial Intelligence

Term normalization is the process of mapping a term from free text to a standardized concept and its machine-readable code in an ontology. Accurate normalization of terms that capture phenotypic differences between patients and diseases is critical to the success of precision medicine initiatives. A large language model (LLM), such as GPT-4o, can normalize terms to the Human Phenotype Ontology (HPO), but it may retrieve incorrect HPO IDs. Reported accuracy rates for LLMs on these tasks may be inflated due to imbalanced test datasets skewed towards high-frequency terms. In our study, using a comprehensive dataset of 268,776 phenotype annotations for 12,655 diseases from the HPO, GPT-4o achieved an accuracy of 13.1% in normalizing 11,225 unique terms. However, the accuracy was unevenly distributed, with higher-frequency and shorter terms normalized more accurately than lower-frequency and longer terms. Feature importance analysis, using SHAP and permutation methods, identified low term frequency as the most significant predictor of normalization errors. These findings suggest that training and evaluation datasets for LLM-based term normalization should balance low- and high-frequency terms to improve model performance, particularly for infrequent terms critical to precision medicine.
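
The feature importance analysis mentioned above can be reproduced in spirit with scikit-learn's permutation importance: fit a classifier that predicts whether a term was normalized incorrectly from term-level features, then measure how much shuffling each feature hurts accuracy. This is a sketch on synthetic data; the feature names and the simulated error pattern are assumptions standing in for the study's 268,776 real annotations.

```python
# Illustrative permutation-importance analysis: which term features best
# predict normalization errors? Synthetic data mirrors the reported trend
# (rarer, longer terms are more error-prone); it is not the study's data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Hypothetical term-level features.
freq = rng.lognormal(mean=2.0, sigma=1.5, size=n)   # corpus frequency
length = rng.integers(5, 60, size=n)                # characters
tokens = rng.integers(1, 8, size=n)                 # word count

# Simulated error probability: decreases with frequency, increases with length.
p_error = 1 / (1 + np.exp(0.8 * np.log1p(freq) - 0.03 * length - 0.2 * tokens))
error = rng.random(n) < p_error

X = pd.DataFrame(
    {"term_frequency": freq, "term_length": length, "token_count": tokens}
)
X_tr, X_te, y_tr, y_te = train_test_split(X, error, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permutation importance: mean drop in held-out accuracy when one feature
# is shuffled, breaking its relationship with the error label.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With features constructed this way, term frequency dominates the ranking, which is the shape of result the abstract reports (SHAP values would give a complementary per-sample view).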


A Simplified Retriever to Improve Accuracy of Phenotype Normalizations by Large Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances LLM accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM), we demonstrate that the normalization accuracy of a state-of-the-art LLM increases from a baseline of 62.3% without augmentation to 90.3% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
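
The retriever idea is straightforward to sketch: embed the free-text term and the HPO term labels with BioBERT, rank labels by cosine similarity, and hand the top candidates to the LLM to choose from. The snippet below is a minimal sketch using the public `dmis-lab/biobert-base-cased-v1.1` checkpoint; the tiny in-memory HPO sample and the mean-pooling choice are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a BioBERT-based candidate retriever for HPO normalization.
# Requires: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "dmis-lab/biobert-base-cased-v1.1"  # a public BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pooled contextual embeddings for a batch of strings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # (B, H)

# Tiny stand-in for the full HPO label set (~17k terms in practice).
hpo = {
    "HP:0001250": "Seizure",
    "HP:0001631": "Atrial septal defect",
    "HP:0000252": "Microcephaly",
    "HP:0001263": "Global developmental delay",
}
ids = list(hpo.keys())
labels = list(hpo.values())
label_vecs = torch.nn.functional.normalize(embed(labels), dim=1)

def candidates(term: str, k: int = 3) -> list[tuple[str, str, float]]:
    """Top-k (HPO ID, label, cosine similarity) candidates for a term."""
    q = torch.nn.functional.normalize(embed([term]), dim=1)
    sims = (q @ label_vecs.T).squeeze(0)
    top = sims.topk(min(k, len(labels)))
    return [(ids[i], labels[i], float(sims[i])) for i in top.indices.tolist()]

# The returned candidates would then be inserted into the LLM prompt so it
# selects an ID from a short list rather than recalling one from memory.
print(candidates("fits and convulsions"))
```

Constraining the LLM to choose among retrieved candidates, rather than generating an HPO ID unaided, is what drives the reported jump from 62.3% to 90.3% accuracy, and it avoids the definition-generation step required by more complex retrievers.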