Dalianis, Hercules
Natural Language Processing for Electronic Health Records in Scandinavian Languages: Norwegian, Swedish, and Danish
Woldaregay, Ashenafi Zebene, Lund, Jørgen Aarmo, Ngo, Phuong Dinh, Tayefi, Mariyam, Burman, Joel, Hansen, Stine, Sillesen, Martin Hylleholt, Dalianis, Hercules, Jenssen, Robert, Lindsetmo, Rolf Ole, Mikalsen, Karl Øyvind
Background: Clinical natural language processing (NLP) refers to the use of computational methods for extracting, processing, and analyzing unstructured clinical text data, and holds a huge potential to transform healthcare in various clinical tasks. Objective: The study aims to perform a systematic review to comprehensively assess and analyze the state-of-the-art NLP methods for the mainland Scandinavian clinical text. Method: A literature search was conducted in various online databases including PubMed, ScienceDirect, Google Scholar, ACM digital library, and IEEE Xplore between December 2022 and February 2024. Further, relevant references to the included articles were also used to solidify our search. The final pool includes articles that conducted clinical NLP in the mainland Scandinavian languages and were published in English between 2010 and 2024. Results: Out of the 113 articles, 18% (n=21) focus on Norwegian clinical text, 64% (n=72) on Swedish, 10% (n=11) on Danish, and 8% (n=9) focus on more than one language. Generally, the review identified positive developments across the region despite some observable gaps and disparities between the languages. There are substantial disparities in the level of adoption of transformer-based models. In essential tasks such as de-identification, there is significantly less research activity focusing on Norwegian and Danish compared to Swedish text. Further, the review identified a low level of sharing resources such as data, experimentation code, pre-trained models, and rate of adaptation and transfer learning in the region. Conclusion: The review presented a comprehensive assessment of the state-of-the-art clinical NLP for electronic health records (EHR) text in mainland Scandinavian languages and highlighted the potential barriers and challenges that hinder the rapid advancement of the field in the region.
Data-Constrained Synthesis of Training Data for De-Identification
Vakili, Thomas, Henriksson, Aron, Dalianis, Hercules
Many sensitive domains -- such as the clinical domain -- lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study -- using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.
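The machine-annotation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regex-based tagger below is a hypothetical stand-in for the encoder-based NER model trained on the original data, the two example sentences stand in for LLM-generated clinical text, and the names `PII_PATTERNS`, `machine_annotate`, and `synthetic_corpus` are invented for illustration.

```python
import re
from typing import List, Tuple

# ASSUMPTION: these regexes are a toy stand-in for the NER model that the
# abstract says is trained on the original (real) data.
PII_PATTERNS = {
    "DATE": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "PHONE": re.compile(r"^\+?\d{7,12}$"),
}


def machine_annotate(text: str) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated token with a PII label or "O"."""
    tagged = []
    for token in text.split():
        label = "O"
        for tag, pattern in PII_PATTERNS.items():
            if pattern.match(token):
                # Only single-token entities are handled, so every hit is a B- tag.
                label = f"B-{tag}"
                break
        tagged.append((token, label))
    return tagged


# ASSUMPTION: hard-coded sentences standing in for text sampled from a
# domain-adapted LLM.
synthetic_texts = [
    "Patient admitted 2021-03-14 with chest pain",
    "Call relative at 0701234567 before discharge",
]

# The machine-annotated synthetic corpus that a downstream NER model
# would be trained on.
synthetic_corpus = [machine_annotate(t) for t in synthetic_texts]
```

In this sketch every label in the synthetic corpus comes from the annotating tagger, which mirrors the abstract's finding that the quality of the machine-annotating model largely determines the usefulness of the synthetic data.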
Artificial intelligence to improve clinical coding practice in Scandinavia: a crossover randomized controlled trial
Chomutare, Taridzo, Svenning, Therese Olsen, Hernández, Miguel Ángel Tejedor, Ngo, Phuong Dinh, Budrionis, Andrius, Markljung, Kaisa, Hind, Lill Irene, Torsvik, Torbjørn, Mikalsen, Karl Øyvind, Babic, Aleksandar, Dalianis, Hercules
International Statistical Classification of Diseases and Related Health Problems codes, tenth revision (ICD-10) [1] play an important role in healthcare. All hospitals in Scandinavia record their activity by summarizing patient encounters into ICD-10 codes. Clinical coding directly affects how health institutions function on a daily basis because they are partially reimbursed based on the codes they report. The same codes are used to measure both volume and quality of care, thereby providing an important foundation of knowledge for decision makers at all levels in the healthcare service. Clinical coding is a highly complex and challenging task that requires a deep understanding of both the medical terminology and intricate clinical documentation. Coders must accurately translate detailed patient records into standardized codes, navigating the inherently complex medical language, which makes this task prone to errors and inconsistencies.
Implementing a Nordic-Baltic Federated Health Data Network: a case report
Chomutare, Taridzo, Babic, Aleksandar, Peltonen, Laura-Maria, Elunurm, Silja, Lundberg, Peter, Jönsson, Arne, Eneling, Emma, Gerstenberger, Ciprian-Virgil, Siggaard, Troels, Kolde, Raivo, Jerdhaf, Oskar, Hansson, Martin, Makhlysheva, Alexandra, Muzny, Miroslav, Ylipää, Erik, Brunak, Søren, Dalianis, Hercules
Background: Centralized collection and processing of healthcare data across national borders pose significant challenges, including privacy concerns, data heterogeneity and legal barriers. To address some of these challenges, we formed an interdisciplinary consortium to develop a federated health data network, comprised of six institutions across five countries, to facilitate Nordic-Baltic cooperation on secondary use of health data. The objective of this report is to offer early insights into our experiences developing this network. Methods: We used a mixed-method approach, combining both experimental design and implementation science to evaluate the factors affecting the implementation of our network. Results: Technically, our experiments indicate that the network functions without significant performance degradation compared to centralized simulation. Conclusion: While use of interdisciplinary approaches holds a potential to solve challenges associated with establishing such collaborative networks, our findings turn the spotlight on the uncertain regulatory landscape playing catch up and the significant operational costs.
Is De-identification of Electronic Health Records Possible? OR Can We Use Health Record Corpora for Research?
Dalianis, Hercules (DSV/KTH-Stockholm University) | Nilsson, Gunnar (Department of Neurobiology, Care Sciences and Society, Center for Family and Community Medicine, Karolinska Institutet) | Velupillai, Sumithra (DSV/KTH-Stockholm University)
Today an immense volume of electronic health records (EHRs) is being produced. These health records contain abundant information, in the form of both structured and unstructured data. It is estimated that EHRs contain on average around 60 percent structured information, and 40 percent unstructured information that is mostly free text (Dalianis et al., 2009). A modern health record is very complex and contains a large and diverse amount of data, such as the patient’s chief complaints, diagnoses and treatment, and very often an epicrisis, or discharge letter, together with ICD-10 codes (ICD-10, 2009). Moreover, the health record also contains information about the patient’s gender, age, times of health care visits, medication, measurement values, general condition as well as social situation, drinking and eating habits. Much of this information is written in natural language. All this information in a health record is currently almost never re-used, in particular the parts that are written in free text. We believe that the information contained in EHR data sets is an invaluable source for the development and evaluation of a number of applications, useful both for research purposes and for health practitioners. For instance, text mining tools for finding new or hidden relations between diagnoses/treatments and social situation, age and gender could be very useful for epidemiological or medical researchers. Moreover, information concerning the health process over time, per patient, clinic or hospital, can be extracted and used for further research. Another application is the use of this data as input for simulation of the health process and for future health needs. Using such huge health record databases as corpora for the generation of generalized synonyms from specialized medical terminology constitutes another exciting application.
We can also foresee a text summarization system applied to an individual patient’s health record, but using knowledge from all text records and conveying the information in the health record at the right level to the specific patient. The data can also be used for developing methods where clinicians in their daily work get automatic assistance and proposals of ICD-10 codes for assigning symptoms or diagnoses, or for validating the already manually assigned ICD-10 codes.