
Data Harmonization


An AI Implementation Science Study to Improve Trustworthy Data in a Large Healthcare System

Marteau, Benoit L., Hornback, Andrew, Tan, Shaun Q., Lowson, Christian, Woloff, Jason, Wang, May D.

arXiv.org Artificial Intelligence

The rapid growth of Artificial Intelligence (AI) in healthcare has sparked interest in Trustworthy AI and AI Implementation Science, both of which are essential for accelerating clinical adoption. However, strict regulations, gaps between research and clinical settings, and challenges in evaluating AI systems continue to hinder real-world implementation. This study presents an AI implementation case study within Shriners Children's (SC), a large multisite pediatric system, showcasing the modernization of SC's Research Data Warehouse (RDW) to OMOP CDM v5.4 within a secure Microsoft Fabric environment. We introduce a Python-based data quality assessment tool compatible with SC's infrastructure, extending OHDSI's R/Java-based Data Quality Dashboard (DQD) and integrating Trustworthy AI principles using the METRIC framework. This extension enhances data quality evaluation by addressing informative missingness, redundancy, timeliness, and distributional consistency. We also compare systematic and case-specific AI implementation strategies for Craniofacial Microsomia (CFM) using the FHIR standard. Our contributions include a real-world evaluation of AI implementations, integration of Trustworthy AI principles into data quality assessment, and insights into hybrid implementation strategies that blend systematic infrastructure with use-case-driven approaches to advance AI in healthcare.
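The quality dimensions the abstract names (missingness, redundancy, timeliness) can be sketched as plain-Python checks over record batches. This is a minimal illustration of the kind of check involved, not the actual DQD or METRIC check definitions; the record fields and threshold are hypothetical.

```python
from datetime import date

# Hypothetical mini-batch of OMOP-style observation records; the field
# names are illustrative placeholders, not the real RDW schema.
records = [
    {"person_id": 1, "value": 7.2,  "obs_date": date(2023, 1, 5)},
    {"person_id": 2, "value": None, "obs_date": date(2023, 1, 6)},
    {"person_id": 1, "value": 7.2,  "obs_date": date(2023, 1, 5)},  # exact duplicate
]

def missingness(rows, field):
    """Fraction of rows where `field` is absent or None."""
    return sum(r.get(field) is None for r in rows) / len(rows)

def redundancy(rows):
    """Fraction of rows that exactly duplicate an earlier row."""
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        dupes += key in seen
        seen.add(key)
    return dupes / len(rows)

def timeliness(rows, as_of, max_age_days=365):
    """Fraction of rows recorded within `max_age_days` of `as_of`."""
    return sum((as_of - r["obs_date"]).days <= max_age_days for r in rows) / len(rows)
```

Real DQD-style checks additionally distinguish *informative* missingness (values absent for a clinically meaningful reason) from random gaps, which requires context beyond a simple null count.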


PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research

Gronsbell, Jessica, Panickan, Vidul Ayakulangara, Lin, Chris, Charlon, Thomas, Hong, Chuan, Zhou, Doudou, Wang, Linshanshan, Gao, Jianhui, Zhou, Shirley, Tian, Yuan, Shi, Yaqi, Gan, Ziming, Cai, Tianxi

arXiv.org Machine Learning

Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major challenges due to data heterogeneity, semantic differences, and privacy concerns. To address these challenges, we introduce $\textit{PEHRT}$, a standardized pipeline for efficient EHR data harmonization consisting of two core modules: (1) data pre-processing and (2) representation learning. PEHRT maps EHR data to standard coding systems and uses advanced machine learning to generate research-ready datasets without requiring individual-level data sharing. Our pipeline is also data model agnostic and designed for streamlined execution across institutions based on our extensive real-world experience. We provide a complete suite of open source software, accompanied by a user-friendly tutorial, and demonstrate the utility of PEHRT in a variety of tasks using data from diverse healthcare systems.
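The pre-processing module's core task, mapping site-local codes to a standard vocabulary, can be illustrated with a toy crosswalk. The local code names below are invented, the two LOINC codes are shown only as examples of a target coding system, and this sketch is not PEHRT's actual implementation.

```python
# Toy crosswalk from hypothetical site-local lab codes to LOINC.
LOCAL_TO_LOINC = {
    "GLU_SERUM": "2345-7",   # Glucose [Mass/volume] in Serum or Plasma
    "HBA1C":     "4548-4",   # Hemoglobin A1c/Hemoglobin.total in Blood
}

def harmonize_codes(rows):
    """Rewrite each record's local code to its standard code.

    Records that cannot be mapped are returned separately so each site
    can review them instead of silently dropping data.
    """
    mapped, unmapped = [], []
    for r in rows:
        loinc = LOCAL_TO_LOINC.get(r["local_code"])
        if loinc is None:
            unmapped.append(r)
        else:
            mapped.append({**r, "loinc": loinc})
    return mapped, unmapped
```

Keeping the unmapped residue explicit is what lets a multi-site pipeline stay data-model agnostic: each site only has to supply its own crosswalk, never its individual-level data.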


Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare

Kokash, Natallia, Wang, Lei, Gillespie, Thomas H., Belloum, Adam, Grosso, Paola, Quinney, Sara, Li, Lang, de Bono, Bernard

arXiv.org Artificial Intelligence

The rise of electronic health records (EHRs) has unlocked new opportunities for medical research, but privacy regulations and data heterogeneity remain key barriers to large-scale machine learning. Federated learning (FL) enables collaborative modeling without sharing raw data, yet faces challenges in harmonizing diverse clinical datasets. This paper presents a two-step data alignment strategy integrating ontologies and large language models (LLMs) to support secure, privacy-preserving FL in healthcare, demonstrating its effectiveness in a real-world project involving semantic mapping of EHR data.
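The two-step idea, ontology lookup first and an LLM fallback only for what the ontology cannot resolve, can be sketched as below. The ontology entries are toy examples and `llm_suggest` is a hypothetical placeholder for a real model call, not the paper's method.

```python
# Toy ontology: canonical concept label -> known synonyms/abbreviations.
ONTOLOGY = {
    "systolic blood pressure": {"sbp", "systolic bp", "sys_bp"},
    "heart rate": {"hr", "pulse"},
}

def llm_suggest(field):
    """Placeholder for a real LLM call proposing a concept label."""
    return None

def align_field(field):
    """Step 1: deterministic ontology match; step 2: LLM fallback."""
    name = field.lower().replace("_", " ").strip()
    for concept, synonyms in ONTOLOGY.items():
        if name == concept or name in synonyms or name.replace(" ", "_") in synonyms:
            return concept, "ontology"
    suggestion = llm_suggest(field)
    return suggestion, ("llm" if suggestion else "unresolved")
```

Ordering the steps this way keeps the LLM out of the loop for the common cases, which matters in a privacy-preserving federated setting where model calls may need auditing.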


A Natural Language Processing Approach to Support Biomedical Data Harmonization: Leveraging Large Language Models

Li, Zexu, Prabhu, Suraj P., Popp, Zachary T., Jain, Shubhi S., Balakundi, Vijetha, Ang, Ting Fang Alvin, Au, Rhoda, Chen, Jinying

arXiv.org Artificial Intelligence

Biomedical research requires large, diverse samples to produce unbiased results. Automated methods for matching variables across datasets can accelerate this process. Research in this area has been limited, primarily focusing on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLM) and ensemble learning, to automate variable matching. Methods: We utilized data from two GERAS cohort studies (Europe and Japan) to develop variable matching methods. We first manually created a dataset by matching 352 EU variables with 1322 candidate JP variables, where matched variable pairs were positive and unmatched pairs were negative instances. Using this dataset, we developed and evaluated two types of natural language processing (NLP) methods, which matched variables based on variable labels and definitions from data dictionaries: (1) LLM-based and (2) fuzzy matching. We then developed an ensemble-learning method, using the Random Forest (RF) model, to integrate the individual NLP methods. RF was trained and evaluated on 50 trials. Each trial had a random split (4:1) of training and test sets, with the model's hyperparameters optimized through cross-validation on the training set. For each EU variable, 1322 candidate JP variables were ranked based on NLP-derived similarity scores or RF's probability scores, denoting their likelihood to match the EU variable. Ranking performance was measured by top-n hit ratio (HR-n) and mean reciprocal rank (MRR). Results: E5 performed best among the individual methods, achieving 0.90 HR-30 and 0.70 MRR. RF performed better than E5 on all metrics over 50 trials (P < 0.001) and achieved an average HR-30 of 0.98 and MRR of 0.73. LLM-derived features contributed most to RF's performance. One major cause of errors in automatic variable matching was ambiguous variable definitions within data dictionaries.
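The two ranking metrics reported above, top-n hit ratio and mean reciprocal rank, are standard and can be computed as follows (a generic sketch over (ranked-candidates, gold-match) pairs, not the authors' code):

```python
def reciprocal_rank(ranked, gold):
    """1/rank of the first correct candidate; 0 if it never appears."""
    for i, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / i
    return 0.0

def hit_ratio(queries, n):
    """Fraction of queries whose gold match appears in the top n."""
    return sum(gold in ranked[:n] for ranked, gold in queries) / len(queries)

def mrr(queries):
    """Mean reciprocal rank over all queries."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in queries) / len(queries)
```

In the paper's setting each "query" is one EU variable and the ranked list is the 1322 JP candidates ordered by similarity or RF probability score.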


A Machine Learning Approach for Identifying Anatomical Biomarkers of Early Mild Cognitive Impairment

Ahmad, Alwani Liyana, Sanchez-Bornot, Jose, Sotero, Roberto C., Coyle, Damien, Idris, Zamzuri, Faye, Ibrahima

arXiv.org Artificial Intelligence

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that primarily affects the aging population by impairing cognitive and motor functions. Early detection of AD through accessible methodologies like magnetic resonance imaging (MRI) is vital for developing effective interventions to halt or slow the disease's progression. This study aims to perform a comprehensive analysis of machine learning techniques for selecting MRI-based biomarkers and classifying individuals into healthy controls (HC) and unstable controls (uHC) who later develop mild cognitive impairment within five years. The research utilizes MRI data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Open Access Series of Imaging Studies 3 (OASIS-3), focusing on both HC and uHC participants. The study addresses the challenges of imbalanced data by testing classification methods on balanced and unbalanced datasets, and harmonizes data using polynomial regression to mitigate nuisance variables like age, gender, and intracranial volume. Results indicate that Gaussian Naive Bayes and RusBoost classifiers show optimal performance, achieving accuracies of up to 76.46% and 72.48%, respectively, on the ADNI dataset. For the OASIS-3 dataset, Kernel Naive Bayes and RusBoost yield accuracies ranging from 64.66% to 75.71%, improving further in age-matched datasets. Brain regions like the entorhinal cortex, hippocampus, lateral ventricle, and lateral orbitofrontal cortex are identified as significantly impacted during early cognitive decline. Despite limitations such as small sample sizes, the study's harmonization approach enhances the robustness of biomarker selection, suggesting the potential of this semi-automatic machine learning pipeline for early AD detection using MRI.
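Polynomial-regression harmonization of the kind described, regressing out nuisance covariates and keeping residuals, can be sketched with NumPy. This is a single-covariate illustration under the assumption of one nuisance variable (age); the study's pipeline handles several covariates and is not reproduced here.

```python
import numpy as np

def residualize(feature, age, degree=2):
    """Remove the polynomial trend of a nuisance variable from a feature.

    Fits feature ~ poly(age, degree) by least squares, then returns the
    residuals as the harmonized feature values.
    """
    coeffs = np.polyfit(age, feature, degree)   # least-squares polynomial fit
    predicted = np.polyval(coeffs, age)         # nuisance-driven trend
    return feature - predicted                  # harmonized residuals
```

A common refinement, not shown here, is to fit the nuisance model on healthy controls only, so that disease-related signal is not regressed away along with the covariate effect.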


Federated Learning over Harmonized Data Silos

Stripelis, Dimitris, Ambite, Jose Luis

arXiv.org Artificial Intelligence

Federated Learning is a distributed machine learning approach that enables geographically distributed data silos to collaboratively learn a joint machine learning model without sharing data. Most of the existing work operates on unstructured data, such as images or text, or on structured data assumed to be consistent across the different sites. However, sites often have different schemata, data formats, data values, and access patterns. The field of data integration has developed many methods to address these challenges, including techniques for data exchange and query rewriting using declarative schema mappings, and for entity linkage. Therefore, we propose an architectural vision for an end-to-end Federated Learning and Integration system, incorporating the critical steps of data harmonization and data imputation, to spur further research at the intersection of data management, information systems, and machine learning.


AI, Health Insurance, And Data Harmonization: Interview With Shiv Misra, CVS Health

#artificialintelligence

Over the last decade, data and analytics have grown to be more than just a quantitative support function. Many organizations have traditionally used data to win customers and market share. However, they are now also leveraging data to re-design future products based on evolving customer needs and macro trends. While significant progress has been made in machine learning and artificial intelligence, there is one critical element to making it all work: having the right data. Business decisions built on flawed data can cause an organization significant revenue loss, increased expenses, compliance issues, possible legal issues, and even more severe ramifications.


Multi-Stage Prediction Networks for Data Harmonization

Blumberg, Stefano B., Palombo, Marco, Khoo, Can Son, Tax, Chantal M. W., Tanno, Ryutaro, Alexander, Daniel C.

arXiv.org Machine Learning

In this paper, we introduce multi-task learning (MTL) to data harmonization (DH), where we aim to harmonize images across different acquisition platforms and sites. This allows us to integrate information from multiple acquisitions and improve the predictive performance and learning efficiency of the harmonization model. Specifically, we introduce the Multi-Stage Prediction (MSP) Network, a MTL framework that incorporates neural networks of potentially disparate architectures, trained for different individual acquisition platforms, into a larger architecture that is refined in unison. The MSP utilizes high-level features of the single networks for individual tasks as inputs to additional neural networks that inform the final prediction, thereby exploiting redundancy across tasks to make the most of limited training data. We validate our methods on a dMRI harmonization challenge dataset, where we predict data for three modern platform types from data obtained on an older scanner. We show that MTL architectures such as the MSP produce around a 20% improvement in patch-based mean-squared error over current state-of-the-art methods, and that our MSP outperforms off-the-shelf MTL networks. Our code is available at https://github.com/sbb-gh/ .
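The MSP's central idea, reusing the high-level features of per-platform networks as inputs to a final prediction head, can be caricatured in a few lines of NumPy. The layer sizes and random weights below are toy placeholders; this is a shape-level sketch of the feature-sharing pattern, not the MSP architecture itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two per-platform "single networks": each maps an
# 8-dim input patch to a 16-dim high-level feature vector.
W_task1 = rng.standard_normal((8, 16))
W_task2 = rng.standard_normal((8, 16))

def single_net(x, W):
    """One task-specific network reduced to a single tanh layer."""
    return np.tanh(x @ W)

# Final head consumes the concatenated high-level features of both
# tasks, so the prediction exploits redundancy across acquisitions.
W_head = rng.standard_normal((32, 3))  # 3 target platform outputs

def msp_head(x):
    feats = np.concatenate([single_net(x, W_task1), single_net(x, W_task2)])
    return feats @ W_head

patch = rng.standard_normal(8)
print(msp_head(patch).shape)  # prints (3,)
```

In the real MSP the single networks are full per-platform architectures refined jointly with the head, rather than frozen random layers.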


FIBO, FIBO, It's Off to Work We Go

#artificialintelligence

W3C Standards Work: FIBO is expressed in OWL, the standard W3C semantic modeling language, which is natively supported by the Anzo Smart Data Lake. Loading FIBO into Anzo was a simple import function. FIBO Works: There was an excellent match between the FIBO model and the data sources (Front Arena and Dun & Bradstreet). Mapping & Loading Data Is Easy: The alignment between FIBO and the data sources made mapping fast and easy. Once mapped, data loading and transformation were automatic.