PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research
Gronsbell, Jessica; Panickan, Vidul Ayakulangara; Lin, Chris; Charlon, Thomas; Hong, Chuan; Zhou, Doudou; Wang, Linshanshan; Gao, Jianhui; Zhou, Shirley; Tian, Yuan; Shi, Yaqi; Gan, Ziming; Cai, Tianxi
Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major challenges due to data heterogeneity, semantic differences, and privacy concerns. To address these challenges, we introduce $\textit{PEHRT}$, a standardized pipeline for efficient EHR data harmonization consisting of two core modules: (1) data pre-processing and (2) representation learning. PEHRT maps EHR data to standard coding systems and uses advanced machine learning to generate research-ready datasets without requiring individual-level data sharing. Informed by our extensive real-world experience, the pipeline is also data-model agnostic and designed for streamlined execution across institutions. We provide a complete suite of open-source software, accompanied by a user-friendly tutorial, and demonstrate the utility of PEHRT in a variety of tasks using data from diverse healthcare systems.
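As a rough illustration of the two ideas named in the abstract (mapping local codes to a standard coding system, and sharing only summary-level data rather than individual-level records), here is a minimal Python sketch. It is not the PEHRT software: the code map, the code names such as `STD:diabetes`, and the functions `harmonize`, `site_summary`, and `pool` are all hypothetical, and the per-code patient counts stand in for whatever summary statistics a real pipeline would exchange.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (patient_id, local_code)

# Hypothetical mapping from site-specific codes to a shared vocabulary.
# A real mapping would come from curated terminologies, not a literal dict.
CODE_MAP: Dict[str, str] = {
    "A:dx-017": "STD:diabetes",
    "A:rx-204": "STD:metformin",
    "B:D17": "STD:diabetes",
    "B:M9": "STD:metformin",
}


def harmonize(records: Iterable[Record]) -> List[Record]:
    """Map local codes to standard codes, dropping codes with no mapping."""
    return [(pid, CODE_MAP[code]) for pid, code in records if code in CODE_MAP]


def site_summary(records: Iterable[Record]) -> Counter:
    """Count distinct patients per standard code.

    Only these aggregate counts would leave the site; patient IDs stay local.
    """
    seen = set(harmonize(records))  # dedupe repeated (patient, code) pairs
    return Counter(code for _, code in seen)


def pool(summaries: Iterable[Counter]) -> Counter:
    """Add up per-site counts into one cross-site summary."""
    total: Counter = Counter()
    for summary in summaries:
        total.update(summary)
    return total


# Toy data for two hypothetical sites with different local coding schemes.
site_a = [("p1", "A:dx-017"), ("p1", "A:rx-204"),
          ("p2", "A:dx-017"), ("p2", "A:zz-999")]  # last code is unmapped
site_b = [("q1", "B:D17"), ("q1", "B:D17"), ("q2", "B:M9")]

pooled = pool([site_summary(site_a), site_summary(site_b)])
print(dict(pooled))
```

The design point is that `pool` only ever sees per-code counts, so the step that combines sites never touches patient identifiers.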
Sep-11-2025
- Country:
  - Asia > Middle East
    - Israel (0.04)
  - Europe > Netherlands (0.04)
  - North America > United States
    - Massachusetts > Suffolk County > Boston (0.04)
- Genre:
  - Research Report
    - Experimental Study (1.00)
    - New Finding (1.00)
- Industry:
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Machine Learning > Neural Networks
        - Deep Learning (0.47)
      - Natural Language
        - Large Language Model (0.94)
        - Text Processing (0.93)
      - Representation & Reasoning (1.00)
    - Biomedical Informatics (1.00)
    - Data Science
      - Data Mining (1.00)
      - Data Quality (0.90)
    - Software (1.00)