Language Models Are An Effective Patient Representation Learning Technique For Electronic Health Record Data

Steinberg, Ethan, Jung, Ken, Fries, Jason A., Corbin, Conor K., Pfohl, Stephen R., Shah, Nigam H.

arXiv.org Machine Learning 

January 16, 2020

Abstract

Widespread adoption of electronic health records (EHRs) has fueled development of clinical outcome models using machine learning. However, patient EHR data are complex, and how to optimally represent them is an open question. This complexity and the often small training sets available for these clinical outcome models are two core challenges for training high-quality models. In this paper, we demonstrate that learning generic representations from the data of all the patients in the EHR enables better-performing prediction models for clinical outcomes, which helps overcome both challenges. We adapt common representation learning techniques used in other domains and find that representations inspired by language models enable a 3.5% mean improvement in AUROC on five clinical outcomes compared to standard baselines, with the average improvement rising to 19% when only a small number of patients are available for training a prediction model for a given clinical outcome.

1 Introduction

The widespread adoption of electronic health records (EHRs) has created opportunities for using machine learning to reduce healthcare costs and improve quality of care. EHR data have been used to learn prediction models for clinical outcomes such as mortality [1], sepsis [2], 30-day readmission [3] and others [4, 5]. However, the complexity of patient data poses many obstacles to its effective use. Patient records in EHRs are variable-length, high-dimensional and sparse, with complex temporal and hierarchical structure.
They comprise irregularly spaced visits spread across years, with each visit consisting of a subset of thousands of possible diagnosis, procedure, and medication codes as well as lab values and unstructured data such as text or images. In contrast, most off-the-shelf machine learning algorithms expect a fixed-length vector of features as input. Manually defining a transformation of patient records into such a representation beyond simple binned counts is time-consuming and outcome-dependent, leaving much of the temporal and hierarchical structure of EHRs underutilized when building machine learning models. The challenge of representing EHR data can be addressed by using neural networks to automatically learn how to featurize patient data while learning a model for a given clinical outcome (e.g., mortality or 30-day readmission) [4].
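
To make the "simple binned counts" baseline concrete, the following is a minimal sketch of how a variable-length record could be turned into a fixed-length vector by counting codes within look-back windows. It is an illustration only, not the featurization used in the paper: the record layout, the code names, the vocabulary, and the window boundaries are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical toy record: (event date, code) pairs. Code names are made up.
record = [
    (date(2019, 1, 5), "DX/diabetes"),
    (date(2019, 11, 20), "RX/metformin"),
    (date(2019, 12, 28), "DX/diabetes"),
    (date(2019, 12, 30), "PX/office_visit"),
]

# Assumed fixed vocabulary and look-back windows, in days before the
# prediction date: lower bound inclusive, upper bound exclusive.
VOCAB = ["DX/diabetes", "RX/metformin", "PX/office_visit"]
BINS = [(0, 30), (30, 365)]


def binned_counts(record, prediction_date, vocab=VOCAB, bins=BINS):
    """Count each code per time window; output length is len(bins) * len(vocab)."""
    counts = Counter()
    for event_date, code in record:
        age_days = (prediction_date - event_date).days
        if age_days < 0:
            continue  # ignore events recorded after the prediction date
        for bin_index, (low, high) in enumerate(bins):
            if low <= age_days < high:
                counts[(bin_index, code)] += 1
    # Flatten in a fixed order so every patient maps to the same layout.
    return [counts[(b, code)] for b in range(len(bins)) for code in vocab]


features = binned_counts(record, date(2020, 1, 1))
print(features)
```

Every patient maps to a vector of the same length regardless of how many events the record holds, which is what off-the-shelf learners need. The cost is visible in the sketch: the order of events within a window and any hierarchy among codes are discarded, and the windows themselves must be chosen by hand.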
