meds_reader: A fast and efficient EHR processing library

Steinberg, Ethan, Wornow, Michael, Bedi, Suhana, Fries, Jason Alan, McDermott, Matthew B. A., Shah, Nigam H.

arXiv.org Artificial Intelligence 

As machine learning (ML) matures in both healthcare and other fields, there is a growing need to process large datasets for model training, especially with the rise of data-hungry foundation models [Hoffmann et al., 2024]. To meet these data needs, sophisticated and efficient tooling [Google, PyTorch, HuggingFace] has been developed to help scale analysis to large datasets. However, many of these tools are difficult to use in research involving electronic health record (EHR) data because of its nested event stream data structure [McDermott et al., 2023]: a collection of subjects (also generally referred to as patients), where each subject contains a sequence of discrete time-stamped events with associated per-event data. Figure 1 illustrates this event stream structure for an example subject. Existing data processing tools, which are optimized for tabular data, images, or text, handle this structure poorly. As a result, healthcare ML researchers have been forced to build their own data processing pipelines [Yang et al., 2023, Gupta et al., 2022, Tang et al., 2020, McDermott et al., 2023] for handling EHR data, and these pipelines tend to be very inefficient in terms of memory, CPU, and disk usage. In this work, we address these inefficiencies by introducing meds_reader, an open-source Python package for building fast and efficient EHR ML processing pipelines. We demonstrate the benefits of meds_reader by using it to reimplement labeling and featurization within two existing EHR processing pipelines, achieving 10-100x improvements in CPU, memory, and disk usage.
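The nested event stream structure described above can be sketched in plain Python. This is an illustrative model only: the class names `Event` and `Subject`, their fields, and the helper `count_events_before` are hypothetical and are not taken from the meds_reader API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class Event:
    """One discrete, time-stamped event with its per-event data (hypothetical layout)."""
    time: datetime
    code: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Subject:
    """One subject (patient) holding a time-ordered sequence of events (hypothetical layout)."""
    subject_id: int
    events: List[Event] = field(default_factory=list)


def count_events_before(subject: Subject, cutoff: datetime) -> int:
    """Count the subject's events that occur strictly before `cutoff`."""
    return sum(1 for event in subject.events if event.time < cutoff)


# A toy subject with three events; the codes and values are made up for illustration.
subject = Subject(
    subject_id=1,
    events=[
        Event(datetime(2019, 3, 1), "visit/outpatient"),
        Event(datetime(2019, 3, 1), "lab/glucose", {"numeric_value": 5.4}),
        Event(datetime(2021, 7, 9), "diagnosis/example", {"text_value": "note"}),
    ],
)
```

Each subject owns a variable-length list of events, and each event carries its own set of properties. That nesting is what makes the data awkward to fit into a fixed-width table.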
