meds_reader: A fast and efficient EHR processing library

Ethan Steinberg, Michael Wornow, Suhana Bedi, Jason Alan Fries, Matthew B. A. McDermott, Nigam H. Shah

Sep-12-2024–arXiv.org Artificial Intelligence

As machine learning (ML) matures in healthcare and other fields, there is a growing need to process large datasets for model training, especially with the rise of data-hungry foundation models [Hoffmann et al., 2024]. To meet these data needs, sophisticated and efficient tooling [Google, PyTorch, HuggingFace] has been developed to help scale analysis to large datasets. However, many of these tools are difficult to use in research involving electronic health record (EHR) data because of its unique nested event stream data structure [McDermott et al., 2023]: a collection of subjects (also generally referred to as patients), where each subject contains a sequence of discrete time-stamped events with associated per-event data. Figure 1 illustrates this event stream structure for an example subject. Existing data processing tools, which are optimized for tabular data, images, or text, handle this structure poorly. As a result, healthcare ML researchers have been forced to build their own data processing pipelines [Yang et al., 2023, Gupta et al., 2022, Tang et al., 2020, McDermott et al., 2023] for handling EHR data, and these pipelines tend to be very inefficient in terms of memory, CPU, and disk usage. In this work, we address these inefficiencies by introducing meds_reader, an open-source Python package for building fast and efficient EHR ML processing pipelines. We demonstrate the benefits of meds_reader by using it to reimplement labeling and featurization within two existing EHR processing pipelines, achieving 10-100x improvements in CPU, memory, and disk usage.
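The nested event stream structure described in the abstract can be illustrated with a minimal Python sketch. This is not the meds_reader API: the `Event` and `Subject` classes, the `code` and `data` fields, the `events_before` helper, and the example values are all hypothetical and exist only to show subjects that each hold a sequence of time-stamped events with per-event data.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class Event:
    # One discrete, time-stamped event with its per-event data.
    time: datetime
    code: str  # hypothetical field: a label for what happened
    data: dict[str, Any] = field(default_factory=dict)  # hypothetical per-event payload


@dataclass
class Subject:
    # One subject (patient) holding a time-ordered sequence of events.
    subject_id: int
    events: list[Event] = field(default_factory=list)


def events_before(subject: Subject, cutoff: datetime) -> list[Event]:
    """Return the subject's events that occur strictly before `cutoff`."""
    return [e for e in subject.events if e.time < cutoff]


# Made-up example data: a collection containing a single subject.
subjects = [
    Subject(
        subject_id=1,
        events=[
            Event(datetime(2020, 1, 5, 9, 30), "visit/outpatient"),
            Event(datetime(2020, 1, 5, 10, 0), "lab/glucose", {"value": 5.4, "unit": "mmol/L"}),
            Event(datetime(2021, 3, 2, 14, 15), "visit/inpatient"),
        ],
    ),
]

# A typical labeling or featurization step walks one subject's events in time order.
history = events_before(subjects[0], datetime(2021, 1, 1))
codes = [e.code for e in history]
```

Because each subject carries a variable-length list of events, and each event may carry different fields, this shape does not map cleanly onto a fixed-width table, which is the mismatch with tabular tooling that the abstract points to.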


