Shah, Nigam
TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records
Cui, Hejie, Unell, Alyssa, Chen, Bowen, Fries, Jason Alan, Alsentzer, Emily, Koyejo, Sanmi, Shah, Nigam
Large language models (LLMs) have emerged as promising tools for assisting in medical tasks, yet processing Electronic Health Records (EHRs) presents unique challenges due to their longitudinal nature. While LLMs' capabilities to perform medical tasks continue to improve, their ability to reason over temporal dependencies across multiple patient visits and time frames remains unexplored. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a framework that incorporates instruction-response pairs grounded to different parts of a patient's record as a critical dimension in both instruction evaluation and tuning for longitudinal clinical records. We develop TIMER-Bench, the first time-aware benchmark that evaluates temporal reasoning capabilities over longitudinal EHRs, as well as TIMER-Instruct, an instruction-tuning methodology for LLMs to learn reasoning over time.
Tasks such as chronic disease management, multi-visit care planning, and patient history synthesis require clinicians to understand complex relationships between different record entries and how past events influence current and future clinical decisions (Wornow et al., 2024). The cognitive demands of processing such lengthy documentation are significant. While biomedical LLMs have shown promising results on well-structured tasks like answering USMLE questions and medical knowledge retrieval (Singhal et al., 2023; Lu et al., 2024; Lucas et al., 2024), recent evaluations reveal their significant limitations in processing longitudinal patient information and in making clinical decisions over time (Hager et al., 2024; Bedi et al., 2024). The gap between isolated question-answering performance and temporal reasoning ability impacts the practical utility of LLMs in healthcare. While there is some prior work that has explored temporal understanding abilities of general LLMs (Wang & Zhao, 2024; Fatemi et al., 2024; Herel et al., 2024), how these capabilities scale to longer contexts remains understudied, particularly in healthcare where longitudinal reasoning is important.
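A minimal sketch of the central idea, pairing instructions and responses with the specific time-stamped span of the record that supports them, is given below. The binning scheme, prompt wording, and function names are illustrative assumptions rather than TIMER's actual implementation.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class EHREvent:
        timestamp: datetime
        text: str  # e.g., a note snippet or a coded event rendered as text

    def bucket_by_period(events, n_buckets=4):
        """Split a patient's timeline into roughly equal-sized temporal buckets (illustrative)."""
        events = sorted(events, key=lambda e: e.timestamp)
        size = max(1, len(events) // n_buckets)
        return [events[i:i + size] for i in range(0, len(events), size)]

    def make_grounded_pair(bucket, generator_llm):
        """Ask an LLM (hypothetical callable: prompt -> str) to write an instruction-response
        pair that is answerable only from this temporal slice of the record."""
        context = "\n".join(e.text for e in bucket)
        prompt = ("Given the following excerpt of a patient's record, write one question that "
                  "requires reasoning over these events, and its answer, citing the dates used.\n\n"
                  + context)
        return {"source_span": (bucket[0].timestamp, bucket[-1].timestamp),
                "pair": generator_llm(prompt)}

Keeping the source span alongside each pair is what makes it possible to analyze how evaluation and tuning data are distributed over a patient's timeline.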
VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records
Chung, Philip, Swaminathan, Akshay, Goodell, Alex J., Kim, Yeasul, Reincke, S. Momsen, Han, Lichy, Deverett, Ben, Sadeghi, Mohammad Amin, Ariss, Abdel-Badih, Ghanem, Marc, Seong, David, Lee, Andrew A., Coombes, Caitlin E., Bradshaw, Brad, Sufian, Mahir A., Hong, Hyo Jung, Nguyen, Teresa P., Rasouli, Mohammad R., Kamra, Komal, Burbridge, Mark A., McAvoy, James C., Saffary, Roya, Ma, Stephen P., Dash, Dev, Xie, James, Wang, Ellen Y., Schmiesing, Clifford A., Shah, Nigam, Aghaeepour, Nima
Methods to ensure factual accuracy of text generated by large language models (LLMs) in clinical medicine are lacking. VeriFact is an artificial intelligence system that combines retrieval-augmented generation and LLM-as-a-Judge to verify whether LLM-generated text is factually supported by a patient's medical history based on their electronic health record (EHR). To evaluate this system, we introduce VeriFact-BHC, a new dataset that decomposes Brief Hospital Course narratives from discharge summaries into a set of simple statements with clinician annotations for whether each statement is supported by the patient's EHR clinical notes. Whereas the highest agreement between clinicians was 88.5%, VeriFact achieves up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth, suggesting that VeriFact exceeds the average clinician's ability to fact-check text against a patient's medical record. VeriFact may accelerate the development of LLM-based EHR applications by removing current evaluation bottlenecks.
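At a high level, the system pairs retrieval over the patient's notes with an LLM judge applied per statement. The outline below is a sketch of that loop; the retriever, judge model, and label set are placeholders for illustration, not VeriFact's actual components.

    import numpy as np

    def verify_statements(statements, ehr_chunks, embed, judge_llm, top_k=5):
        """For each decomposed statement, retrieve candidate evidence from the EHR and ask an
        LLM judge whether the statement is supported. `embed` maps text -> vector and
        `judge_llm` maps prompt -> str; both are user-supplied placeholders."""
        chunk_vecs = np.array([embed(c) for c in ehr_chunks])
        verdicts = []
        for stmt in statements:
            q = np.array(embed(stmt))
            scores = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
            evidence = [ehr_chunks[i] for i in scores.argsort()[::-1][:top_k]]
            prompt = ("Evidence from the patient's EHR:\n" + "\n---\n".join(evidence) +
                      f"\n\nStatement: {stmt}\nAnswer 'Supported' or 'Not Supported'.")
            verdicts.append(judge_llm(prompt).strip())
        return verdicts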
Assessing the Limitations of Large Language Models in Clinical Fact Decomposition
Munnangi, Monica, Swaminathan, Akshay, Fries, Jason Alan, Jindal, Jenelle, Narayanan, Sanjana, Lopez, Ivan, Tu, Lucia, Chung, Philip, Omiye, Jesutofunmi A., Kashyap, Mehr, Shah, Nigam
Verifying factual claims is critical for using large language models (LLMs) in healthcare. Recent work has proposed fact decomposition, which uses LLMs to rewrite source text into concise sentences conveying a single piece of information, as an approach for fine-grained fact verification. Clinical documentation poses unique challenges for fact decomposition due to dense terminology and diverse note types. To explore these challenges, we present FactEHR, a dataset consisting of full document fact decompositions for 2,168 clinical notes spanning four types from three hospital systems. Our evaluation, including review by clinicians, highlights significant variability in the quality of fact decomposition for four commonly used LLMs, with some LLMs generating 2.6x more facts per sentence than others. The results underscore the need for better LLM capabilities to support factual verification in clinical text. To facilitate future research in this direction, we plan to release our code at \url{https://github.com/som-shahlab/factehr}.
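The decomposition step and the facts-per-sentence comparison reported above can be sketched as follows; the prompt wording and the naive sentence splitter are illustrative assumptions, not the FactEHR pipeline.

    import re

    def decompose_note(note_text, llm):
        """Ask an LLM (placeholder callable: prompt -> str) to rewrite a clinical note into
        atomic facts, one per line."""
        prompt = ("Rewrite the following clinical note as a list of simple, self-contained "
                  "statements, one fact per line:\n\n" + note_text)
        return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]

    def facts_per_sentence(note_text, facts):
        """Facts-per-sentence ratio used to compare how aggressively different LLMs
        decompose the same note (naive regex sentence splitter)."""
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", note_text.strip()) if s]
        return len(facts) / max(1, len(sentences))

Comparing this ratio across models is one way to surface the 2.6x variability in decomposition granularity noted in the abstract.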
A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability
Tan, Ting Fang, Elangovan, Kabilan, Ong, Jasmine, Shah, Nigam, Sung, Joseph, Wong, Tien Yin, Xue, Lan, Liu, Nan, Wang, Haibo, Kuo, Chang Fu, Chesterman, Simon, Yeong, Zee Kin, Ting, Daniel SW
A comprehensive qualitative evaluation framework for large language models (LLMs) in healthcare that expands beyond traditional accuracy and quantitative metrics is needed. We propose 5 key aspects for the evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models that are safe, reliable, trustworthy, and ethical for healthcare and clinical applications.
MOTOR: A Time-To-Event Foundation Model For Structured Medical Records
Steinberg, Ethan, Fries, Jason, Xu, Yizhe, Shah, Nigam
We present a self-supervised, time-to-event (TTE) foundation model called MOTOR (Many Outcome Time Oriented Representations) which is pretrained on timestamped sequences of events in electronic health records (EHR) and health insurance claims. TTE models are used for estimating the probability distribution of the time until a specific event occurs, which is an important task in medical settings. TTE models provide many advantages over classification using fixed time horizons, including naturally handling censored observations, but are challenging to train with limited labeled data. MOTOR addresses this challenge by pretraining on up to 55M patient records (9B clinical events). We evaluate MOTOR's transfer learning performance on 19 tasks, across 3 patient databases (a private EHR system, MIMIC-IV, and Merative claims data). Task-specific models adapted from MOTOR improve time-dependent C statistics by 4.6% over state-of-the-art, improve label efficiency by up to 95%, and are more robust to temporal distributional shifts. We further evaluate cross-site portability by adapting our MOTOR foundation model for six prediction tasks on the MIMIC-IV dataset, where it outperforms all baselines. MOTOR is the first foundation model for medical TTE predictions and we release a 143M parameter pretrained model for research use at [redacted URL].
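The advantage of a TTE objective over fixed-horizon classification, namely that censored observations contribute naturally to the likelihood, can be illustrated with a piecewise-constant-hazard formulation. The sketch below is a generic TTE objective for illustration, not MOTOR's exact pretraining head.

    import numpy as np

    def piecewise_exponential_nll(hazards, event_time, observed, bin_edges):
        """Negative log-likelihood of one (possibly censored) time-to-event observation under
        piecewise-constant hazards (illustrative generic TTE objective).

        hazards:    per-bin hazard rates predicted for this patient/outcome
        event_time: time of the event, or of censoring
        observed:   True if the event occurred, False if censored
        bin_edges:  increasing bin boundaries starting at 0, covering the horizon
        """
        hazards = np.asarray(hazards, dtype=float)
        bin_edges = np.asarray(bin_edges, dtype=float)
        # time at risk spent in each bin before event_time
        exposure = np.clip(np.minimum(event_time, bin_edges[1:]) - bin_edges[:-1], 0, None)
        nll = np.sum(hazards * exposure)            # cumulative hazard term (applies even if censored)
        if observed:
            k = min(np.searchsorted(bin_edges, event_time, side="right") - 1, len(hazards) - 1)
            nll -= np.log(hazards[k] + 1e-12)       # density term for the bin containing the event
        return nll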
A Multi-Center Study on the Adaptability of a Shared Foundation Model for Electronic Health Records
Guo, Lin Lawrence, Fries, Jason, Steinberg, Ethan, Fleming, Scott Lanyon, Morse, Keith, Aftandilian, Catherine, Posada, Jose, Shah, Nigam, Sung, Lillian
Foundation models hold promise for transforming AI in healthcare by providing modular components that are easily adaptable to downstream healthcare tasks, making AI development more scalable and cost-effective. Structured EHR foundation models, trained on coded medical records from millions of patients, demonstrated benefits including increased performance with fewer training labels, and improved robustness to distribution shifts. However, questions remain on the feasibility of sharing these models across different hospitals and their performance for local task adaptation. This multi-center study examined the adaptability of a recently released structured EHR foundation model ($FM_{SM}$), trained on longitudinal medical record data from 2.57M Stanford Medicine patients. Experiments were conducted using EHR data at The Hospital for Sick Children and MIMIC-IV. We assessed both adaptability via continued pretraining on local data, and task adaptability compared to baselines of training models from scratch at each site, including a local foundation model. We evaluated the performance of these models on 8 clinical prediction tasks. In both datasets, adapting the off-the-shelf $FM_{SM}$ matched the performance of GBM models locally trained on all data while providing a 13% improvement in settings with few task-specific training labels. With continued pretraining on local data, label efficiency substantially improved, such that $FM_{SM}$ required fewer than 1% of training examples to match the fully trained GBM's performance. Continued pretraining was also 60 to 90% more sample-efficient than training local foundation models from scratch. Our findings show that adapting shared EHR foundation models across hospitals provides improved prediction performance at less cost, underscoring the utility of base foundation models as modular components to streamline the development of healthcare AI.
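The label-efficiency comparison described above amounts to evaluating each adaptation strategy at increasing label budgets. A minimal sketch, with adapt_fn and evaluate_fn as placeholders for a task-adaptation routine (e.g., fitting a task head on the foundation model's representations) and an evaluation metric, is:

    def label_efficiency_curve(adapt_fn, evaluate_fn, train_pool, test_set,
                               budgets=(100, 1000, 10000)):
        """Compare adaptation strategies (off-the-shelf FM, continued-pretrained FM, locally
        trained baseline) at increasing label budgets; arguments are illustrative placeholders."""
        return {n: evaluate_fn(adapt_fn(train_pool[:n]), test_set) for n in budgets}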
Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model System for Answering Medical Questions using Scientific Literature
Lozano, Alejandro, Fleming, Scott L, Chiang, Chia-Chun, Shah, Nigam
The quickly expanding nature of published medical literature makes it challenging for clinicians and researchers to keep up with and summarize recent, relevant findings in a timely manner. While several closed-source summarization tools based on large language models (LLMs) now exist, rigorous and systematic evaluations of their outputs are lacking. Furthermore, there is a paucity of high-quality datasets and appropriate benchmark tasks with which to evaluate these tools. We address these issues with four contributions: we release Clinfo.ai, an open-source WebApp that answers clinical questions based on dynamically retrieved scientific literature; we specify an information retrieval and abstractive summarization task to evaluate the performance of such retrieval-augmented LLM systems; we release a dataset of 200 questions and corresponding answers derived from published systematic reviews, which we name PubMed Retrieval and Synthesis (PubMedRS-200); and we report benchmark results for Clinfo.ai and other publicly available OpenQA systems on PubMedRS-200.
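The retrieve-then-summarize pattern such a system implements can be sketched as follows; search_fn and summarize_llm stand in for a literature-search backend and an LLM, and are assumptions for illustration rather than Clinfo.ai's actual interfaces.

    def answer_clinical_question(question, search_fn, summarize_llm, top_k=8):
        """Retrieve recent abstracts relevant to the question, then have an LLM synthesize an
        answer that cites them. `search_fn(question)` is assumed to return dicts with
        'pmid' and 'abstract' keys; both callables are user-supplied placeholders."""
        articles = search_fn(question)[:top_k]
        context = "\n\n".join(f"[PMID {a['pmid']}] {a['abstract']}" for a in articles)
        prompt = ("Using only the abstracts below, answer the clinical question and cite PMIDs.\n\n"
                  f"Question: {question}\n\nAbstracts:\n{context}")
        return summarize_llm(prompt)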
Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects
Yadlowsky, Steve, Fleming, Scott, Shah, Nigam, Brunskill, Emma, Wager, Stefan
There are a number of available methods that can be used for choosing whom to prioritize for treatment, including ones based on treatment effect estimation, risk scoring, and hand-crafted rules. We propose rank-weighted average treatment effect (RATE) metrics as a simple and general family of metrics for comparing treatment prioritization rules on a level playing field. RATEs are agnostic as to how the prioritization rules were derived, and only assess them based on how well they succeed in identifying units that benefit the most from treatment. We define a family of RATE estimators and prove a central limit theorem that enables asymptotically exact inference in a wide variety of randomized and observational study settings. We provide justification for the use of bootstrapped confidence intervals and a framework for testing hypotheses about heterogeneity in treatment effectiveness correlated with the prioritization rule. Our definition of the RATE nests a number of existing metrics, including the Qini coefficient, and our analysis directly yields inference methods for these metrics. We demonstrate our approach in examples drawn from both personalized medicine and marketing. In the medical setting, using data from the SPRINT and ACCORD-BP randomized control trials, we find no significant evidence of heterogeneous treatment effects. On the other hand, in a large marketing trial, we find robust evidence of heterogeneity in the treatment effects of some digital advertising campaigns and demonstrate how RATEs can be used to compare targeting rules that prioritize estimated risk vs. those that prioritize estimated treatment benefit.
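Under our reading of the construction (notation here is ours and may differ in detail from the paper's), a prioritization rule $S$ induces a targeting operator characteristic (TOC) curve, and a RATE is a weighted average of that curve:

\[
\mathrm{TOC}(q;\,S) \;=\; \mathbb{E}\bigl[\,Y_i(1)-Y_i(0)\;\big|\;S(X_i) \ge F_S^{-1}(1-q)\,\bigr] \;-\; \mathbb{E}\bigl[\,Y_i(1)-Y_i(0)\,\bigr],
\qquad
\theta_\alpha(S) \;=\; \int_0^1 \alpha(q)\,\mathrm{TOC}(q;\,S)\,dq,
\]

where $q$ is the fraction of units treated first under $S$ and $\alpha$ is a weighting function; for example, $\alpha(q)=q$ recovers a Qini-type coefficient, which is how existing metrics are nested within the family.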
General-purpose validation and model selection when estimating individual treatment effects
Schuler, Alejandro, Shah, Nigam
Practitioners in medicine, business, political science, and other fields are increasingly aware that decisions should be personalized to each patient, customer, or voter. A given treatment (e.g. a drug or advertisement) should be administered only to those who will respond most positively, and certainly not to those who will be harmed by it. Individual-level treatment effects (ITEs) can be estimated with tools adapted from machine learning, but different models can yield contradictory estimates. Unlike risk prediction models, however, treatment effect models cannot be easily evaluated against each other using a held-out test set because the true treatment effect itself is never directly observed. Besides outcome prediction accuracy, several approaches that use held-out data to evaluate treatment effect models have been proposed, but they are largely unknown or cloistered within disciplines. We present a review of these approaches and demonstrate theoretical relationships among them. We demonstrate their behavior using simulations of both randomized and observational data. Based on our empirical and theoretical results, we advocate for the standardized use of estimated decision value for individual treatment effect model selection and validation.
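The estimated decision value criterion has a simple held-out form. On a randomized test set with known treatment probabilities, the value of the policy implied by an ITE model (treat when the predicted effect is beneficial) can be estimated by inverse-probability weighting; the sketch below is one standard estimator of that value, not necessarily the exact variant used in the paper.

    import numpy as np

    def estimated_decision_value(tau_hat, y, t, p_treat):
        """IPW estimate of the value of the policy 'treat iff predicted effect > 0' on
        held-out randomized data (higher outcomes assumed better).

        tau_hat: predicted individual treatment effects on the test set
        y:       observed outcomes
        t:       observed treatment indicators (0/1)
        p_treat: treatment assignment probability (scalar or per-unit array)
        """
        tau_hat, y, t = map(np.asarray, (tau_hat, y, t))
        p = np.asarray(p_treat, dtype=float)
        policy = (tau_hat > 0).astype(int)
        propensity = np.where(policy == 1, p, 1.0 - p)   # probability of observing the policy's action
        match = (t == policy).astype(float)
        return np.mean(match * y / propensity)

Models can then be ranked by this estimated value, which directly targets the quality of the decisions they would induce rather than the accuracy of their outcome predictions.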
Synth-Validation: Selecting the Best Causal Inference Method for a Given Dataset
Schuler, Alejandro, Jung, Ken, Tibshirani, Robert, Hastie, Trevor, Shah, Nigam
Many decisions in healthcare, business, and other policy domains are made without the support of rigorous evidence due to the cost and complexity of performing randomized experiments. Using observational data to answer causal questions is risky: subjects who receive different treatments also differ in other ways that affect outcomes. Many causal inference methods have been developed to mitigate these biases. However, there is no way to know which method might produce the best estimate of a treatment effect in a given study. In analogy to cross-validation, which estimates the prediction error of predictive models applied to a given dataset, we propose synth-validation, a procedure that estimates the estimation error of causal inference methods applied to a given dataset. In synth-validation, we use the observed data to estimate generative distributions with known treatment effects. We apply each causal inference method to datasets sampled from these distributions and compare the effect estimates with the known effects to estimate error. Using simulations, we show that using synth-validation to select a causal inference method for each study lowers the expected estimation error relative to consistently using any single method.
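The procedure lends itself to a short summary in code; the generative-model and estimator interfaces below are placeholders for whatever a particular analysis uses, and the squared-error criterion is one reasonable choice rather than a prescribed one.

    def synth_validation(observed_data, fit_generators, candidate_methods, n_draws=50):
        """Select the causal-inference method with the lowest estimated error on synthetic
        datasets that mimic the observed data but have known treatment effects.

        fit_generators(observed_data) -> iterable of (sample_fn, true_effect) pairs
        candidate_methods: dict mapping method name -> estimator, where estimator(dataset)
        returns a treatment-effect estimate. All interfaces are illustrative placeholders.
        """
        errors = {name: [] for name in candidate_methods}
        for sample_fn, true_effect in fit_generators(observed_data):
            for _ in range(n_draws):
                synthetic = sample_fn()
                for name, estimate in candidate_methods.items():
                    errors[name].append((estimate(synthetic) - true_effect) ** 2)
        # choose the method with the smallest mean squared estimation error
        return min(errors, key=lambda name: sum(errors[name]) / len(errors[name]))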