Shah, Nigam
TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records
Cui, Hejie, Unell, Alyssa, Chen, Bowen, Fries, Jason Alan, Alsentzer, Emily, Koyejo, Sanmi, Shah, Nigam
Large language models (LLMs) have emerged as promising tools for assisting in medical tasks, yet processing Electronic Health Records (EHRs) presents unique challenges due to their longitudinal nature. While LLMs' capabilities to perform medical tasks continue to improve, their ability to reason over temporal dependencies across multiple patient visits and time frames remains unexplored. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a framework that incorporates instruction-response pairs grounded to different parts of a patient's record as a critical dimension in both instruction evaluation and tuning for longitudinal clinical records. We develop TIMER-Bench, the first time-aware benchmark that evaluates temporal reasoning capabilities over longitudinal EHRs, as well as TIMER-Instruct, an instruction-tuning methodology for LLMs to learn reasoning over time.
Tasks such as chronic disease management, multi-visit care planning, and patient history synthesis require clinicians to understand complex relationships between different record entries and how past events influence current and future clinical decisions (Wornow et al., 2024). The cognitive demands of processing such lengthy documentation are significant. While biomedical LLMs have shown promising results on well-structured tasks like answering USMLE questions and medical knowledge retrieval (Singhal et al., 2023; Lu et al., 2024; Lucas et al., 2024), recent evaluations reveal their significant limitations in processing longitudinal patient information and in making clinical decisions over time (Hager et al., 2024; Bedi et al., 2024). The gap between isolated question-answering performance and temporal reasoning ability impacts the practical utility of LLMs in healthcare. While there is some prior work that has explored temporal understanding abilities of general LLMs (Wang & Zhao, 2024; Fatemi et al., 2024; Herel et al., 2024), how these capabilities scale to longer contexts remains understudied, particularly in healthcare where longitudinal reasoning is important.
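A minimal sketch of the central idea, pairing instructions and responses with the specific time-stamped span of the record that supports them, is given below. The binning scheme, prompt wording, and function names are illustrative assumptions rather than TIMER's actual implementation.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class EHREvent:
        timestamp: datetime
        text: str  # e.g., a note snippet or a coded event rendered as text

    def bucket_by_period(events, n_buckets=4):
        """Split a patient's timeline into roughly equal-sized temporal buckets (illustrative)."""
        events = sorted(events, key=lambda e: e.timestamp)
        size = max(1, len(events) // n_buckets)
        return [events[i:i + size] for i in range(0, len(events), size)]

    def make_grounded_pair(bucket, generator_llm):
        """Ask an LLM (hypothetical callable: prompt -> str) to write an instruction-response
        pair that is answerable only from this temporal slice of the record."""
        context = "\n".join(e.text for e in bucket)
        prompt = ("Given the following excerpt of a patient's record, write one question that "
                  "requires reasoning over these events, and its answer, citing the dates used.\n\n"
                  + context)
        return {"source_span": (bucket[0].timestamp, bucket[-1].timestamp),
                "pair": generator_llm(prompt)}

Keeping the source span alongside each pair is what makes it possible to analyze how evaluation and tuning data are distributed over a patient's timeline.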
VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records
Chung, Philip, Swaminathan, Akshay, Goodell, Alex J., Kim, Yeasul, Reincke, S. Momsen, Han, Lichy, Deverett, Ben, Sadeghi, Mohammad Amin, Ariss, Abdel-Badih, Ghanem, Marc, Seong, David, Lee, Andrew A., Coombes, Caitlin E., Bradshaw, Brad, Sufian, Mahir A., Hong, Hyo Jung, Nguyen, Teresa P., Rasouli, Mohammad R., Kamra, Komal, Burbridge, Mark A., McAvoy, James C., Saffary, Roya, Ma, Stephen P., Dash, Dev, Xie, James, Wang, Ellen Y., Schmiesing, Clifford A., Shah, Nigam, Aghaeepour, Nima
Methods to ensure factual accuracy of text generated by large language models (LLMs) in clinical medicine are lacking. VeriFact is an artificial intelligence system that combines retrieval-augmented generation and LLM-as-a-Judge to verify whether LLM-generated text is factually supported by a patient's medical history based on their electronic health record (EHR). To evaluate this system, we introduce VeriFact-BHC, a new dataset that decomposes Brief Hospital Course narratives from discharge summaries into a set of simple statements with clinician annotations for whether each statement is supported by the patient's EHR clinical notes. Whereas the highest agreement between clinicians was 88.5%, VeriFact achieves up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth, suggesting that VeriFact exceeds the average clinician's ability to fact-check text against a patient's medical record. VeriFact may accelerate the development of LLM-based EHR applications by removing current evaluation bottlenecks.
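At a high level, the system pairs retrieval over the patient's notes with an LLM judge applied per statement. The outline below is a sketch of that loop; the retriever, judge model, and label set are placeholders for illustration, not VeriFact's actual components.

    import numpy as np

    def verify_statements(statements, ehr_chunks, embed, judge_llm, top_k=5):
        """For each decomposed statement, retrieve candidate evidence from the EHR and ask an
        LLM judge whether the statement is supported. `embed` maps text -> vector and
        `judge_llm` maps prompt -> str; both are user-supplied placeholders."""
        chunk_vecs = np.array([embed(c) for c in ehr_chunks])
        verdicts = []
        for stmt in statements:
            q = np.array(embed(stmt))
            scores = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
            evidence = [ehr_chunks[i] for i in scores.argsort()[::-1][:top_k]]
            prompt = ("Evidence from the patient's EHR:\n" + "\n---\n".join(evidence) +
                      f"\n\nStatement: {stmt}\nAnswer 'Supported' or 'Not Supported'.")
            verdicts.append(judge_llm(prompt).strip())
        return verdicts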
Assessing the Limitations of Large Language Models in Clinical Fact Decomposition
Munnangi, Monica, Swaminathan, Akshay, Fries, Jason Alan, Jindal, Jenelle, Narayanan, Sanjana, Lopez, Ivan, Tu, Lucia, Chung, Philip, Omiye, Jesutofunmi A., Kashyap, Mehr, Shah, Nigam
Verifying factual claims is critical for using large language models (LLMs) in healthcare. Recent work has proposed fact decomposition, which uses LLMs to rewrite source text into concise sentences conveying a single piece of information, as an approach for fine-grained fact verification. Clinical documentation poses unique challenges for fact decomposition due to dense terminology and diverse note types. To explore these challenges, we present FactEHR, a dataset consisting of full document fact decompositions for 2,168 clinical notes spanning four types from three hospital systems. Our evaluation, including review by clinicians, highlights significant variability in the quality of fact decomposition for four commonly used LLMs, with some LLMs generating 2.6x more facts per sentence than others. The results underscore the need for better LLM capabilities to support factual verification in clinical text. To facilitate future research in this direction, we plan to release our code at \url{https://github.com/som-shahlab/factehr}.
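The decomposition step and the facts-per-sentence comparison reported above can be sketched as follows; the prompt wording and the naive sentence splitter are illustrative assumptions, not the FactEHR pipeline.

    import re

    def decompose_note(note_text, llm):
        """Ask an LLM (placeholder callable: prompt -> str) to rewrite a clinical note into
        atomic facts, one per line."""
        prompt = ("Rewrite the following clinical note as a list of simple, self-contained "
                  "statements, one fact per line:\n\n" + note_text)
        return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]

    def facts_per_sentence(note_text, facts):
        """Facts-per-sentence ratio used to compare how aggressively different LLMs
        decompose the same note (naive regex sentence splitter)."""
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", note_text.strip()) if s]
        return len(facts) / max(1, len(sentences))

Comparing this ratio across models is one way to surface the 2.6x variability in decomposition granularity noted in the abstract.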
A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability
Tan, Ting Fang, Elangovan, Kabilan, Ong, Jasmine, Shah, Nigam, Sung, Joseph, Wong, Tien Yin, Xue, Lan, Liu, Nan, Wang, Haibo, Kuo, Chang Fu, Chesterman, Simon, Yeong, Zee Kin, Ting, Daniel SW
A comprehensive qualitative evaluation framework for large language models (LLMs) in healthcare that expands beyond traditional accuracy and quantitative metrics is needed. We propose 5 key aspects for the evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models that are safe, reliable, trustworthy, and ethical for healthcare and clinical applications.
MOTOR: A Time-To-Event Foundation Model For Structured Medical Records
Steinberg, Ethan, Fries, Jason, Xu, Yizhe, Shah, Nigam
We present a self-supervised, time-to-event (TTE) foundation model called MOTOR (Many Outcome Time Oriented Representations) which is pretrained on timestamped sequences of events in electronic health records (EHR) and health insurance claims. TTE models are used for estimating the probability distribution of the time until a specific event occurs, which is an important task in medical settings. TTE models provide many advantages over classification using fixed time horizons, including naturally handling censored observations, but are challenging to train with limited labeled data. MOTOR addresses this challenge by pretraining on up to 55M patient records (9B clinical events). We evaluate MOTOR's transfer learning performance on 19 tasks, across 3 patient databases (a private EHR system, MIMIC-IV, and Merative claims data). Task-specific models adapted from MOTOR improve time-dependent C statistics by 4.6% over state-of-the-art, improve label efficiency by up to 95%, and are more robust to temporal distributional shifts. We further evaluate cross-site portability by adapting our MOTOR foundation model for six prediction tasks on the MIMIC-IV dataset, where it outperforms all baselines. MOTOR is the first foundation model for medical TTE predictions and we release a 143M parameter pretrained model for research use at [redacted URL].
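The advantage of a TTE objective over fixed-horizon classification, namely that censored observations contribute naturally to the likelihood, can be illustrated with a piecewise-constant-hazard formulation. The sketch below is a generic TTE objective for illustration, not MOTOR's exact pretraining head.

    import numpy as np

    def piecewise_exponential_nll(hazards, event_time, observed, bin_edges):
        """Negative log-likelihood of one (possibly censored) time-to-event observation under
        piecewise-constant hazards (illustrative generic TTE objective).

        hazards:    per-bin hazard rates predicted for this patient/outcome
        event_time: time of the event, or of censoring
        observed:   True if the event occurred, False if censored
        bin_edges:  increasing bin boundaries starting at 0, covering the horizon
        """
        hazards = np.asarray(hazards, dtype=float)
        bin_edges = np.asarray(bin_edges, dtype=float)
        # time at risk spent in each bin before event_time
        exposure = np.clip(np.minimum(event_time, bin_edges[1:]) - bin_edges[:-1], 0, None)
        nll = np.sum(hazards * exposure)            # cumulative hazard term (applies even if censored)
        if observed:
            k = min(np.searchsorted(bin_edges, event_time, side="right") - 1, len(hazards) - 1)
            nll -= np.log(hazards[k] + 1e-12)       # density term for the bin containing the event
        return nll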
A Multi-Center Study on the Adaptability of a Shared Foundation Model for Electronic Health Records
Guo, Lin Lawrence, Fries, Jason, Steinberg, Ethan, Fleming, Scott Lanyon, Morse, Keith, Aftandilian, Catherine, Posada, Jose, Shah, Nigam, Sung, Lillian
Foundation models hold promise for transforming AI in healthcare by providing modular components that are easily adaptable to downstream healthcare tasks, making AI development more scalable and cost-effective. Structured EHR foundation models, trained on coded medical records from millions of patients, demonstrated benefits including increased performance with fewer training labels, and improved robustness to distribution shifts. However, questions remain on the feasibility of sharing these models across different hospitals and their performance for local task adaptation. This multi-center study examined the adaptability of a recently released structured EHR foundation model ($FM_{SM}$), trained on longitudinal medical record data from 2.57M Stanford Medicine patients. Experiments were conducted using EHR data at The Hospital for Sick Children and MIMIC-IV. We assessed both adaptability via continued pretraining on local data, and task adaptability compared to baselines of training models from scratch at each site, including a local foundation model. We evaluated the performance of these models on 8 clinical prediction tasks. In both datasets, adapting the off-the-shelf $FM_{SM}$ matched the performance of GBM models locally trained on all data while providing a 13% improvement in settings with few task-specific training labels. With continued pretraining on local data, label efficiency substantially improved, such that $FM_{SM}$ required fewer than 1% of training examples to match the fully trained GBM's performance. Continued pretraining was also 60 to 90% more sample-efficient than training local foundation models from scratch. Our findings show that adapting shared EHR foundation models across hospitals provides improved prediction performance at less cost, underscoring the utility of base foundation models as modular components to streamline the development of healthcare AI.
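The label-efficiency comparison described above amounts to evaluating each adaptation strategy at increasing label budgets. A minimal sketch, with adapt_fn and evaluate_fn as placeholders for a task-adaptation routine (e.g., fitting a task head on the foundation model's representations) and an evaluation metric, is:

    def label_efficiency_curve(adapt_fn, evaluate_fn, train_pool, test_set,
                               budgets=(100, 1000, 10000)):
        """Compare adaptation strategies (off-the-shelf FM, continued-pretrained FM, locally
        trained baseline) at increasing label budgets; arguments are illustrative placeholders."""
        return {n: evaluate_fn(adapt_fn(train_pool[:n]), test_set) for n in budgets}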
Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model System for Answering Medical Questions using Scientific Literature
Lozano, Alejandro, Fleming, Scott L, Chiang, Chia-Chun, Shah, Nigam
The quickly expanding nature of published medical literature makes it challenging for clinicians and researchers to keep up with and summarize recent, relevant findings in a timely manner. While several closed-source summarization tools based on large language models (LLMs) now exist, rigorous and systematic evaluations of their outputs are lacking. Furthermore, there is a paucity of high-quality datasets and appropriate benchmark tasks with which to evaluate these tools. We address these issues with four contributions: we release Clinfo.ai, an open-source WebApp that answers clinical questions based on dynamically retrieved scientific literature; we specify an information retrieval and abstractive summarization task to evaluate the performance of such retrieval-augmented LLM systems; we release a dataset of 200 questions and corresponding answers derived from published systematic reviews, which we name PubMed Retrieval and Synthesis (PubMedRS-200); and we report benchmark results for Clinfo.ai and other publicly available OpenQA systems on PubMedRS-200.
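The retrieve-then-summarize pattern such a system implements can be sketched as follows; search_fn and summarize_llm stand in for a literature-search backend and an LLM, and are assumptions for illustration rather than Clinfo.ai's actual interfaces.

    def answer_clinical_question(question, search_fn, summarize_llm, top_k=8):
        """Retrieve recent abstracts relevant to the question, then have an LLM synthesize an
        answer that cites them. `search_fn(question)` is assumed to return dicts with
        'pmid' and 'abstract' keys; both callables are user-supplied placeholders."""
        articles = search_fn(question)[:top_k]
        context = "\n\n".join(f"[PMID {a['pmid']}] {a['abstract']}" for a in articles)
        prompt = ("Using only the abstracts below, answer the clinical question and cite PMIDs.\n\n"
                  f"Question: {question}\n\nAbstracts:\n{context}")
        return summarize_llm(prompt)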
Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects
Yadlowsky, Steve, Fleming, Scott, Shah, Nigam, Brunskill, Emma, Wager, Stefan
There are a number of available methods that can be used for choosing whom to prioritize for treatment, including ones based on treatment effect estimation, risk scoring, and hand-crafted rules. We propose rank-weighted average treatment effect (RATE) metrics as a simple and general family of metrics for comparing treatment prioritization rules on a level playing field. RATEs are agnostic as to how the prioritization rules were derived, and only assess them based on how well they succeed in identifying units that benefit the most from treatment. We define a family of RATE estimators and prove a central limit theorem that enables asymptotically exact inference in a wide variety of randomized and observational study settings. We provide justification for the use of bootstrapped confidence intervals and a framework for testing hypotheses about heterogeneity in treatment effectiveness correlated with the prioritization rule. Our definition of the RATE nests a number of existing metrics, including the Qini coefficient, and our analysis directly yields inference methods for these metrics. We demonstrate our approach in examples drawn from both personalized medicine and marketing. In the medical setting, using data from the SPRINT and ACCORD-BP randomized control trials, we find no significant evidence of heterogeneous treatment effects. On the other hand, in a large marketing trial, we find robust evidence of heterogeneity in the treatment effects of some digital advertising campaigns and demonstrate how RATEs can be used to compare targeting rules that prioritize estimated risk vs. those that prioritize estimated treatment benefit.
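Under our reading of the construction (notation here is ours and may differ in detail from the paper's), a prioritization rule $S$ induces a targeting operator characteristic (TOC) curve, and a RATE is a weighted average of that curve:

\[
\mathrm{TOC}(q;\,S) \;=\; \mathbb{E}\bigl[\,Y_i(1)-Y_i(0)\;\big|\;S(X_i) \ge F_S^{-1}(1-q)\,\bigr] \;-\; \mathbb{E}\bigl[\,Y_i(1)-Y_i(0)\,\bigr],
\qquad
\theta_\alpha(S) \;=\; \int_0^1 \alpha(q)\,\mathrm{TOC}(q;\,S)\,dq,
\]

where $q$ is the fraction of units treated first under $S$ and $\alpha$ is a weighting function; for example, $\alpha(q)=q$ recovers a Qini-type coefficient, which is how existing metrics are nested within the family.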
General-purpose validation and model selection when estimating individual treatment effects
Schuler, Alejandro, Shah, Nigam
Practitioners in medicine, business, political science, and other fields are increasingly aware that decisions should be personalized to each patient, customer, or voter. A given treatment (e.g. a drug or advertisement) should be administered only to those who will respond most positively, and certainly not to those who will be harmed by it. Individual-level treatment effects (ITEs) can be estimated with tools adapted from machine learning, but different models can yield contradictory estimates. Unlike risk prediction models, however, treatment effect models cannot be easily evaluated against each other using a held-out test set because the true treatment effect itself is never directly observed. Besides outcome prediction accuracy, several approaches that use held-out data to evaluate treatment effect models have been proposed, but they are largely unknown or cloistered within disciplines. We present a review of these approaches and demonstrate theoretical relationships among them. We demonstrate their behavior using simulations of both randomized and observational data. Based on our empirical and theoretical results, we advocate for the standardized use of estimated decision value for individual treatment effect model selection and validation.
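The estimated decision value criterion has a simple held-out form. On a randomized test set with known treatment probabilities, the value of the policy implied by an ITE model (treat when the predicted effect is beneficial) can be estimated by inverse-probability weighting; the sketch below is one standard estimator of that value, not necessarily the exact variant used in the paper.

    import numpy as np

    def estimated_decision_value(tau_hat, y, t, p_treat):
        """IPW estimate of the value of the policy 'treat iff predicted effect > 0' on
        held-out randomized data (higher outcomes assumed better).

        tau_hat: predicted individual treatment effects on the test set
        y:       observed outcomes
        t:       observed treatment indicators (0/1)
        p_treat: treatment assignment probability (scalar or per-unit array)
        """
        tau_hat, y, t = map(np.asarray, (tau_hat, y, t))
        p = np.asarray(p_treat, dtype=float)
        policy = (tau_hat > 0).astype(int)
        propensity = np.where(policy == 1, p, 1.0 - p)   # probability of observing the policy's action
        match = (t == policy).astype(float)
        return np.mean(match * y / propensity)

Models can then be ranked by this estimated value, which directly targets the quality of the decisions they would induce rather than the accuracy of their outcome predictions.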
Synth-Validation: Selecting the Best Causal Inference Method for a Given Dataset
Schuler, Alejandro, Jung, Ken, Tibshirani, Robert, Hastie, Trevor, Shah, Nigam
Many decisions in healthcare, business, and other policy domains are made without the support of rigorous evidence due to the cost and complexity of performing randomized experiments. Using observational data to answer causal questions is risky: subjects who receive different treatments also differ in other ways that affect outcomes. Many causal inference methods have been developed to mitigate these biases. However, there is no way to know which method might produce the best estimate of a treatment effect in a given study. In analogy to cross-validation, which estimates the prediction error of predictive models applied to a given dataset, we propose synth-validation, a procedure that estimates the estimation error of causal inference methods applied to a given dataset. In synth-validation, we use the observed data to estimate generative distributions with known treatment effects. We apply each causal inference method to datasets sampled from these distributions and compare the effect estimates with the known effects to estimate error. Using simulations, we show that using synth-validation to select a causal inference method for each study lowers the expected estimation error relative to consistently using any single method.
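The procedure lends itself to a short summary in code; the generative-model and estimator interfaces below are placeholders for whatever a particular analysis uses, and the squared-error criterion is one reasonable choice rather than a prescribed one.

    def synth_validation(observed_data, fit_generators, candidate_methods, n_draws=50):
        """Select the causal-inference method with the lowest estimated error on synthetic
        datasets that mimic the observed data but have known treatment effects.

        fit_generators(observed_data) -> iterable of (sample_fn, true_effect) pairs
        candidate_methods: dict mapping method name -> estimator, where estimator(dataset)
        returns a treatment-effect estimate. All interfaces are illustrative placeholders.
        """
        errors = {name: [] for name in candidate_methods}
        for sample_fn, true_effect in fit_generators(observed_data):
            for _ in range(n_draws):
                synthetic = sample_fn()
                for name, estimate in candidate_methods.items():
                    errors[name].append((estimate(synthetic) - true_effect) ** 2)
        # choose the method with the smallest mean squared estimation error
        return min(errors, key=lambda name: sum(errors[name]) / len(errors[name]))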