sepsis
Training and Evaluation of Guideline-Based Medical Reasoning in LLMs
Staniek, Michael, Sokolov, Artem, Riezler, Stefan
Machine learning for early prediction in medicine has recently shown breakthrough performance; however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- Pacific Ocean > North Pacific Ocean > Gulf of Thailand (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
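The distinction between derivation correctness and value correctness in the abstract above can be illustrated with a minimal sketch. It assumes a heavily simplified reading of Sepsis-3 (suspected infection plus an increase of at least 2 SOFA points over baseline) and ignores the timing windows of the full definition. The `ReasoningTrace` class and all function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    """Hypothetical container for what a model states in its reasoning."""
    baseline_sofa: int         # SOFA score the model assumes as baseline
    predicted_sofa: int        # SOFA score the model forecasts
    suspected_infection: bool  # premise stated by the model
    concluded_sepsis: bool     # conclusion stated by the model


def sepsis3_rule(baseline_sofa: int, current_sofa: int, infection: bool) -> bool:
    # Simplified assumption: infection plus a SOFA increase of >= 2 points.
    return infection and (current_sofa - baseline_sofa) >= 2


def derivation_correct(trace: ReasoningTrace) -> bool:
    # Does the stated conclusion follow from the model's OWN premises?
    expected = sepsis3_rule(
        trace.baseline_sofa, trace.predicted_sofa, trace.suspected_infection
    )
    return trace.concluded_sepsis == expected


def value_correct(trace: ReasoningTrace, measured_sofa: int) -> bool:
    # Does the forecast value match the real-world measurement?
    return trace.predicted_sofa == measured_sofa
```

A trace can be derivation-correct yet value-incorrect: a model that forecasts a SOFA score of 4 from a baseline of 1 correctly deduces sepsis from its own premises, even if the measured score later turns out to be 2. This is the sense in which the two checks are orthogonal.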
Data reuse enables cost-efficient randomized trials of medical AI models
Nercessian, Michael, Zhang, Wenxin, Schubert, Alexander, Yang, Daphne, Chung, Maggie, Alaa, Ahmed, Yala, Adam
Joint Senior Corresponding Author: Michael Nercessian. Email: michael.nercessian@berkeley.edu. Abstract: Randomized controlled trials (RCTs) are indispensable for establishing the clinical value of medical artificial-intelligence (AI) tools, yet their high cost and long timelines hinder timely validation as new models emerge rapidly. Here, we propose BRIDGE, a data-reuse RCT design for AI-based risk models. AI risk models support a broad range of interventions, including screening, treatment selection, and clinical alerts. BRIDGE trials recycle participant-level data from completed trials of AI models when legacy and updated models make concordant predictions, thereby reducing the enrollment requirement for subsequent trials. We provide a practical checklist for investigators to assess whether reusing data from previous trials allows for valid causal inference and preserves type I error. Using real-world datasets across breast cancer, cardiovascular disease, and sepsis, we demonstrate concordance between successive AI models, with up to 64.8% overlap in top 5% high-risk cohorts. We then simulate a series of breast cancer screening studies, where our design reduced required enrollment by 46.6%, saving over US$2.8 million, while maintaining 80% power. By transforming trials into adaptive, modular studies, our proposed design makes Level I evidence generation feasible for every model iteration, thereby accelerating cost-effective translation of AI into routine care. Introduction: Artificial intelligence (AI) models have the potential to transform patient care by identifying high-risk individuals using high-dimensional data (such as imaging, electronic health records, or time-series data) to personalize screening, prevention, and treatment decisions across a range of diseases, including cancer and heart disease.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Research Report > Strength High (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (0.57)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Applied AI (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
- Information Technology > Artificial Intelligence > Natural Language (0.68)
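The concordance idea in the BRIDGE abstract above (reusing participants on whom legacy and updated models agree, and measuring overlap in top 5% high-risk cohorts) can be sketched as follows. This is an illustration under assumptions, not the paper's procedure: the function names are hypothetical, "high risk" is taken to mean the top fraction of scores within the same cohort, and overlap is measured relative to the size of the legacy model's high-risk group.

```python
import numpy as np


def high_risk_flags(scores, fraction=0.05):
    """Flag the top `fraction` of participants by risk score (at least one)."""
    s = np.asarray(scores, dtype=float)
    k = max(1, int(round(fraction * s.size)))
    flags = np.zeros(s.size, dtype=bool)
    # Stable sort on negated scores: highest risk first, ties by original order.
    flags[np.argsort(-s, kind="stable")[:k]] = True
    return flags


def top_fraction_overlap(legacy_scores, updated_scores, fraction=0.05):
    """Share of the legacy high-risk cohort also flagged by the updated model."""
    legacy = high_risk_flags(legacy_scores, fraction)
    updated = high_risk_flags(updated_scores, fraction)
    return (legacy & updated).sum() / legacy.sum()


def concordant_mask(legacy_scores, updated_scores, fraction=0.05):
    """True where both models agree on the high-risk flag (assumed reuse candidates)."""
    legacy = high_risk_flags(legacy_scores, fraction)
    updated = high_risk_flags(updated_scores, fraction)
    return legacy == updated
```

Whether a concordant participant's data may actually be reused depends on the validity checklist the abstract mentions; the mask above only identifies agreement between the two models.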
reviews. Reviewer 1 comments that the experiments could be on larger problems - we agree that this is fair, and this is certainly of interest going forward
We thank all our reviewers for their time and for their kind words. Reviewer 2 comments that some details are unclear. These details were put in the appendix for space; we will add them to the main paper. As you suggest, the model implements a feedforward step analogous to an RNN. On the number of function evaluations (NFE), leaving this out was an oversight.
Explainable AI For Early Detection Of Sepsis
Thakur, Atharva, Dhumal, Shruti
Department of Multidisciplinary Engineering (AI & DS), Vishwakarma Institute of Technology, Pune, 411037, Maharashtra, India. Abstract: Sepsis is a potentially fatal medical disorder that needs to be identified and treated right away to avoid fatalities. It must be quickly identified and treated in order to stop it from developing into severe sepsis, septic shock, and multi-organ failure. Sepsis remains a significant problem for doctors despite advancements in medical technology and treatment methods. The onset of the disease has been successfully predicted by machine learning models in recent years, but due to their black-box character, it is challenging to interpret these predictions and comprehend the underlying illness mechanisms. In this research, we propose a comprehensible AI method for sepsis analysis that combines machine learning with clinical knowledge and expertise in the domain. Our method allows clinicians to understand and verify the model's predictions based on clinical expertise and preexisting beliefs, in addition to providing precise predictions of the onset of sepsis. Keywords: Sepsis, Artificial Intelligence, Machine Learning, Explainable AI, Sensitivity Analysis. I. INTRODUCTION: As the world continues to advance in technology, the potential of artificial intelligence (AI) in healthcare is becoming more apparent.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.91)
- Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.89)
Stable Prediction of Adverse Events in Medical Time-Series Data
Keoliya, Mayank, Choi, Seewon, Alur, Rajeev, Naik, Mayur, Wong, Eric
Early event prediction (EEP) systems continuously estimate a patient's imminent risk to support clinical decision-making. For bedside trust, risk trajectories must be accurate and temporally stable, shifting only with new, relevant evidence. However, current benchmarks (a) ignore stability of risk scores and (b) evaluate mainly on tabular inputs, leaving trajectory behavior untested. To address this gap, we introduce CAREBench, an EEP benchmark that evaluates deployability using multi-modal inputs (tabular EHR, ECG waveforms, and clinical text) and assesses temporal stability alongside predictive accuracy. We propose a stability metric that quantifies short-term variability in per-patient risk and penalizes abrupt oscillations based on local-Lipschitz constants. CAREBench spans six prediction tasks such as sepsis onset and compares classical learners, deep sequence models, and zero-shot LLMs. Across tasks, existing methods, especially LLMs, struggle to jointly optimize accuracy and stability, with notably poor recall at high-precision operating points. These results highlight the need for models that produce evidence-aligned, stable trajectories to earn clinician trust in continuous monitoring settings. (Code: https://github.com/SeewonChoi/CAREBench.)
- North America > United States > Pennsylvania (0.04)
- Europe > Netherlands (0.04)
- Asia > Middle East > Israel (0.04)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.46)
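To make the notion of a stability metric based on local-Lipschitz constants more concrete, here is a minimal sketch for a single patient's risk trajectory. It is not CAREBench's metric: the window-based neighbourhood, the function names, and the mean aggregation are all assumptions chosen for illustration.

```python
import numpy as np


def local_lipschitz(times, risks, window):
    """Per-timestep local Lipschitz estimate of a risk trajectory.

    For each time t_i, returns the largest |r_j - r_i| / |t_j - t_i| over
    neighbours with 0 < |t_j - t_i| <= window (0.0 if there are none).
    """
    t = np.asarray(times, dtype=float)
    r = np.asarray(risks, dtype=float)
    if t.ndim != 1 or t.shape != r.shape:
        raise ValueError("times and risks must be 1-D arrays of equal length")
    dt = np.abs(t[:, None] - t[None, :])
    dr = np.abs(r[:, None] - r[None, :])
    neighbours = (dt > 0) & (dt <= window)
    # Guard the division so that zero time gaps never divide by zero.
    slopes = np.where(neighbours, dr / np.where(dt > 0, dt, 1.0), 0.0)
    return slopes.max(axis=1)


def instability_score(times, risks, window):
    """Assumed summary: mean local Lipschitz constant (lower is more stable)."""
    return float(local_lipschitz(times, risks, window).mean())
```

A flat trajectory scores 0, while a one-step spike raises the local constants of every timestep adjacent to it, which is the kind of abrupt oscillation the abstract says should be penalized.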
HealthProcessAI: A Technical Framework and Proof-of-Concept for LLM-Enhanced Healthcare Process Mining
Illueca-Fernandez, Eduardo, Chen, Kaile, Seoane, Fernando, Abtahi, Farhad
Process mining has emerged as a powerful analytical technique for understanding complex healthcare workflows. However, its application faces significant barriers, including technical complexity, a lack of standardized approaches, and limited access to practical training resources. We introduce HealthProcessAI, a GenAI framework designed to simplify process mining applications in healthcare and epidemiology by providing a comprehensive wrapper around existing Python (PM4PY) and R (bupaR) libraries. To address unfamiliarity and improve accessibility, the framework integrates multiple Large Language Models (LLMs) for automated process map interpretation and report generation, helping translate technical analyses into outputs that diverse users can readily understand. We validated the framework using sepsis progression data as a proof-of-concept example and compared the outputs of five state-of-the-art LLMs through the OpenRouter platform. In functional testing, the framework successfully processed sepsis data across four proof-of-concept scenarios, demonstrating robust technical performance and its capability to generate reports through automated LLM analysis. Evaluation using five independent LLMs as automated evaluators revealed distinct model strengths: Claude Sonnet-4 and Gemini 2.5-Pro achieved the highest consistency scores (3.79/4.0 and 3.65/4.0, respectively). By integrating multiple LLMs for automated interpretation and report generation, the framework addresses widespread unfamiliarity with process mining outputs, making them more accessible to clinicians, data scientists, and researchers. This combination of structured analytics and AI-driven interpretation represents a novel methodological advance in translating complex process mining results into potentially actionable insights for healthcare applications.
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Instructional Material (0.86)
End to End Autoencoder MLP Framework for Sepsis Prediction
Cai, Hejiang, Wu, Di, Xu, Ji, Liu, Xiang, Zhu, Yiziting, Shu, Xin, Li, Yujie, Yi, Bin
Sepsis is a life-threatening condition that requires timely detection in intensive care settings. Traditional machine learning approaches, including Naive Bayes, Support Vector Machine (SVM), Random Forest, and XGBoost, often rely on manual feature engineering and struggle with the irregular, incomplete time-series data commonly present in electronic health records. We introduce an end-to-end deep learning framework integrating an unsupervised autoencoder for automatic feature extraction with a multilayer perceptron classifier for binary sepsis risk prediction. To enhance clinical applicability, we implement a customized downsampling strategy that extracts high-information-density segments during training and a non-overlapping dynamic sliding window mechanism for real-time inference. Preprocessed time-series data are represented as fixed-dimension vectors with explicit missingness indicators, mitigating bias and noise. We validate our approach on three ICU cohorts. Our end-to-end model achieves accuracies of 74.6%, 80.6%, and 93.5%, respectively, consistently outperforming traditional machine learning baselines. These results demonstrate the framework's superior robustness, generalizability, and clinical utility for early sepsis detection across heterogeneous ICU environments.
- Research Report > Experimental Study (0.95)
- Research Report > New Finding (0.67)
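The representation described in the abstract above (non-overlapping windows turned into fixed-dimension vectors with explicit missingness indicators) can be sketched as below. The details are assumptions made for illustration: the abstract does not say how values are aggregated or how the "dynamic" window length is chosen, so this sketch uses a fixed window length, per-window means, and zero fill. The function name is hypothetical.

```python
import numpy as np


def window_features(series, window):
    """Convert a (T, F) time series with NaN gaps into per-window vectors.

    Returns an array of shape (T // window, 2 * F): for each non-overlapping
    window, the mean of observed values per feature (0.0 if none observed),
    followed by a 0/1 indicator that the feature was entirely missing.
    Trailing rows that do not fill a whole window are dropped.
    """
    x = np.asarray(series, dtype=float)
    n_windows = x.shape[0] // window
    x = x[: n_windows * window].reshape(n_windows, window, x.shape[1])
    observed = ~np.isnan(x)
    counts = observed.sum(axis=1)
    sums = np.where(observed, x, 0.0).sum(axis=1)
    # Divide only where at least one value was observed; elsewhere keep 0.0.
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    missing = (counts == 0).astype(float)
    return np.concatenate([means, missing], axis=1)
```

Keeping the indicator separate from the filled value lets a downstream model distinguish "measured as zero" from "never measured", which is one common reason to encode missingness explicitly.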