 FDA


A global log for medical AI

arXiv.org Artificial Intelligence

Modern computer systems often rely on syslog, a simple, universal protocol that records every critical event across heterogeneous infrastructure. However, healthcare's rapidly growing clinical AI stack has no equivalent. As hospitals rush to pilot large language models and other AI-based clinical decision support tools, we still lack a standard way to record how, when, by whom, and for whom these AI models are used. Without that transparency and visibility, it is challenging to measure real-world performance and outcomes, detect adverse events, or correct bias or dataset drift. In the spirit of syslog, we introduce MedLog, a protocol for event-level logging of clinical AI. Any time an AI model is invoked to interact with a human, interface with another algorithm, or act independently, a MedLog record is created. This record consists of nine core fields: header, model, user, target, inputs, artifacts, outputs, outcomes, and feedback, providing a structured and consistent record of model activity. To encourage early adoption, especially in low-resource settings, and minimize the data footprint, MedLog supports risk-based sampling, lifecycle-aware retention policies, and write-behind caching; detailed traces for complex, agentic, or multi-stage workflows can also be captured under MedLog. MedLog can catalyze the development of new databases and software to store and analyze MedLog records. Realizing this vision would enable continuous surveillance, auditing, and iterative improvement of medical AI, laying the foundation for a new form of digital epidemiology.
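As a rough illustration of the record structure described above, the following Python sketch models one MedLog event with the nine core fields as top-level keys. Only the nine field names come from the abstract. The class name `MedLogRecord`, the helper `new_record`, and every sub-field (`event_id`, `timestamp`, `name`, `role`, `id`, `prompt`, `text`) are hypothetical placeholders, not part of any published MedLog schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class MedLogRecord:
    # The nine top-level fields named in the abstract, in the same order.
    header: Dict[str, Any]
    model: Dict[str, Any]
    user: Dict[str, Any]
    target: Dict[str, Any]
    inputs: Dict[str, Any]
    artifacts: Dict[str, Any] = field(default_factory=dict)
    outputs: Dict[str, Any] = field(default_factory=dict)
    outcomes: Dict[str, Any] = field(default_factory=dict)  # may be filled in later
    feedback: Dict[str, Any] = field(default_factory=dict)  # may be filled in later

    def to_json(self) -> str:
        # Serialize the record as one JSON object, one object per logged event.
        return json.dumps(asdict(self))


def new_record(model_name: str, user_role: str, target_id: str, prompt: str) -> MedLogRecord:
    # All sub-field names below are hypothetical, chosen only for illustration.
    return MedLogRecord(
        header={
            "event_id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
        model={"name": model_name},
        user={"role": user_role},
        target={"id": target_id},
        inputs={"prompt": prompt},
    )


record = new_record("example-llm", "clinician", "patient-0001", "Summarize the latest labs.")
record.outputs["text"] = "Example summary."
print(record.to_json())
```

Leaving `outcomes` and `feedback` empty at creation time is a design assumption of this sketch: those fields would plausibly be known only after the model invocation.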


Intelligent Healthcare Ecosystems: Optimizing the Iron Triangle of Healthcare (Access, Cost, Quality)

arXiv.org Artificial Intelligence

The United States spends more on healthcare than any other nation (nearly 17% of GDP as of the early 2020s) yet struggles with uneven access and outcomes [1] [2]. This paradox of high cost, variable quality, and inequitable access is often described by the "Iron Triangle" of healthcare [3], which posits that improvements in one dimension (access, cost, or quality) often come at the expense of the others. This paper explores how an Intelligent Healthcare Ecosystem (iHE), an integrated system leveraging advanced technologies and data-driven innovation, can "bend" or even break this iron triangle, enabling simultaneous enhancements in access, cost-efficiency, and quality of care. We review historical and current trends in U.S. healthcare spending, including persistent waste and international comparisons, to underscore the need for transformative change. We then propose a conceptual model and strategic framework for iHE, incorporating emerging technologies such as generative AI and large language models (LLMs), federated learning, interoperability standards (FHIR) and nationwide networks (TEFCA), and digital twins. We introduce an updated healthcare value equation that integrates all three corners of the iron triangle, and we hypothesize that an intelligently coordinated ecosystem can maximize this value by delivering high-quality care to more people at lower cost. Methods include a narrative synthesis of recent literature and policy reports, and Results highlight key components and enabling technologies of an iHE. We discuss how such ecosystems can reduce waste, personalize care, enhance interoperability, and support value-based models, all while addressing challenges like privacy, bias, and stakeholder adoption. The paper is formatted per MDPI guidelines, with APA-style numbered references, illustrative figures (U.S. spending trends, waste breakdown, international spending comparison, conceptual models), equations, and a structured layout.
Our findings suggest that embracing an Intelligent Healthcare Ecosystem is pivotal for optimizing the long-standing trade-offs in healthcare's iron triangle, moving towards a system that is more accessible, affordable, and of higher quality for all.



Predictive Modeling and Explainable AI for Veterinary Safety Profiles, Residue Assessment, and Health Outcomes Using Real-World Data and Physicochemical Properties

arXiv.org Artificial Intelligence

The safe use of pharmaceuticals in food-producing animals is vital to protect animal welfare and human food safety. Adverse events (AEs) may signal unexpected pharmacokinetic or toxicokinetic effects, increasing the risk of violative residues in the food chain. This study introduces a predictive framework for classifying outcomes (Death vs. Recovery) using ~1.28 million reports (1987-2025 Q1) from the U.S. FDA's OpenFDA Center for Veterinary Medicine. A preprocessing pipeline merged relational tables and standardized AEs through VeDDRA ontologies. Data were normalized, missing values imputed, and high-cardinality features reduced; physicochemical drug properties were integrated to capture chemical-residue links. We evaluated supervised models, including Random Forest, CatBoost, XGBoost, ExcelFormer, and large language models (Gemma 3-27B, Phi 3-12B). Class imbalance was addressed with techniques such as undersampling and oversampling, with a focus on prioritizing recall for fatal outcomes. Ensemble methods (Voting, Stacking) and CatBoost performed best, achieving precision, recall, and F1-scores of 0.95. Incorporating Average Uncertainty Margin (AUM)-based pseudo-labeling of uncertain cases improved minority-class detection, particularly in ExcelFormer and XGBoost. Interpretability via SHAP identified biologically plausible predictors, including lung, heart, and bronchial disorders, animal demographics, and drug physicochemical properties. These features were strongly linked to fatal outcomes. Overall, the framework shows that combining rigorous data engineering, advanced machine learning, and explainable AI enables accurate, interpretable predictions of veterinary safety outcomes. The approach supports FARAD's mission by enabling early detection of high-risk drug-event profiles, strengthening residue risk assessment, and informing regulatory and clinical decision-making.
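The abstract mentions oversampling as one way to handle class imbalance but gives no details. The sketch below shows plain random oversampling in standard-library Python, which is one common variant and not necessarily the one the authors used. The function name `random_oversample` and the toy 8-to-2 class split are assumptions made up for illustration; they are not taken from the study's data.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def random_oversample(labels: Sequence[Hashable], seed: int = 0) -> List[int]:
    """Return row indices in which every class appears as often as the largest class.

    Minority-class rows are duplicated by sampling with replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    target = max(len(idx) for idx in by_class.values())
    resampled = []
    for idx in by_class.values():
        resampled.extend(idx)  # keep every original row once
        resampled.extend(rng.choices(idx, k=target - len(idx)))  # pad the minority class
    return resampled


# Toy labels (made up): 8 "Recovery" reports and 2 "Death" reports.
labels = ["Recovery"] * 8 + ["Death"] * 2
indices = random_oversample(labels)
print(Counter(labels[i] for i in indices))
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.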


Pharmacophore-Guided Generative Design of Novel Drug-Like Molecules

arXiv.org Artificial Intelligence

The integration of artificial intelligence (AI) in early-stage drug discovery offers unprecedented opportunities for exploring chemical space and accelerating hit-to-lead optimization. However, using docking as a reward function during generative model training is computationally expensive and may yield inaccurate results. Here, we present a novel generative framework that balances pharmacophore similarity to reference compounds with structural diversity from active molecules. The framework allows users to provide custom reference sets, including FDA-approved drugs or clinical candidates, and guides the de novo generation of potential therapeutics. We demonstrate its applicability through a case study targeting alpha estrogen receptor modulators and antagonists for breast cancer. The generated compounds maintain high pharmacophoric fidelity to known active molecules while introducing substantial structural novelty, suggesting strong potential for functional innovation and patentability. Comprehensive evaluation of the generated molecules against common drug-like properties confirms the robustness and pharmaceutical relevance of the approach.


Reason to Rote: Rethinking Memorization in Reasoning

arXiv.org Artificial Intelligence

Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening on reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
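To make the setup concrete, here is a minimal sketch of what a four-digit addition dataset with label noise could look like. The abstract does not say how the noise is generated, so the scheme below (replacing the true sum with a uniformly random wrong value at a fixed rate) is an assumption, as are the function name `make_fda_dataset`, the prompt format, and the `is_noisy` flag.

```python
import random
from typing import Dict, List, Union

Example = Dict[str, Union[str, bool]]


def make_fda_dataset(n: int, noise_rate: float, seed: int = 0) -> List[Example]:
    """Build n four-digit addition examples; a fraction gets a deliberately wrong label."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a = rng.randint(1000, 9999)
        b = rng.randint(1000, 9999)
        true_sum = a + b
        label = true_sum
        is_noisy = rng.random() < noise_rate
        if is_noisy:
            # Assumed noise scheme: resample until the label differs from the true sum.
            while label == true_sum:
                label = rng.randint(2000, 19998)
        examples.append({"prompt": f"{a}+{b}=", "label": str(label), "is_noisy": is_noisy})
    return examples


dataset = make_fda_dataset(n=200, noise_rate=0.1, seed=1)
print(dataset[0])
print(sum(ex["is_noisy"] for ex in dataset), "noisy examples out of", len(dataset))
```

Keeping the `is_noisy` flag alongside each example makes it possible to evaluate memorized noisy labels and clean labels separately.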


Supplementary Materials A Protein Targets Chosen for Generation

Neural Information Processing Systems

Figure A.1 shows the amino acid sequences corresponding to the three SARS-CoV-2 targets. We used a bidirectional Gated Recurrent Unit (GRU) with a linear output layer as an encoder. Figure B.1: The novelty of the scaffold of each generated molecule compared to the most similar scaffold in the training set. Similarity of the fingerprints is shown next to the scaffold of each generated molecule. We show a representative set of molecules generated for each target in Figure D.1. Figure D.1: Representative molecules generated for (top to bottom): NSP9 Replicase, Receptor-Binding Domain (RBD) of S protein, and Main Protease of SARS-CoV-2. RBD has maximum subgraph similarity to a commercially available drug Telavancin (See Figure E.3).



Evaluating Foundation Models with Pathological Concept Learning for Kidney Cancer

arXiv.org Artificial Intelligence

To evaluate the translational capabilities of foundation models, we develop a pathological concept learning approach focused on kidney cancer. By leveraging TNM staging guidelines and pathology reports, we build comprehensive pathological concepts for kidney cancer. Then, we extract deep features from whole slide images using foundation models, construct pathological graphs to capture spatial correlations, and train graph neural networks to identify these concepts. Finally, we demonstrate the effectiveness of this approach in kidney cancer survival analysis, highlighting its explainability and fairness in identifying low- and high-risk patients. The source code has been released at https://github.com/shangqigao/RadioPath.


From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents

arXiv.org Artificial Intelligence

Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that while agents achieve high Pass@5 of 90-95% (at least one of five trials) on IncreQA and 60-80% on AdaptQA, their Pass^5 (consistent success across all five trials) is substantially lower by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development.
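The gap between the two metrics follows directly from their definitions as given in the abstract: Pass@5 counts a task as solved if at least one of five trials succeeds, while Pass^5 requires all five to succeed. The sketch below computes both from per-task trial outcomes. The function names and the toy outcomes are illustrative assumptions, not benchmark code or benchmark results.

```python
from typing import List, Sequence


def pass_at_k(trials_per_task: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where at least one trial succeeded (Pass@k)."""
    return sum(any(trials) for trials in trials_per_task) / len(trials_per_task)


def pass_hat_k(trials_per_task: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where every trial succeeded (Pass^k)."""
    return sum(all(trials) for trials in trials_per_task) / len(trials_per_task)


# Toy outcomes (made up): four tasks, five trials each.
toy_results: List[List[bool]] = [
    [True, True, True, True, True],       # always solved
    [True, False, True, True, True],      # solved most of the time
    [False, False, False, False, True],   # solved once
    [False, False, False, False, False],  # never solved
]

print("Pass@5:", pass_at_k(toy_results))   # 3 of 4 tasks -> 0.75
print("Pass^5:", pass_hat_k(toy_results))  # 1 of 4 tasks -> 0.25
```

Because a task that counts toward Pass^5 always counts toward Pass@5, Pass^5 can never exceed Pass@5; the size of the gap reflects how inconsistent the agent is across repeated trials.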