Goto

Collaborating Authors

 patient data


Palantir's access to identifiable NHS England patient data is 'dangerous', MPs say

The Guardian

NHS England said it had'strict policies in place for managing access to patient data'. NHS England said it had'strict policies in place for managing access to patient data'. Palantir's access to identifiable NHS England patient data is'dangerous', MPs say Health service has given US tech firm'unlimited access' to certain data to build integrated platform, according to reports Mon 11 May 2026 08.01 EDTLast modified on Mon 11 May 2026 10.06 EDT MPs have warned that an NHS decision to grant Palantir access to identifiable patient information in its plan to use AI to improve the health service is "dangerous" and will fuel public fears that data privacy is not being prioritised. NHS England has allowed staff from the US tech firm and other contractors to access patient data before it has been pseudonymised, despite internal fears of a "risk of loss of public confidence", the Financial Times reported. The health service made the move to allow Palantir to access the data in recent weeks according to the reports, which revealed an internal NHS briefing that said it would allow "unlimited access to non-NHSE staff" to part of the NHS's federated data platform (FDP), which holds identifiable patient information.


MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Vasilev, Kiril, Misrahi, Alexandre, Jain, Eeshaan, Cheng, Phil F, Liakopoulos, Petros, Michielin, Olivier, Moor, Michael, Bunne, Charlotte

arXiv.org Artificial Intelligence

Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability -- frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.


FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

Lee, Gyubok, Bach, Elea, Yang, Eric, Pollard, Tom, Johnson, Alistair, Choi, Edward, jia, Yugang, Lee, Jong Ha

arXiv.org Artificial Intelligence

The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench--a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them--both of which critically affect question answering performance.


Federated Deep Reinforcement Learning for Privacy-Preserving Robotic-Assisted Surgery

Hafeez, Sana, Mulkana, Sundas Rafat, Imran, Muhammad Ali, Sevegnani, Michele

arXiv.org Artificial Intelligence

The integration of Reinforcement Learning (RL) into robotic-assisted surgery (RAS) holds significant promise for advancing surgical precision, adaptability, and autonomous decision-making. However, the development of robust RL models in clinical settings is hindered by key challenges, including stringent patient data privacy regulations, limited access to diverse surgical datasets, and high procedural variability. To address these limitations, this paper presents a Federated Deep Reinforcement Learning (FDRL) framework that enables decentralized training of RL models across multiple healthcare institutions without exposing sensitive patient information. A central innovation of the proposed framework is its dynamic policy adaptation mechanism, which allows surgical robots to select and tailor patient-specific policies in real-time, thereby ensuring personalized and Optimised interventions. To uphold rigorous privacy standards while facilitating collaborative learning, the FDRL framework incorporates secure aggregation, differential privacy, and homomorphic encryption techniques. Experimental results demonstrate a 60\% reduction in privacy leakage compared to conventional methods, with surgical precision maintained within a 1.5\% margin of a centralized baseline. This work establishes a foundational approach for adaptive, secure, and patient-centric AI-driven surgical robotics, offering a pathway toward clinical translation and scalable deployment across diverse healthcare environments.


A systematic review of trial-matching pipelines using large language models

Morrison, Braxton A., Sushil, Madhumita, Young, Jacob S.

arXiv.org Artificial Intelligence

Matching patients to clinical trial options is critical for identifying novel treatments, especially in oncology. However, manual matching is labor-intensive and error-prone, leading to recruitment delays. Pipelines incorporating large language models (LLMs) offer a promising solution. We conducted a systematic review of studies published between 2020 and 2025 from three academic databases and one preprint server, identifying LLM-based approaches to clinical trial matching. Of 126 unique articles, 31 met inclusion criteria. Reviewed studies focused on matching patient-to-criterion only (n=4), patient-to-trial only (n=10), trial-to-patient only (n=2), binary eligibility classification only (n=1) or combined tasks (n=14). Sixteen used synthetic data; fourteen used real patient data; one used both. Variability in datasets and evaluation metrics limited cross-study comparability. In studies with direct comparisons, the GPT-4 model consistently outperformed other models, even finely-tuned ones, in matching and eligibility extraction, albeit at higher cost. Promising strategies included zero-shot prompting with proprietary LLMs like the GPT-4o model, advanced retrieval methods, and fine-tuning smaller, open-source models for data privacy when incorporation of large models into hospital infrastructure is infeasible. Key challenges include accessing sufficiently large real-world data sets, and deployment-associated challenges such as reducing cost, mitigating risk of hallucinations, data leakage, and bias. This review synthesizes progress in applying LLMs to clinical trial matching, highlighting promising directions and key limitations. Standardized metrics, more realistic test sets, and attention to cost-efficiency and fairness will be critical for broader deployment.


T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images

Wiedeman, Christopher, Sarmakeeva, Anastasiia, Sizikova, Elena, Filienko, Daniil, Lago, Miguel, Delfino, Jana G., Badano, Aldo

arXiv.org Artificial Intelligence

Responsible for approximately two million new cases and over six hundred thousand deaths in 2022 alone (Sung et al., 2021), breast cancer remains a prominent global health concern, and is expected to account nearly one-third of all newly diagnosed cancers among women in the United States (DeSantis et al., 2016). According to the most recent report from International Agency for Research on Cancer (Bray et al., 2024), it is one of the most widespread cancers diagnosed worldwide, both in the number of cases and associated deaths. Consequently, medical imaging techniques are indispensable for screening, diagnosis, and further research into the disease. Historically, the most common imaging technique for breast cancer screening is digital mammography (DM), in which a 2D x-ray projection of a compressed breast is taken. Digital breast tomosynthesis (DBT), a pseudo-3D imaging technique, has been increasingly adopted, demonstrating improved screening performance (Asbeutah et al., 2019; Sprague et al., 2023).


FHIR-RAG-MEDS: Integrating HL7 FHIR with Retrieval-Augmented Large Language Models for Enhanced Medical Decision Support

Kabak, Yildiray, Erturkmen, Gokce B. Laleci, Gencturk, Mert, Namli, Tuncay, Sinaci, A. Anil, Corcoles, Ruben Alcantud, Ballesteros, Cristina Gomez, Abizanda, Pedro, Dogac, Asuman

arXiv.org Artificial Intelligence

In recent years, the field of medical informatics has seen significant advancements with the introduction of medical large language models (LLMs). These models, powered by artificial intelligence, have demonstrated remarkable capabilities in understanding and generating medical text, providing valuable assistance in clinical decision - making, diagnostics, and patient care. Prom inent examples include models such as Meditron [1], BioMistral [2] and OpenBioLLM [3], which have shown considerable promise in various medical applications. However, despite these advancements, the inherent limitations of medical LLMs highlight the need for more robust solutions.


Mitigating Clinician Information Overload: Generative AI for Integrated EHR and RPM Data Analysis

Shetgaonkar, Ankit, Pradhan, Dipen, Arora, Lakshit, Girija, Sanjay Surendranath, Kapoor, Shashank, Raj, Aman

arXiv.org Artificial Intelligence

Generative Artificial Intelligence (GenAI), particularly Large Language Models (LLMs), offer powerful capabilities for interpreting the complex data landscape in healthcare. In this paper, we present a comprehensive overview of the capabilities, requirements and applications of GenAI for deriving clinical insights and improving clinical efficiency. We first provide some background on the forms and sources of patient data, namely real-time Remote Patient Monitoring (RPM) streams and traditional Electronic Health Records (EHRs). The sheer volume and heterogeneity of this combined data present significant challenges to clinicians and contribute to information overload. In addition, we explore the potential of LLM-powered applications for improving clinical efficiency. These applications can enhance navigation of longitudinal patient data and provide actionable clinical decision support through natural language dialogue. We discuss the opportunities this presents for streamlining clinician workflows and personalizing care, alongside critical challenges such as data integration complexity, ensuring data quality and RPM data reliability, maintaining patient privacy, validating AI outputs for clinical safety, mitigating bias, and ensuring clinical acceptance. We believe this work represents the first summarization of GenAI techniques for managing clinician data overload due to combined RPM / EHR data complexities.


An AI Model for the Brain Is Coming to the ICU

WIRED

The Cleveland Clinic is partnering with San Francisco-based startup Piramidal to develop a large-scale AI model that will be used to monitor patients' brain health in intensive care units. Instead of being trained on text, the system is based on electroencephalogram (EEG) data, which is collected via electrodes placed on the scalp and then read out by a computer in a series of wavy lines. EEG records the brain's electrical activity--and changes in this activity can indicate a problem. In an ICU setting, doctors scan EEG data looking for evidence of seizures, altered consciousness, or a decline in brain function. Currently, doctors rely on continuous EEG monitoring to detect abnormal brain activity in an ICU patient, but they can't monitor every individual patient in real time.


Cross-patient Seizure Onset Zone Classification by Patient-Dependent Weight

Zhao, Xuyang, Sugano, Hidenori, Tanaka, Toshihisa

arXiv.org Artificial Intelligence

Identifying the seizure onset zone (SOZ) in patients with focal epilepsy is essential for surgical treatment and remains challenging due to its dependence on visual judgment by clinical experts. The development of machine learning can assist in diagnosis and has made promising progress. However, unlike data in other fields, medical data is usually collected from individual patients, and each patient has different illnesses, physical conditions, and medical histories, which leads to differences in the distribution of each patient's data. This makes it difficult for a machine learning model to achieve consistently reliable performance in every new patient dataset, which we refer to as the "cross-patient problem." In this paper, we propose a method to fine-tune a pretrained model using patient-specific weights for every new test patient to improve diagnostic performance. First, the supervised learning method is used to train a machine learning model. Next, using the intermediate features of the trained model obtained through the test patient data, the similarity between the test patient data and each training patient's data is defined to determine the weight of each training patient to be used in the following fine-tuning. Finally, we fine-tune all parameters in the pretrained model with training data and patient weights. In the experiment, the leave-one-patient-out method is used to evaluate the proposed method, and the results show improved classification accuracy for every test patient, with an average improvement of more than 10%.