clinical reasoning
MCP-AI: Protocol-Driven Intelligence Framework for Autonomous Reasoning in Healthcare
ElSayed, Zag, Erickson, Craig, Pedapati, Ernest
Healthcare AI systems have historically faced challenges in merging contextual reasoning, long-term state management, and human-verifiable workflows into a cohesive framework. This paper introduces a completely innovative architecture and concept: combining the Model Context Protocol (MCP) with a specific clinical application, known as MCP-AI. This integration allows intelligent agents to reason over extended periods, collaborate securely, and adhere to authentic clinical logic, representing a significant shift away from traditional Clinical Decision Support Systems (CDSS) and prompt-based Large Language Models (LLMs). As healthcare systems become more complex, the need for autonomous, context-aware clinical reasoning frameworks has become urgent. We present MCP-AI, a novel architecture for explainable medical decision-making built upon the Model Context Protocol (MCP) a modular, executable specification for orchestrating generative and descriptive AI agents in real-time workflows. Each MCP file captures clinical objectives, patient context, reasoning state, and task logic, forming a reusable and auditable memory object. Unlike conventional CDSS or stateless prompt-based AI systems, MCP-AI supports adaptive, longitudinal, and collaborative reasoning across care settings. MCP-AI is validated through two use cases: (1) diagnostic modeling of Fragile X Syndrome with comorbid depression, and (2) remote coordination for Type 2 Diabetes and hypertension. In either scenario, the protocol facilitates physician-in-the-loop validation, streamlines clinical processes, and guarantees secure transitions of AI responsibilities between healthcare providers. The system connects with HL7/FHIR interfaces and adheres to regulatory standards, such as HIPAA and FDA SaMD guidelines. MCP-AI provides a scalable basis for interpretable, composable, and safety-oriented AI within upcoming clinical environments.
Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning
Gong, Haozhen, Ji, Xiaozhong, Liu, Yuansen, Wu, Wenbin, Yan, Xiaoxiao, Liu, Jingjing, Wu, Kai, Pan, Jiazhen, Jian, Bailiang, Zhang, Jiangning, Hu, Xiaobin, Li, Hongwei Bran
MLLMs MLLMs are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med-CMR, a fine-grained Medical Complex Multimodal Reasoning benchmark. Med-CMR distinguishes from existing counterparts by three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine-grained visual understanding and multi-step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small-object detection, fine-detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long-tail generalization, multi-source integration); 3) Broad, high-quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two-stage (human expert + model-assisted) review to ensure clinical authenticity. We evaluate 18 state-of-the-art MLLMs with Med-CMR, revealing GPT-5 as the top-performing commercial model: 57.81 accuracy on multiple-choice questions (MCQs) and a 48.70 open-ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open-ended score) and leading open-source model Qwen3-VL-235B-A22B (49.34 MCQ accuracy, 42.62 open-ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long-tail generalization emerges as the dominant failure mode. Med-CMR thus provides a stress test for visual-reasoning integration and rare-case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.
KRAL: Knowledge and Reasoning Augmented Learning for LLM-assisted Clinical Antimicrobial Therapy
Li, Zhe, Qiu, Yehan, Chen, Yujie, Zhou, Xiang
Clinical antimicrobial therapy requires the dynamic integration of pathogen profiles, host factors, pharmacological properties of antimicrobials, and the severity of infection.This complexity imposes fundamental limitations on the applicability of Large Language Models (LLMs) in high-stakes clinical decision-making including knowledge gaps, data privacy concerns, high deployment costs, and limited reasoning capabilities. This is the first author footnote. A hierarchical evaluation employing diverse teacher-model proxies reduces assessment costs, while modular interface design facilitates seamless system updates. Experimental results demonstrate that KRAL significantly outperforms traditional Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT) methods. It improves knowledge question-answering capability (Accuracy@1 on the external open-source benchmark MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG) and reasoning capability (Pass@1 on the external benchmark PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG), achieved at 20% of SFT's long-term training costs. This establishes KRAL as an effective solution for enhancing local LLMs' clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support. Introduction Antimicrobial therapy constitutes a cornerstone of modern clinical practice. The formulation of an effective regimen necessitates the integration of pathogen-specific factors, host characteristics, pharmacokinetic pharma-codynamic (PK/PD) properties of antimicrobials, and infection severity, all of which are dynamic and interrelated. This places significant cognitive load on clinicians, especially for non-infectious disease specialists or in situations where pathogens are unknown and time is limited, which may result in sub-optimal prescribing decisions, thereby increasing the likelihood of therapeutic failure, antimicrobial toxicity, and the emergence of multidrug-resistant (MDR) pathogens. Large language models (LLMs) have recently emerged as promising tools for enhancing clinical decision support systems (CDSS), owing to their advanced natural language understanding and generation capabilities. Nevertheless, the direct deployment of general-purpose LLMs in high-stakes clinical domains such as antimicrobial therapy is fraught with limitations, including: Knowledge bias: Medical content constitutes <0.3 % of the pre-training corpora[1] in mainstream LLMs (e.g., GPT-3), resulting in limited coverage of rare or emerging pathogens[2-4], outdated guideline adherence[5-9](Appendix A1), and suboptimal performance on atypical presentations. 2 Data privacy and compliance risks: The use of closed-source, cloud-based LLMs (e.g., GPT-4) for processing unencrypted protected health information (PHI) may violate HIPAA/GDPR-equivalent regulations, even under private deployment scenarios if online guideline updates are required[10-12].
Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts
Zhu, Siyu, Bian, Mouxiao, Xie, Yue, Tang, Yongyu, Yu, Zhikang, Li, Tianbin, Chen, Pengcheng, Han, Bing, Xu, Jie, Dong, Xiaoyan
With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that pediatric LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.
Benchmarking GPT-5 for biomedical natural language processing
Hou, Yu, Zhan, Zaifu, Zeng, Min, Wu, Yifan, Zhou, Shuang, Zhang, Rui
Biomedical literature and clinical narratives pose multifaceted challenges for natural language understanding, from precise entity extraction and document synthesis to multi-step diagnostic reasoning. This study extends a unified benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across five core biomedical NLP tasks: named entity recognition, relation extraction, multi-label document classification, summarization, and simplification, and nine expanded biomedical QA datasets covering factual knowledge, clinical reasoning, and multimodal visual understanding. Using standardized prompts, fixed decoding parameters, and consistent inference pipelines, we assessed model performance, latency, and token-normalized cost under official pricing. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets such as MedXpertQA and DiagnosisArena and stable improvements in multimodal QA. In core tasks, GPT-5 achieved better chemical NER and ChemProt scores but remained below domain-tuned baselines for disease NER and summarization. Despite producing longer outputs, GPT-5 showed comparable latency and 30 to 50 percent lower effective cost per correct prediction. Fine-grained analyses revealed improvements in diagnosis, treatment, and reasoning subtypes, whereas boundary-sensitive extraction and evidence-dense summarization remain challenging. Overall, GPT-5 approaches deployment-ready performance for biomedical QA while offering a favorable balance of accuracy, interpretability, and economic efficiency. The results support a tiered prompting strategy: direct prompting for large-scale or cost-sensitive applications, and chain-of-thought scaffolds for analytically complex or high-stakes scenarios, highlighting the continued need for hybrid solutions where precision and factual fidelity are critical.
Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning
Liu, Chi, Li, Derek, Shu, Yan, Chen, Robin, Duan, Derek, Fang, Teng, Dai, Bryan
While large language models show promise in medical applications, achieving expert-level clinical reasoning remains challenging due to the need for both accurate answers and transparent reasoning processes. To address this challenge, we introduce Fleming-R1, a model designed for verifiable medical reasoning through three complementary innovations. First, our Reasoning-Oriented Data Strategy (RODS) combines curated medical QA datasets with knowledge-graph-guided synthesis to improve coverage of underrepresented diseases, drugs, and multi-hop reasoning chains. Second, we employ Chain-of-Thought (CoT) cold start to distill high-quality reasoning trajectories from teacher models, establishing robust inference priors. Third, we implement a two-stage Reinforcement Learning from Verifiable Rewards (RLVR) framework using Group Relative Policy Optimization, which consolidates core reasoning skills while targeting persistent failure modes through adaptive hard-sample mining. Across diverse medical benchmarks, Fleming-R1 delivers substantial parameter-efficient improvements: the 7B variant surpasses much larger baselines, while the 32B model achieves near-parity with GPT-4o and consistently outperforms strong open-source alternatives. These results demonstrate that structured data design, reasoning-oriented initialization, and verifiable reinforcement learning can advance clinical reasoning beyond simple accuracy optimization. We release Fleming-R1 publicly to promote transparent, reproducible, and auditable progress in medical AI, enabling safer deployment in high-stakes clinical environments.
Leveraging Imperfection with MEDLEY A Multi-Model Approach Harnessing Bias in Medical AI
Abtahi, Farhad, Astaraki, Mehdi, Seoane, Fernando
Bias in medical artificial intelligence is conventionally viewed as a defect requiring elimination. However, human reasoning inherently incorporates biases shaped by education, culture, and experience, suggesting their presence may be inevitable and potent ially valuable. We propose MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversit Y), a conceptual framework that orchestrates multiple AI models while preserving their diverse outputs rather than collapsing them into a consensus. Unlike traditional approaches that suppress disagreement, MEDLEY documents model - specific biases as potential strengths and treats hallucinations as provisional hypotheses for clinician verification. A proof - of - concept demonstrator was developed using over 30 large language models, creating a minimum viable product that preserved both consensus and minority views in synthetic cases, making diagnostic uncertainty and latent biases transparent for clinical oversight. While not yet a validated clini cal tool, the demonstration illustrates how structured diversity can enhance medical reasoning under clinician supervision. By reframing AI imperfection as a resource, MEDLEY offers a paradigm shift that opens new regulatory, ethical, and innovation pathwa ys for developing trustworthy medical AI systems.
Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications
Wang, Wenxuan, Ma, Zizhan, Ding, Meidan, Zheng, Shiyi, Liu, Shengyuan, Liu, Jie, Ji, Jiaming, Chen, Wenting, Li, Xiang, Shen, Linlin, Yuan, Yixuan
The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022-2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.