AITopics | medical llm

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models

Neural Information Processing Systems, Feb-11-2026, 10:35:31 GMT

However, there is little to no understanding of the notion of medical safety in the context of LLMs, let alone how to evaluate and improve it. To address this gap, we first define the notion of medical safety in LLMs based on the Principles of Medical Ethics set forth by the American Medical Association.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Virginia > Albemarle County > Charlottesville (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.93)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.93)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

YpathRAG: A Retrieval-Augmented Generation Framework and Benchmark for Pathology

Yu, Deshui, Wang, Yizhi, Jin, Saihui, Zhu, Taojie, Zeng, Fanyi, Qian, Wen, Huang, Zirui, Ouyang, Jingli, Li, Jiameng, Song, Zhen, Guan, Tian, He, Yonghong

arXiv.org Artificial Intelligence, Oct-13-2025

Large language models (LLMs) excel on general tasks yet still hallucinate in high-barrier domains such as pathology. Prior work often relies on domain fine-tuning, which neither expands the knowledge boundary nor enforces evidence-grounded constraints. We therefore build a pathology vector database covering 28 subfields and 1.53 million paragraphs, and present YpathRAG, a pathology-oriented RAG framework with dual-channel hybrid retrieval (BGE-M3 dense retrieval coupled with vocabulary-guided sparse retrieval) and an LLM-based supportive-evidence judgment module that closes the retrieval-judgment-generation loop. We also release two evaluation benchmarks, YpathR and YpathQA-M. On YpathR, YpathRAG attains Recall@5 of 98.64%, a gain of 23 percentage points over the baseline; on YpathQA-M, a set of the 300 most challenging questions, it increases the accuracies of both general and medical LLMs by 9.0% on average and up to 15.6%. These results demonstrate improved retrieval quality and factual reliability, providing a scalable construction paradigm and interpretable evaluation for pathology-oriented RAG.
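For readers unfamiliar with the Recall@5 figure quoted in the abstract, the following is a minimal sketch of one common definition of Recall@k: the fraction of queries for which at least one relevant passage appears among the top k retrieved results. The abstract does not say which variant YpathR uses, so this definition, the function name `recall_at_k`, and the toy passage ids are assumptions for illustration only.

```python
from typing import Sequence, Set


def recall_at_k(
    ranked: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int = 5,
) -> float:
    """Hit-style Recall@k: share of queries with a relevant id in the top k.

    Assumed definition; the benchmark in the abstract may compute it differently.
    """
    if len(ranked) != len(relevant):
        raise ValueError("ranked and relevant must have the same length")
    if not ranked:
        raise ValueError("need at least one query")
    # A query counts as a hit if any gold passage id is among its first k results.
    hits = sum(
        1 for docs, gold in zip(ranked, relevant) if gold.intersection(docs[:k])
    )
    return hits / len(ranked)


# Toy data (hypothetical passage ids, not taken from the paper).
ranked = [["p3", "p7", "p1"], ["p9", "p2", "p4"]]
relevant = [{"p1"}, {"p5"}]
score = recall_at_k(ranked, relevant, k=5)  # first query hits, second misses
```

With these toy inputs, one of the two queries retrieves a relevant passage, so the score is one half; lowering k to 2 drops the first query's hit as well.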

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.08603

Country: Asia > China (0.15)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

3ac952d0264ef7a505393868a70a46b6-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems, Oct-9-2025, 23:44:32 GMT

llm, medical safety, safety, (14 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Virginia > Albemarle County > Charlottesville (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Law (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Government (1.00)
(9 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Advances in Large Language Models for Medicine

Kan, Zhiyu, Gan, Wensheng, Qi, Zhenlian, Yu, Philip S.

arXiv.org Artificial Intelligence, Sep-24-2025

Artificial intelligence (AI) technology has advanced rapidly in recent years, with large language models (LLMs) emerging as a significant breakthrough. LLMs are increasingly making an impact across various industries, with the medical field standing out as the most prominent application area. This paper systematically reviews the up-to-date research progress of LLMs in the medical field, providing an in-depth analysis of training techniques for large medical models, their adaptation in healthcare settings, related applications, as well as their strengths and limitations. Furthermore, it innovatively categorizes medical LLMs into three distinct types based on their training methodologies and classifies their evaluation approaches into two categories. Finally, the study proposes solutions to existing challenges and outlines future research directions based on identified issues in the field of medical LLMs. By systematically reviewing previous and advanced research findings, we aim to highlight the necessity of developing medical LLMs, provide a deeper understanding of their current state of development, and offer clear guidance for subsequent research.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.1869

Country:

Europe (0.67)
Asia > China (0.28)
North America > United States (0.28)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)
Research Report > New Finding (0.92)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)
(9 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SparseDoctor: Towards Efficient Chat Doctor with Mixture of Experts Enhanced Large Language Models

Zhang, Jianbin, Zhu, Yulin, Lo, Wai Lun, Hsung, Richard Tai-Chiu, Tsang, Harris Sik-Ho, Zhou, Kai

arXiv.org Artificial Intelligence, Sep-23-2025

Large language models (LLMs) have achieved great success in medical question answering and clinical decision-making, promoting the efficiency and popularization of the personalized virtual doctor in society. However, the traditional fine-tuning strategies on LLM require the updates of billions of parameters, substantially increasing the training cost, including the training time and utility cost. To enhance the efficiency and effectiveness of the current medical LLMs and explore the boundary of the representation capability of the LLMs on the medical domain, apart from the traditional fine-tuning strategies from the data perspective (i.e., supervised fine-tuning or reinforcement learning from human feedback), we instead craft a novel sparse medical LLM named SparseDoctor armed with contrastive learning enhanced LoRA-MoE (low rank adaptation-mixture of experts) architecture. To this end, the crafted automatic routing mechanism can scientifically allocate the computational resources among different LoRA experts supervised by the contrastive learning. Additionally, we also introduce a novel expert memory queue mechanism to further boost the efficiency of the overall framework and prevent the memory overflow during training. We conduct comprehensive evaluations on three typical medical benchmarks: CMB, CMExam, and CMMLU-Med. Experimental results demonstrate that the proposed LLM can consistently outperform the strong baselines such as the HuatuoGPT series.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.14269

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Health Care Technology > Telehealth (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs

Ding, Hongxin, Huang, Baixiang, Fang, Yue, Liao, Weibin, Jiang, Xinke, Li, Zheng, Zhao, Junfeng, Wang, Yasha

arXiv.org Artificial Intelligence, Aug-20-2025

Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a reinforcement learning (RL) framework that transitions medical LLMs toward a proactive paradigm, equipping them with the ability to ask clinically valuable questions before decision-making. At the core of ProMed is the Shapley Information Gain (SIG) reward, which quantifies the clinical utility of each question by combining the amount of newly acquired information with its contextual importance, estimated via Shapley values. We integrate SIG into a two-stage training pipeline: (1) SIG-Guided Model Initialization uses Monte Carlo Tree Search (MCTS) to construct high-reward interaction trajectories to supervise the model, and (2) SIG-Augmented Policy Optimization, which integrates SIG and enhances RL with a novel SIG-guided Reward Distribution Mechanism that assigns higher rewards to informative questions for targeted optimization. Extensive experiments on two newly curated partial-information medical benchmarks demonstrate that ProMed significantly outperforms state-of-the-art methods by an average of 6.29% and delivers a 54.45% gain over the reactive paradigm, while also generalizing robustly to out-of-domain cases.
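The abstract says contextual importance is estimated via Shapley values. As background only, here is a minimal sketch of an exact Shapley value computation over a small set of "players" (here, hypothetical patient facts). It is not ProMed's SIG reward: the value function `toy_value`, the fact names, and the weights are invented for illustration, and exact enumeration is only feasible for a handful of players.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen: FrozenSet[str] = frozenset()
        for p in order:
            with_p = seen | {p}
            # Marginal gain from learning fact p after the facts already seen.
            totals[p] += value(with_p) - value(seen)
            seen = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(known: FrozenSet[str]) -> float:
    """Hypothetical utility: fever and cough only help together; rash helps alone."""
    score = 0.0
    if {"fever", "cough"} <= known:
        score += 0.6
    if "rash" in known:
        score += 0.4
    return score


phi = shapley_values(["fever", "cough", "rash"], toy_value)
```

In this toy setup the two facts that are only useful together split their joint utility equally, while the independent fact keeps all of its own, and the three values sum to the utility of knowing everything.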

information, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2508.13514

Country:

Asia (0.46)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)
Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Therapeutic Area > Rheumatology (0.94)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation

Zhang, Xiechi, Ouyang, Zetian, Wang, Linlin, de Melo, Gerard, Cao, Zhu, Wang, Xiaoling, Zhang, Ya, Wang, Yanfeng, He, Liang

arXiv.org Artificial Intelligence, May-20-2025

With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlaps to measure quality, significantly overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may as well suffer from inaccuracies due to limits in human expertise and motivation. Although there are some evaluation methods based on LLMs, their usability in the medical field is limited due to their proprietary nature or lack of expertise. To tackle these challenges, we present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation. Specifically, we propose a hierarchical training method involving curriculum instruction tuning and an iterative knowledge introspection mechanism, enabling AutoMedEval to acquire professional medical assessment capabilities with limited instructional data. Human evaluations indicate that AutoMedEval surpasses other baselines in terms of correlation with human judgments.
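The abstract's point about token-overlap metrics can be made concrete with a minimal sketch of a token-level F1 score. The whitespace tokenization, the function name `token_f1`, and the example sentences are assumptions for illustration; they are not taken from the paper or from any particular evaluation toolkit.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 using lowercase whitespace tokens (an assumed, simple variant)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        # Two empty strings match perfectly; one empty string matches nothing.
        return float(cand == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pair: one token differs, but the clinical meaning is opposite.
f1 = token_f1("the patient has hypotension", "the patient has hypertension")
```

Three of the four tokens match, so the score stays high even though the single differing term reverses the clinical meaning, which is the kind of blind spot the abstract describes.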

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2505.11887

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Maryland > Baltimore (0.04)
Europe > United Kingdom (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.94)

Industry:

Health & Medicine > Therapeutic Area > Oncology (0.68)
Health & Medicine > Diagnostic Medicine > Imaging (0.68)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Large Language Models for Cancer Communication: Evaluating Linguistic Quality, Safety, and Accessibility in Generative AI

Saha, Agnik, Churchill, Victoria, Rodriguez, Anny D., Kursuncu, Ugur, Idris, Muhammed Y.

arXiv.org Artificial Intelligence, May-19-2025

Effective communication about breast and cervical cancers remains a persistent health challenge, with significant gaps in public understanding of cancer prevention, screening, and treatment, potentially leading to delayed diagnoses and inadequate treatments. This study evaluates the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information to support patient understanding. We evaluated five general-purpose and three medical LLMs using a mixed-methods evaluation framework across linguistic quality, safety and trustworthiness, and communication accessibility and affectiveness. Our approach utilized quantitative metrics, qualitative expert ratings, and statistical analysis using Welch's ANOVA, Games-Howell, and Hedges' g. Our results show that general-purpose LLMs produced outputs of higher linguistic quality and affectiveness, while medical LLMs demonstrate greater communication accessibility. However, medical LLMs tend to exhibit higher levels of potential harm, toxicity, and bias, reducing their performance in safety and trustworthiness. Our findings indicate a duality between domain-specific knowledge and safety in health communications. The results highlight the need for intentional model design with targeted improvements, particularly in mitigating harm and bias, and improving safety and affectiveness. This study provides a comprehensive evaluation of LLMs for cancer communication, offering critical insights for improving AI-generated health content and informing future development of accurate, safe, and accessible digital health tools.
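Among the statistics named in the abstract, Hedges' g is an effect size: a standardized mean difference with a small-sample bias correction. The sketch below uses the pooled standard deviation and the common approximate correction factor 1 - 3 / (4(n1 + n2) - 9). Whether the study used this exact variant is not stated in the abstract, so treat the formula choice, the function name `hedges_g`, and the toy ratings as assumptions.

```python
import math
from statistics import mean, variance
from typing import Sequence


def hedges_g(a: Sequence[float], b: Sequence[float]) -> float:
    """Hedges' g with pooled SD and the approximate small-sample correction."""
    n1, n2 = len(a), len(b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled standard deviation from the two sample variances (n - 1 denominators).
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    d = (mean(a) - mean(b)) / math.sqrt(pooled_var)  # Cohen's d
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # shrinks d slightly for small samples
    return d * correction


# Toy ratings for two hypothetical model groups (not data from the study).
g = hedges_g([4, 5, 6], [1, 2, 3])
```

For these toy groups the uncorrected difference is three pooled standard deviations, and the correction factor of 0.8 reduces it because each group has only three observations.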

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.10472

Country: North America > United States (0.68)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.40)

Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages

Salmè, Marco, Sicilia, Rosa, Soda, Paolo, Guarrasi, Valerio

arXiv.org Artificial Intelligence, May-5-2025

The integration of artificial intelligence in healthcare has opened new horizons for improving medical diagnostics and patient care. However, challenges persist in developing systems capable of generating accurate and contextually relevant radiology reports, particularly in low-resource languages. In this study, we present a comprehensive benchmark to evaluate the performance of instruction-tuned Vision-Language Models (VLMs) in the specialized task of radiology report generation across three low-resource languages: Italian, German, and Spanish. Employing the LLaVA architectural framework, we conducted a systematic evaluation of pre-trained models utilizing general datasets, domain-specific datasets, and low-resource language-specific datasets. In light of the unavailability of models that possess prior knowledge of both the medical domain and low-resource languages, we analyzed various adaptations to determine the most effective approach for these contexts. The results revealed that language-specific models substantially outperformed both general and domain-specific models in generating radiology reports, emphasizing the critical role of linguistic adaptation. Additionally, models fine-tuned with medical terminology exhibited enhanced performance across all languages compared to models with generic knowledge, highlighting the importance of domain-specific training. Our findings highlight the importance of tailored language and domain-specific training for improving the quality and accuracy of radiological reports in multilingual settings. This research not only advances our understanding of VLMs adaptability in healthcare but also points to significant avenues for future investigations into model tuning and language-specific adaptations. INTRODUCTION. Foundation Models (FMs) [1] represent a groundbreaking advancement in artificial intelligence, bringing significant improvements across numerous disciplines, including medicine [2].

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.01096

Country: Europe (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies

Jiang, Luyi, Chen, Jiayuan, Lu, Lu, Peng, Xinwei, Liu, Lihao, He, Junjun, Xu, Jie

arXiv.org Artificial Intelligence, Mar-10-2025

In recent years, large language models (LLMs), empowered by massive text corpora and deep learning techniques, have demonstrated breakthrough advancements in cross-domain knowledge transfer and human-machine dialogue interactions [1]. Within the healthcare domain, LLMs are increasingly deployed across nine core application scenarios, including intelligent diagnosis, personalized treatment, and drug discovery, garnering significant attention from both academia and industry [2, 3]. A particularly important area of focus is the development and evaluation of Chinese medical LLMs, which face unique challenges due to the specialized nature of medical knowledge and the high-stakes implications of clinical decision-making. Hence, ensuring the reliability and safety of these models has become critical, necessitating rigorous evaluation frameworks [4]. Current research on medical LLM evaluation exhibits two predominant trends. On one hand, general-domain benchmarks (e.g., HELM [5], MMLU [6]) assess foundational model capabilities through medical knowledge tests. On the other hand, specialized medical evaluation systems (e.g., MedQA [7], C-Eval-Medical [8]) emphasize clinical reasoning and ethical compliance. Notably, the MedBench framework [9], jointly developed by institutions including Shanghai AI Laboratory, has emerged as the most influential benchmark for Chinese medical LLMs. By establishing a standardized evaluation system spanning five dimensions (including medical language comprehension, complex reasoning, and safety ethics), it has attracted participation from hundreds of research teams.

arxiv, examination, language model, (14 more...)

arXiv.org Artificial Intelligence

2503.07306

Country: Asia > China > Shanghai > Shanghai (0.25)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)
