MedGemma


Comparative Evaluation of Generative AI Models for Chest Radiograph Report Generation in the Emergency Department

Lim, Woo Hyeon, Lee, Ji Young, Lee, Jong Hyuk, Kim, Saehoon, Kim, Hyungjin

arXiv.org Artificial Intelligence

Purpose: To benchmark open-source or commercial medical image-specific VLMs against real-world radiologist-written reports. Methods: This retrospective study included adult patients who presented to the emergency department between January 2022 and April 2025 and underwent same-day CXR and CT for febrile or respiratory symptoms. Reports from five VLMs (AIRead, Lingshu, MAIRA-2, MedGemma, and MedVersa) and radiologist-written reports were randomly presented and blindly evaluated by three thoracic radiologists using four criteria: RADPEER, clinical acceptability, hallucination, and language clarity. Comparative performance was assessed using generalized linear mixed models, with radiologist-written reports treated as the reference. Finding-level analyses were also performed with CT as the reference. Results: A total of 478 patients (median age, 67 years [interquartile range, 50-78]; 282 men [59.0%]) were included. AIRead demonstrated the lowest RADPEER 3b rate (5.3% [76/1434] vs. radiologists 13.9% [200/1434]; P<.001), whereas other VLMs showed higher disagreement rates (16.8-43.0%; P<.05). Clinical acceptability was highest with AIRead (84.5% [1212/1434] vs. radiologists 74.3% [1065/1434]; P<.001), while other VLMs performed worse (41.1-71.4%; P<.05). Hallucinations were rare with AIRead, comparable to radiologists (0.3% [4/1425] vs. 0.1% [1/1425]; P=.21), but frequent with other models (5.4-17.4%; P<.05). Language clarity was higher with AIRead (82.9% [1189/1434]), Lingshu (88.0% [1262/1434]), and MedVersa (88.4% [1268/1434]) compared with radiologists (78.1% [1120/1434]; P<.05). Sensitivity varied substantially across VLMs for common findings: AIRead, 15.5-86.7%; Lingshu, 2.4-86.7%; MAIRA-2, 6.0-72.0%; MedGemma, 4.8-76.7%; and MedVersa, 20.2-69.3%. Conclusion: Medical VLMs for CXR report generation exhibited variable performance in report quality and diagnostic measures.
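
The comparative analysis rests on generalized linear mixed models with radiologist-written reports as the reference level. As an illustration only (not the authors' code), a logistic mixed model for the per-report clinical-acceptability ratings might look like the sketch below; the column names, the crossed random effects for reader and patient, and the variational fit are all assumptions.

```python
import pandas as pd
import statsmodels.genmod.bayes_mixed_glm as bmg

# Hypothetical long-format ratings: one row per (report, reader) pair,
# with columns assumed to be: acceptable, source, reader, patient.
df = pd.read_csv("ratings.csv")
df["source"] = pd.Categorical(
    df["source"],
    categories=["radiologist", "AIRead", "Lingshu", "MAIRA-2", "MedGemma", "MedVersa"],
)  # radiologist-written reports as the reference level

model = bmg.BinomialBayesMixedGLM.from_formula(
    "acceptable ~ C(source)",                                  # fixed effect of source
    {"reader": "0 + C(reader)", "patient": "0 + C(patient)"},  # crossed random effects
    df,
)
result = model.fit_vb()   # variational Bayes fit
print(result.summary())  # each VLM's log-odds of acceptability vs. radiologists
```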


How many patients could we save with LLM priors?

Arai, Shota, Selby, David, Vargo, Andrew, Vollmer, Sebastian

arXiv.org Artificial Intelligence

Imagine a world where clinical trials need far fewer patients to achieve the same statistical power, thanks to the knowledge encoded in large language models (LLMs). We present a novel framework for hierarchical Bayesian modeling of adverse events in multi-center clinical trials, leveraging LLM-informed prior distributions. Unlike data augmentation approaches that generate synthetic data points, our methodology directly obtains parametric priors from the model. Our approach systematically elicits informative priors for hyperparameters in hierarchical Bayesian models using a pre-trained LLM, enabling the incorporation of external clinical expertise directly into Bayesian safety modeling. Through comprehensive temperature sensitivity analysis and rigorous cross-validation on real-world clinical trial data, we demonstrate that LLM-derived priors consistently improve predictive performance compared to traditional meta-analytical approaches. This methodology paves the way for more efficient and expert-informed clinical trial design, enabling substantial reductions in the number of patients required for robust safety assessment, with the potential to transform drug safety monitoring and regulatory decision making.
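
The core idea is to replace vague hyperpriors in a hierarchical adverse-event model with LLM-elicited parametric priors. A minimal PyMC sketch of that structure is shown below; the toy counts and the "LLM-elicited" hyperprior values (mu0, sigma0) are illustrative assumptions, not numbers from the paper.

```python
import numpy as np
import pymc as pm

events = np.array([3, 1, 4, 0, 2])          # adverse events per center (toy data)
n_patients = np.array([40, 35, 50, 30, 45])

# Hypothetical LLM-elicited prior for the population log-odds of an event,
# e.g. "roughly an 8% event rate, with moderate certainty":
mu0, sigma0 = -2.5, 0.5

with pm.Model():
    mu = pm.Normal("mu", mu=mu0, sigma=sigma0)   # LLM-informed hyperprior
    tau = pm.HalfNormal("tau", sigma=1.0)        # between-center heterogeneity
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(events))
    pm.Binomial("y", n=n_patients, p=pm.math.sigmoid(theta), observed=events)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

print(float(idata.posterior["mu"].mean()))       # pooled log-odds event-rate estimate
```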


Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

Gosai, Advait, Kavishwar, Arun, McNamara, Stephanie L., Samineni, Soujanya, Umeton, Renato, Chowdhury, Alexander, Lotter, William

arXiv.org Artificial Intelligence

Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts (MLLMs) in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5's predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, but showed improvements when provided examples through few-shot prompting. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.
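
The localization protocol overlays a labeled spatial grid on each radiograph before eliciting coordinate-based answers. A minimal reconstruction of such an overlay step might look like the following; the grid size, label scheme, and prompt wording are assumptions, not the paper's exact pipeline.

```python
from PIL import Image, ImageDraw

def overlay_grid(path: str, n: int = 9) -> Image.Image:
    """Draw an n-by-n labeled grid over a radiograph for coordinate prompting."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for i in range(1, n):                  # vertical and horizontal grid lines
        draw.line([(i * w // n, 0), (i * w // n, h)], fill="red", width=2)
        draw.line([(0, i * h // n), (w, i * h // n)], fill="red", width=2)
    for r in range(n):                     # cell labels, e.g. "A1" .. "I9"
        for c in range(n):
            label = f"{chr(ord('A') + r)}{c + 1}"
            draw.text((c * w // n + 4, r * h // n + 4), label, fill="red")
    return img

# The MLLM is then prompted with the annotated image, e.g.:
# "Which grid cell(s) contain the pneumothorax? Answer with cell labels only."
overlay_grid("cxr.png", n=9).save("cxr_grid.png")
```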


One VLM, Two Roles: Stage-Wise Routing and Specialty-Level Deployment for Clinical Workflows

Vassef, Shayan, Shimegekar, Soorya Ram, Goyal, Abhay, Saha, Koustuv, Zonooz, Pi, Kumar, Navin

arXiv.org Artificial Intelligence

Clinical ML workflows are often fragmented and inefficient: triage, task selection, and model deployment are handled by a patchwork of task-specific networks. These pipelines are rarely aligned with data-science practice, reducing efficiency and increasing operational cost. They also lack data-driven model identification (from imaging/tabular inputs) and standardized delivery of model outputs. We present a framework that employs a single vision-language model (VLM) in two complementary, modular roles. First (Solution 1): the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model via a three-stage workflow (modality -> primary abnormality -> model-card ID). Reliability is improved by (i) stage-wise prompts enabling early termination via "None"/"Other" and (ii) a calibrated top-2 answer selector with a stage-wise cutoff. This raises routing accuracy by +9 and +11 percentage points on the training and held-out splits, respectively, compared with a baseline router, and improves held-out calibration (lower Expected Calibration Error, ECE). Second (Solution 2): we fine-tune the same VLM on specialty-specific datasets so that one model per specialty covers multiple downstream tasks, simplifying deployment while maintaining performance. Across gastroenterology, hematology, ophthalmology, pathology, and radiology, this single-model deployment matches or approaches specialized baselines. Together, these solutions reduce data-science effort through more accurate selection, simplify monitoring and maintenance by consolidating task-specific models, and increase transparency via per-stage justifications and calibrated thresholds. Each solution stands alone, and in combination they offer a practical, modular path from triage to deployment.
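
A sketch of the stage-wise routing with a calibrated top-2 selector, as we read the description, appears below; the function names, cutoff values, and the binary re-query step are assumptions rather than the authors' implementation.

```python
from typing import Callable, List, Optional

def route_stage(option_probs: dict, cutoff: float,
                requery: Callable[[List[str]], str]) -> Optional[str]:
    """One routing stage: accept a confident top-1 answer; below the cutoff,
    re-ask the VLM to choose between the top-2 candidates; None terminates."""
    ranked = sorted(option_probs, key=option_probs.get, reverse=True)
    best = ranked[0]
    if best in ("None", "Other"):              # early-termination escape hatch
        return None
    if option_probs[best] >= cutoff:           # calibrated top-1 is confident
        return best
    choice = requery(ranked[:2])               # binary re-query over the top-2
    return None if choice in ("None", "Other") else choice

# Toy usage: the lambda stands in for a second VLM call restricted to two options.
stage_cutoffs = {"modality": 0.6, "abnormality": 0.5, "model_card": 0.5}  # assumed
pick = route_stage({"CXR": 0.55, "CT": 0.40, "Other": 0.05},
                   stage_cutoffs["modality"],
                   requery=lambda top2: top2[0])
print(pick)  # "CXR", chosen via the top-2 re-query under the 0.6 cutoff
```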


Google-MedGemma-Based Abnormality Detection in Musculoskeletal Radiographs

Maity, Soumyajit, Kamboj, Pranjal, Maity, Sneha, Singh, Rajat, Chatterjee, Sankhadeep

arXiv.org Artificial Intelligence

This paper proposes a MedGemma-based framework for automatic abnormality detection in musculoskeletal radiographs. Departing from conventional autoencoder and neural network pipelines, the proposed method leverages the MedGemma foundation model, incorporating a SigLIP-derived vision encoder pretrained on diverse medical imaging modalities. Preprocessed X-ray images are encoded into high-dimensional embeddings using the MedGemma vision backbone, which are subsequently passed through a lightweight multilayer perceptron for binary classification. Experimental assessment reveals that the MedGemma-driven classifier exhibits strong performance, exceeding conventional convolutional and autoencoder-based baselines. Additionally, the model leverages MedGemma's transfer learning capabilities, enhancing generalization and reducing feature-engineering effort. The integration of a modern medical foundation model not only enhances representation learning but also facilitates modular training strategies such as selective encoder block unfreezing for efficient domain adaptation. The findings suggest that MedGemma-powered classification systems can advance clinical radiograph triage by providing scalable and accurate abnormality detection, with potential for broader applications in automated medical image analysis. Keywords: Google MedGemma, MURA, Medical Image, Classification.
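
The described pipeline (frozen vision-encoder embeddings feeding a lightweight MLP head) could be sketched in PyTorch as follows; the embedding width (1152, the SigLIP-So400m dimension), hidden size, and dropout are assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

class AbnormalityHead(nn.Module):
    """Lightweight MLP over frozen image embeddings for binary classification."""
    def __init__(self, embed_dim: int = 1152, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 1),          # single logit: abnormal vs. normal
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb).squeeze(-1)

# 'embeddings' would come from the frozen MedGemma/SigLIP vision backbone;
# here a random batch stands in for extracted radiograph embeddings.
embeddings = torch.randn(8, 1152)
labels = torch.randint(0, 2, (8,)).float()
head = AbnormalityHead()
loss = nn.BCEWithLogitsLoss()(head(embeddings), labels)
loss.backward()                            # only the MLP head receives gradients
```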


Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

Zun, Lee Qi, Halim, Mohamad Zulhilmi Bin Abdul, Fye, Goh Man

arXiv.org Artificial Intelligence

Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.
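
A QLoRA setup in the Hugging Face ecosystem typically looks like the sketch below, assuming peft plus bitsandbytes; the model id, target modules, and hyperparameters here are illustrative, not the paper's configuration.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                         # the "Q" in QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-4b-it", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,    # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)            # only low-rank adapters are trained
model.print_trainable_parameters()
```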


MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

Renzulli, Riccardo, Lepoutre, Colas, Cassano, Enrico, Grangetto, Marco

arXiv.org Artificial Intelligence

Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGemma foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
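
A minimal sparse-autoencoder sketch in PyTorch, illustrating the general technique rather than the MedSAE architecture: an overcomplete hidden layer with an L1 sparsity penalty applied to MedCLIP-style embeddings. The dimensions and penalty weight are assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder whose hidden activations act as sparse codes."""
    def __init__(self, d_in: int = 512, d_hidden: int = 4096):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.enc(x))        # sparse codes (post-ReLU activations)
        return self.dec(z), z

sae = SparseAutoencoder()
x = torch.randn(32, 512)                   # stand-in MedCLIP image embeddings
x_hat, z = sae(x)
loss = nn.functional.mse_loss(x_hat, x) + 1e-3 * z.abs().mean()  # recon + L1
loss.backward()
```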


A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist

Jeon, Sohyeon, Lee, Hyung-Chul

arXiv.org Artificial Intelligence

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge. In particular, uncertainty calibration and metacognitive reliability of LLM reasoning are poorly understood and underexplored in medical automation. This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset, systematically comparing two representative LLMs -- one general and one domain-specialized -- across three prompt strategies. We analyze both cognitive adaptation and calibration error using two metrics: Expected Calibration Error (ECE) and a baseline-normalized Relative Calibration Error (RCE) that enables reliable cross-model comparison. Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions, with calibration error persisting above clinically relevant thresholds. These findings underscore the need for improved calibration, transparent code, and strategic prompt engineering for the development of reliable and explainable medical AI.
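
Expected Calibration Error with standard equal-width binning can be computed as in the sketch below; the exact normalization behind the paper's RCE is an assumption and is indicated only in a comment.

```python
import numpy as np

def ece(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Equal-width-binned Expected Calibration Error."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap     # weight the gap by bin occupancy
    return total

conf = np.array([0.9, 0.8, 0.95, 0.6, 0.7])   # toy stated confidences
corr = np.array([1, 0, 1, 1, 0])              # toy correctness indicators
print(ece(conf, corr))
# A baseline-normalized RCE might divide a model's ECE by the ECE of a
# reference prompt condition: rce = ece_model / ece_baseline (assumption).
```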


Simulating Clinical AI Assistance using Multimodal LLMs: A Case Study in Diabetic Retinopathy

Barakat, Nadim, Lotter, William

arXiv.org Artificial Intelligence

Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening. Current FDA-cleared systems primarily provide binary referral outputs, and this minimal output may limit clinical trust and utility. Yet, determining the most effective output format to enhance clinician-AI performance is an empirical challenge that is difficult to assess at scale. We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types. Two models were tested on IDRiD and Messidor-2: GPT-4o, a general-purpose MLLM, and MedGemma, an open-source medical model. Experiments included: (1) baseline evaluation, (2) simulated AI assistance with synthetic predictions, and (3) actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs. MedGemma outperformed GPT-4o at baseline, achieving higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity. Both models adjusted predictions based on simulated AI inputs, but GPT-4o's performance collapsed with incorrect ones, whereas MedGemma remained more stable. In actual collaboration, GPT-4o achieved strong results when guided by MedGemma's descriptive outputs, even without direct image access (AUROC up to 0.96). These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations. Open, lightweight models such as MedGemma may be especially valuable in low-resource settings, while descriptive outputs could enhance explainability and clinician trust in clinical workflows.
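
The "simulated AI assistance" condition amounts to injecting a synthetic assistant prediction into the MLLM prompt alongside the image. A sketch of how such prompts could be assembled follows; the prompt wording and output variants are illustrative assumptions, not the study's exact text.

```python
from typing import Optional

def build_prompt(assistant_output: Optional[str]) -> str:
    """Assemble a DR-grading prompt, optionally with a simulated AI prediction."""
    base = ("You are grading a fundus photograph for referable diabetic "
            "retinopathy. Answer 'refer' or 'no refer'.")
    if assistant_output is None:
        return base                        # baseline condition: no AI assistance
    return (f"{base}\nAn AI screening tool reports: {assistant_output}\n"
            "Consider this output, but rely on your own assessment.")

# Output formats of increasing richness, mirroring the study's comparison of
# binary referral outputs versus descriptive ones:
binary = "refer"
descriptive = ("refer; multiple microaneurysms and dot hemorrhages in the "
               "temporal macula, consistent with moderate NPDR")
for variant in (None, binary, descriptive):
    print(build_prompt(variant), end="\n---\n")
```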


PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

Carrillo-Larco, Rodrigo M., Melgarejo, Jesus Lovón, Castillo-Cara, Manuel, Bravo-Rocca, Gusseppe

arXiv.org Artificial Intelligence

BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance is transferable to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: To build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; and to evaluate and compare the accuracy of vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 medical domains (2018-2025). We selected eight medical LLMs, including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions appropriately. We employed parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it using all questions except those from 2025 (test set). RESULTS: medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances. LLMs with <10 billion parameters exhibited <60% correct answers, while some exams yielded results <50%. The fine-tuned version of medgemma-4b-it outperformed all LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI applications and research that require knowledge bases from Spanish-speaking countries or those with epidemiological profiles similar to Peru's, interested parties should utilize medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.
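
Zero-shot MCQA evaluation on a PeruMedQA-style item could be scripted as below; the field names, the Spanish prompt template, and the model-call stub are assumptions, not the paper's code or data.

```python
def format_question(item: dict) -> str:
    """Render one multiple-choice item as a zero-shot Spanish prompt."""
    options = "\n".join(f"{k}) {v}" for k, v in item["options"].items())
    return ("Responde la siguiente pregunta de medicina con una sola letra.\n"
            f"Pregunta: {item['question']}\n{options}\nRespuesta:")

# Toy item with an assumed schema (question / options / answer):
item = {
    "question": "¿Cuál es el agente etiológico más frecuente de la "
                "neumonía adquirida en la comunidad?",
    "options": {"A": "Streptococcus pneumoniae", "B": "Mycoplasma pneumoniae",
                "C": "Haemophilus influenzae", "D": "Klebsiella pneumoniae"},
    "answer": "A",
}
prompt = format_question(item)
# prediction = llm_generate(prompt)        # hypothetical model call
prediction = "A"                           # stand-in for the model's letter
print("correct:", prediction.strip().startswith(item["answer"]))
```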