BiomedCLIP
Does Bigger Mean Better? Comparative Analysis of CNNs and Biomedical Vision-Language Models in Medical Diagnosis
Tong, Ran; Liu, Jiaqi; Wang, Tong; Hu, Xin; Liu, Su; Wang, Lanruo; Xu, Jiexi
The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN's 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline's 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.
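The calibration remedy described in this abstract is straightforward to reproduce in principle. Below is a minimal, hypothetical sketch (not the authors' code) of sweeping a decision threshold over validation-set scores to maximize F1; `val_scores` stands in for the VLM's image-text similarity scores and `val_labels` for the ground-truth findings.

```python
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(val_scores, val_labels):
    """Return the decision threshold that maximizes F1 on a validation set."""
    candidates = np.unique(val_scores)  # sweep the observed score range
    f1s = [f1_score(val_labels, val_scores >= t) for t in candidates]
    return candidates[int(np.argmax(f1s))]

# Hypothetical usage with synthetic scores; in practice these would be
# softmax-normalized image-text similarities from a zero-shot VLM.
rng = np.random.default_rng(0)
val_scores = rng.random(200)
val_labels = (val_scores + 0.1 * rng.standard_normal(200)) > 0.5
best_t = calibrate_threshold(val_scores, val_labels)
print(f"calibrated threshold: {best_t:.3f}")  # often far from the 0.5 default
```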
- Asia > China > Guangdong Province > Shenzhen (0.26)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
- North America > United States > California > Orange County > Irvine (0.14)
Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation
Li, Frank; Dapamede, Theo; Chavoshi, Mohammadreza; Jeon, Young Seok; Khosravi, Bardia; Dere, Abdulhameed; Brown-Mulry, Beatrice; Isaac, Rohan Satya; Mansuri, Aawez; Sanyika, Chiratidzo; Newsome, Janice; Purkayastha, Saptarshi; Banerjee, Imon; Trivedi, Hari; Gichoya, Judy
Foundation models (FMs) promise broad generalization across medical imaging tasks, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage: medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and for segmenting salient anatomy (e.g., the heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without significant fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed that FMs rely on confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite: image-only (RAD-DINO) and label-supervised (Ark+) FMs were among the top performers. Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale designs) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.
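Linear probing, the first of the two evaluation protocols used here, trains only a linear classifier on frozen encoder outputs, so the score isolates feature quality from fine-tuning capacity. A minimal sketch with random stand-in features in place of real FM embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen embeddings and report AUROC.
    The encoder is never updated, so the score reflects the quality of
    the pre-trained features rather than fine-tuning capacity."""
    clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    return roc_auc_score(test_labels, clf.predict_proba(test_feats)[:, 1])

# Random stand-in features; in practice these would come from a frozen
# encoder such as RAD-DINO or Ark+.
rng = np.random.default_rng(0)
X_tr, X_te = rng.standard_normal((500, 768)), rng.standard_normal((100, 768))
y_tr, y_te = rng.integers(0, 2, 500), rng.integers(0, 2, 100)
print(f"AUROC: {linear_probe(X_tr, y_tr, X_te, y_te):.3f}")
```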
- North America > United States > Indiana > Marion County > Indianapolis (0.04)
- Africa > Nigeria > Kwara State > Ilorin (0.04)
- North America > United States > Minnesota > Olmsted County > Rochester (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Curia: A Multi-Modal Foundation Model for Radiology
Dancette, Corentin; Khlaut, Julien; Saporta, Antoine; Philippe, Helene; Ferreres, Elodie; Callard, Baptiste; Danielou, Théo; Alberge, Léo; Machado, Léo; Tordjman, Daniel; Dupuis, Julie; Floch, Korentin Le; Terrail, Jean Du; Moshiri, Mariam; Dercle, Laurent; Boeken, Tom; Gregory, Jules; Ronot, Maxime; Legou, François; Roux, Pascal; Sapoval, Marc; Manceron, Pierre; Hérent, Paul
Radiology is at the center of many medical specialties, which rely on radiologists' interpretation of images from various modalities, including CT, MRI, ultrasound, and X-ray [1]. The analysis of these images is crucial for detecting and characterizing medical conditions, quantifying disease progression, and monitoring treatment efficacy across a broad spectrum of diseases. AI has the potential to enhance radiology workflows and improve radiologists' efficiency, particularly for labor-intensive tasks such as image segmentation, or for specialized, complex tasks that are prone to inter-reader variability [2, 3]. To date, the dominant paradigm in radiological AI development has involved training specialized models for individual tasks such as segmentation, abnormality detection (e.g., tumor detection), or pathology classification. However, this "one-task, one-model" approach is exceptionally resource-intensive, as it necessitates the curation and manual annotation of large, task-specific datasets for each modality and clinical application [4, 5], making it a potential bottleneck in moving AI radiology models into the clinical workflow. Foundation models (FMs) represent a significant paradigm shift in the field of AI.
- North America > United States > South Carolina > Charleston County > Charleston (0.14)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)
Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)
Zhou, Yang; Quek, Chrystie Wan Ning; Zhou, Jun; Wang, Yan; Bai, Yang; Ke, Yuhe; Yao, Jie; Gutierrez, Laura; Teo, Zhen Ling; Ting, Darren Shu Jeng; Soetikno, Brian T.; Nielsen, Christopher S.; Elze, Tobias; Li, Zengxiang; Dinh, Linh Le; Cheng, Lionel Tim-Ee; Anh, Tran Nguyen Tuan; Cheng, Chee Leong; Wong, Tien Yin; Liu, Nan; Tan, Iain Beehuat; Lim, Tony Kiat Hon; Goh, Rick Siow Mong; Liu, Yong; Ting, Daniel Shu Wei
Current artificial intelligence models for medical imaging are predominantly single-modality and single-disease. Attempts to create multimodal, multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT), and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundation models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT), 0.982 (pathology), 0.951 (US), 0.943 (CT), 0.931 (skin), 0.894 (CFP), and 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.
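The per-modality AUROCs quoted above follow the standard recipe of scoring each modality's test split separately. A small illustrative sketch with synthetic data (not MerMED-FM outputs):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_by_modality(labels, scores, modalities):
    """Compute one AUROC per imaging modality, mirroring the per-modality
    results quoted above (OCT, pathology, US, CT, skin, CFP, CXR)."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    modalities = np.asarray(modalities)
    return {m: roc_auc_score(labels[modalities == m], scores[modalities == m])
            for m in np.unique(modalities)}

# Synthetic demonstration; scores are loosely correlated with labels.
rng = np.random.default_rng(0)
mods = rng.choice(["OCT", "CXR", "US"], size=300)
y = rng.integers(0, 2, 300)
s = y * 0.6 + rng.random(300) * 0.5
print(auroc_by_modality(y, s, mods))
```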
- Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.14)
- Asia > Singapore > Central Region > Singapore (0.05)
- Asia > China > Beijing > Beijing (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.94)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Comparison of ConvNeXt and Vision-Language Models for Breast Density Assessment in Screening Mammography
Molina-Román, Yusdivia; Gómez-Ortiz, David; Menasalvas-Ruiz, Ernestina; Tamez-Peña, José Gerardo; Santos-Díaz, Alejandro
Mammographic breast density classification is essential for cancer risk assessment but remains challenging due to subjective interpretation and inter-observer variability. This study compares multimodal and CNN-based methods for automated classification under the BI-RADS system, evaluating BioMedCLIP and ConvNeXt across three learning scenarios: zero-shot classification, linear probing with textual descriptions, and fine-tuning with numerical labels. Results show that zero-shot classification achieved modest performance, while the fine-tuned ConvNeXt model outperformed the BioMedCLIP linear probe. Although linear probing demonstrated potential with pretrained embeddings, it was less effective than full fine-tuning. These findings suggest that, despite the promise of multimodal learning, CNN-based models with end-to-end fine-tuning provide stronger performance for specialized medical imaging. The study underscores the need for more detailed textual representations and domain-specific adaptations in future radiology applications.
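For context, zero-shot BI-RADS classification with BioMedCLIP follows the usual CLIP recipe: embed the image and one text prompt per density category, then take a softmax over the similarities. A hedged sketch using the open_clip interface published on the BiomedCLIP model card; the prompt wording and image path are illustrative assumptions:

```python
import torch
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load BiomedCLIP from the Hugging Face Hub (interface per the public
# model card; downloads the checkpoint on first use).
hub_id = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(hub_id)
tokenizer = get_tokenizer(hub_id)

# Hypothetical BI-RADS prompts; richer descriptions may help, per the
# study's call for more detailed textual representations.
prompts = [f"a mammogram with BI-RADS breast density category {c}" for c in "ABCD"]

image = preprocess(Image.open("mammogram.png")).unsqueeze(0)  # placeholder path
texts = tokenizer(prompts)
with torch.no_grad():
    img_feat, txt_feat, logit_scale = model(image, texts)
    probs = (logit_scale * img_feat @ txt_feat.T).softmax(dim=-1)
print(dict(zip("ABCD", probs.squeeze(0).tolist())))
```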
- North America > Mexico > Nuevo León > Monterrey (0.05)
- Europe > Spain > Galicia > Madrid (0.05)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Asia > Singapore (0.04)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (0.73)
Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models
Peng, Wei; Liu, Kang; Hu, Jianchen; Zhang, Meng
Prompt learning is one of the most effective paradigms for adapting pre-trained vision-language models (VLMs) to biomedical image classification in few-shot scenarios. However, most current prompt learning methods use only text prompts and ignore the particular structures of biomedical images, such as complex anatomical structures and subtle pathological features. In this work, we propose Biomed-DPT, a knowledge-enhanced dual-modality prompt tuning technique. For the text prompt, Biomed-DPT constructs a dual prompt comprising template-driven clinical prompts and large language model (LLM)-driven domain-adapted prompts, then extracts clinical knowledge from the domain-adapted prompts through knowledge distillation. For the vision prompt, Biomed-DPT introduces a zero vector as a soft prompt and leverages attention re-weighting to avoid focusing on non-diagnostic regions and non-critical pathological features. Biomed-DPT achieves an average classification accuracy of 66.14% across 11 biomedical image datasets covering 9 modalities and 10 organs, with performance reaching 78.06% on base classes and 75.97% on novel classes, surpassing the Context Optimization (CoOp) method by 6.20%, 3.78%, and 8.04%, respectively. Our code is available at https://github.com/Kanyooo/Biomed-DPT.
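Biomed-DPT extends CoOp-style prompt tuning, in which a handful of learnable context embeddings are prepended to each class name and optimized while both encoder towers stay frozen. A minimal sketch of that underlying paradigm (not the paper's implementation):

```python
import torch
import torch.nn as nn

class LearnableTextPrompt(nn.Module):
    """CoOp-style learnable context: n_ctx trainable 'word' embeddings are
    prepended to each class-name embedding before the frozen text encoder;
    a minimal sketch of the paradigm Biomed-DPT builds on, not its code."""

    def __init__(self, ctx_dim: int, n_ctx: int = 16):
        super().__init__()
        # The context vectors are the only parameters that receive gradients.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)

    def forward(self, class_token_embeds: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (n_classes, n_tokens, ctx_dim)
        n_cls = class_token_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, class_token_embeds], dim=1)

name_embeds = torch.randn(4, 8, 512)  # 4 classes, 8 name tokens each
print(LearnableTextPrompt(512)(name_embeds).shape)  # torch.Size([4, 24, 512])
```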
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis
Li, Frank; Trivedi, Hari; Khosravi, Bardia; Dapamede, Theo; Chavoshi, Mohammadreza; Dere, Abdulhameed; Isaac, Rohan Satya; Mansuri, Aawez; Newsome, Janice; Purkayastha, Saptarshi; Gichoya, Judy
Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three different vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.
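The "global and local features" idea behind the custom segmentation model can be pictured simply: broadcast a pooled global embedding across the patch-token grid and fuse the two before decoding. The sketch below is an assumption-laden illustration of that fusion pattern, not the authors' architecture:

```python
import torch
import torch.nn as nn

class GlobalLocalSegHead(nn.Module):
    """Toy decoder fusing a global image embedding with local patch tokens
    from a frozen ViT encoder; a sketch of the 'global + local' fusion idea,
    not the authors' architecture."""

    def __init__(self, dim: int = 768, grid: int = 14):
        super().__init__()
        self.grid = grid
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, 1, 1))  # single-channel mask logits

    def forward(self, patch_tokens, global_feat):
        # patch_tokens: (B, grid*grid, dim); global_feat: (B, dim)
        B, N, D = patch_tokens.shape
        local_map = patch_tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        global_map = global_feat[:, :, None, None].expand(-1, -1, self.grid, self.grid)
        fused = self.fuse(torch.cat([local_map, global_map], dim=1))
        return self.head(fused)  # upsample to image resolution downstream

tokens, pooled = torch.randn(2, 196, 768), torch.randn(2, 768)
print(GlobalLocalSegHead()(tokens, pooled).shape)  # torch.Size([2, 1, 14, 14])
```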
- North America > United States > Indiana > Marion County > Indianapolis (0.04)
- Africa > Nigeria > Kwara State > Ilorin (0.04)
- North America > United States > Minnesota > Olmsted County > Rochester (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
A Lightweight Large Vision-language Model for Multimodal Medical Images
Alsinglawi, Belal; McCarthy, Chris; Webb, Sara; Fluke, Christopher; Saidy, Navid Toosy
Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. However, developing efficient, high-performance VQA models is challenging due to the complexity of medical imagery and the diversity of modalities. In this paper, we introduce a lightweight, multimodal VQA model integrating BiomedCLIP for image feature extraction and LLaMA-3 for text processing. Designed for medical VQA tasks, our model achieves state-of-the-art performance on the OmniMedVQA dataset. With approximately 8 billion parameters, it requires only two NVIDIA A100 40 GB GPUs, demonstrating superior efficiency over larger models. Our results show 73.4% accuracy on open-ended questions, surpassing existing models and validating the model's potential for real-world medical applications. Key contributions include a specialized multimodal VQA model, a resource-efficient architecture, and strong performance in answering open-ended clinical questions.
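A common way to couple a frozen vision encoder such as BiomedCLIP to an LLM like LLaMA-3 is a small projector that maps image tokens into the LLM's embedding space (the LLaVA recipe). The paper does not spell out its wiring, so the sketch below is a hedged illustration with assumed dimensions:

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Two-layer MLP mapping frozen vision features into the LLM's
    token-embedding space (the common LLaVA-style recipe); dimensions here
    are assumptions, e.g. 4096 for the LLaMA-3-8B hidden size."""

    def __init__(self, vision_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim))

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # (B, n_tokens, vision_dim) -> (B, n_tokens, llm_dim); the result is
        # prepended to the embedded question tokens before the LLM forward pass.
        return self.proj(image_feats)

feats = torch.randn(1, 197, 512)            # e.g. ViT token features
print(VisionToLLMProjector()(feats).shape)  # torch.Size([1, 197, 4096])
```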
- Oceania > Australia > Victoria > Melbourne (0.04)
- Oceania > Australia > Queensland > Brisbane (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision > Image Understanding (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models
Koleilat, Taha; Asgariandehkordi, Hojat; Rivaz, Hassan; Xiao, Yiming
Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive, expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp), intend to tackle these issues but still fall short in generalizability, and explorations of prompt learning for biomedical image analysis remain highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability. The code will be publicly available at https://github.com/HealthX-Lab/BiomedCoOp.
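The "average prompt ensembles" component can be pictured as follows: encode several LLM-generated prompt variants per class, average the normalized embeddings, and re-normalize to obtain one class anchor. A toy sketch with stand-in encoder and tokenizer (the real ones would be BiomedCLIP's text tower):

```python
import torch
import torch.nn.functional as F

def ensemble_text_embedding(encode_text, tokenize, prompts):
    """Average the normalized embeddings of several prompt variants for one
    class, then re-normalize; a sketch of the 'average prompt ensemble' idea,
    with encode_text/tokenize standing in for a CLIP-style text tower."""
    with torch.no_grad():
        emb = encode_text(tokenize(prompts))        # (n_prompts, dim)
        emb = F.normalize(emb, dim=-1).mean(dim=0)  # average on the unit sphere
        return F.normalize(emb, dim=-1)             # one unit-norm class anchor

# Toy stand-ins for demonstration; classification then scores an image
# embedding against each class anchor by cosine similarity.
toy_tokenize = lambda ps: torch.arange(len(ps))
toy_encode = lambda ids: torch.randn(len(ids), 512)
anchor = ensemble_text_embedding(
    toy_encode, toy_tokenize,
    ["an MRI showing a glioma", "axial brain MRI with a tumor"])
print(anchor.shape)  # torch.Size([512])
```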
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Navigating Data Scarcity using Foundation Models: A Benchmark of Few-Shot and Zero-Shot Learning Approaches in Medical Imaging
Woerner, Stefano; Baumgartner, Christian F.
Data scarcity is a major limiting factor for applying modern machine learning techniques to clinical tasks. Although sufficient data exists for some well-studied medical tasks, there remains a long tail of clinically relevant tasks with poor data availability. Recently, numerous foundation models have demonstrated high suitability for few-shot learning (FSL) and zero-shot learning (ZSL), potentially making them more accessible to practitioners. However, it remains unclear which foundation model performs best on FSL medical image analysis tasks and what the optimal methods are for learning from limited data. We conducted a comprehensive benchmark study of ZSL and FSL using 16 pretrained foundation models on 19 diverse medical imaging datasets. Our results indicate that BiomedCLIP, a model pretrained exclusively on medical data, performs best on average for very small training set sizes, while very large CLIP models pretrained on LAION-2B perform best with slightly more training samples. However, simply fine-tuning a ResNet-18 pretrained on ImageNet performs similarly with more than five training examples per class. Our findings also highlight the need for further research on foundation models specifically tailored for medical applications and the collection of more datasets to train these models.
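The few-shot protocol and the surprisingly strong ResNet-18 baseline are both simple to set up. A hedged sketch of k-shot subset sampling plus the ImageNet-pretrained ResNet-18 fine-tuning baseline (the paper's exact protocol may differ):

```python
import numpy as np
import torch
import torchvision

def k_shot_indices(labels, k, seed=0):
    """Sample k training examples per class, the FSL setup benchmarked above."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    return np.concatenate([
        rng.choice(np.where(labels == c)[0], size=k, replace=False)
        for c in np.unique(labels)])

labels = np.array([0, 1] * 50)
print(k_shot_indices(labels, k=5))  # five training indices per class

# The simple baseline the study highlights: fully fine-tune an
# ImageNet-pretrained ResNet-18 on the k-shot subset (weights download
# on first use).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # binary task head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```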
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Europe > Switzerland > Lucerne > Lucerne (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)