Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation

Wang, Shansong; Jin, Zhecheng; Hu, Mingzhe; Safari, Mojtaba; Zhao, Feng; Chang, Chih-Wei; Qiu, Richard LJ; Roper, Justin; Yu, David S.; Yang, Xiaofeng

Jul-1-2025–arXiv.org Artificial Intelligence

CLIP models pretrained on billion-scale natural image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual question answering. Transferring this success to biomedicine, however, is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. Together, these limitations make it difficult to train a unified, generalizable biomedical foundation model from scratch. To overcome them, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, then performs feature-level distillation using over 19.2 million feature pairs extracted from the teacher models. We evaluate MMKD-CLIP on 58 diverse biomedical datasets, encompassing over 10.8 million biomedical images across nine image modalities. The evaluation spans six core task types: zero-shot classification, linear probing, cross-modal retrieval, visual question answering, survival prediction, and cancer diagnosis. MMKD-CLIP consistently outperforms all teacher models while remaining robust and generalizing well across image domains and task settings. These results indicate that multi-teacher knowledge distillation is a scalable and effective paradigm for building high-performing biomedical foundation models under the practical constraints of real-world data availability.
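
The two training stages described above can be illustrated with a minimal, dependency-free sketch. The abstract does not give the exact objectives, so everything here is an assumption made for illustration: stage one is shown as the standard symmetric contrastive (InfoNCE) loss used in CLIP-style training, and stage two as a mean cosine distance between student features and each teacher's features. The function names, the `temperature` value, and the premise that teacher features already share the student's dimensionality are all hypothetical and may differ from the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_loss(image_feats, text_feats, temperature=0.07):
    """Stage 1 (assumed form): symmetric image-to-text and text-to-image loss."""
    logits = [[cosine(img, txt) / temperature for txt in text_feats]
              for img in image_feats]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

def feature_distill_loss(student_feats, teacher_feats_list):
    """Stage 2 (assumed form): mean (1 - cosine) over all teachers and samples.

    Assumes each teacher's features are already in the student's feature space.
    """
    total, count = 0.0, 0
    for teacher_feats in teacher_feats_list:
        for s, t in zip(student_feats, teacher_feats):
            total += 1.0 - cosine(s, t)
            count += 1
    return total / count
```

In this sketch, `clip_loss` is small when each image embedding is closest to its own caption's embedding, and `feature_distill_loss` reaches zero when the student reproduces every teacher's feature direction exactly. In practice, teachers with different embedding sizes would need some mapping into a shared space before a loss of this kind could be applied.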

artificial intelligence, machine learning, natural language, (18 more...)


Country:
- North America > United States (0.14)
- Asia
  - Middle East > Republic of Türkiye
    - Ankara Province > Ankara (0.04)
  - China > Guangdong Province
    - Shantou (0.04)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Education (1.00)
- Health & Medicine
  - Nuclear Medicine (1.00)
  - Health Care Technology (1.00)
  - Diagnostic Medicine > Imaging (1.00)
  - Therapeutic Area
    - Infections and Infectious Diseases (1.00)
    - Ophthalmology/Optometry (1.00)
    - Dermatology (1.00)
    - Gastroenterology (1.00)
    - Cardiology/Vascular Diseases (1.00)
    - Neurology (1.00)
    - Pulmonary/Respiratory Diseases (1.00)
    - Immunology (0.93)
    - Oncology
      - Carcinoma (0.46)
      - Colorectal Cancer (0.46)
      - Breast Cancer (0.46)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language (1.00)
    - Machine Learning
      - Neural Networks > Deep Learning (0.92)
      - Performance Analysis (0.92)
      - Statistical Learning (0.67)
