Advancing Medical Representation Learning Through High-Quality Data

Negin Baghbanzadeh, Adibvafa Fallahpour, Yasaman Parhizkar, Franklin Ogidi, Shuvendu Roy, Sajad Ashkezari, Vahid Reza Khazaie, Michael Colacci, Ali Etemad, Arash Afkanpour, Elham Dolatabadi

arXiv.org Artificial Intelligence 

Despite the growing scale of medical vision-language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset curated from PubMed Central, containing 2.2 million image-text pairs enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets on retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmarks with an in-depth analysis of feature representations. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC along with the trained models and our codebase.