Advancing Medical Representation Learning Through High-Quality Data

Negin Baghbanzadeh, Adibvafa Fallahpour, Yasaman Parhizkar, Franklin Ogidi, Shuvendu Roy, Sajad Ashkezari, Vahid Reza Khazaie, Michael Colacci, Ali Etemad, Arash Afkanpour, Elham Dolatabadi

arXiv.org Artificial Intelligence 

Despite the growing scale of medical vision-language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset curated from PubMed Central, containing 2.2 million image-text pairs enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets on retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmarks with an in-depth analysis of feature representations. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC along with the trained models and our codebase.