Progressive Local Alignment for Medical Multimodal Pre-training

Huimin Yan, Xian Yang, Liang Bai, Jiye Liang

arXiv.org Artificial Intelligence 

Abstract--Local alignment between medical images and text is essential for accurate diagnosis, yet it remains challenging due to the absence of natural local pairings and the limitations of rigid region recognition methods. Traditional approaches rely on hard region boundaries, which introduce uncertainty, whereas medical imaging demands flexible soft region recognition to handle irregular anatomical structures. To overcome these challenges, we propose the Progressive Local Alignment Network (PLAN), which designs a novel contrastive learning-based approach for local alignment to establish meaningful word-pixel relationships, and introduces a progressive learning strategy to iteratively refine these relationships, enhancing alignment precision and robustness. By combining these techniques, PLAN effectively improves soft region recognition while suppressing noise interference. Extensive experiments on multiple medical datasets demonstrate that PLAN surpasses state-of-the-art methods in phrase grounding, image-text retrieval, object detection, and zero-shot classification, setting a new benchmark for medical image-text alignment.

Medical image-text pre-training aims to bridge visual data (e.g., X-rays) and textual reports (e.g., radiology findings) [1], [2] to enhance the performance of tasks such as disease detection, phrase grounding, and image-text retrieval [3], [4]. Multimodal representations can be categorized into global and local levels [5], [6]. Global representations summarize the entire image or report, whereas local representations focus on specific lesion regions or critical clinical details. Fine-grained local representations are crucial for accurately identifying subtle variations in medical images and interpreting key textual features, which not only improves disease diagnosis accuracy but also facilitates precise lesion localization [7], [8].
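To make the idea of contrastive word-pixel alignment with soft region recognition concrete, the following is a minimal sketch, not PLAN's actual implementation: it assumes a text encoder producing per-word features and an image encoder producing per-patch features, and all names, shapes, and the temperature value are illustrative assumptions. Each word attends softly over all image patches (a soft region) instead of being tied to a hard bounding box, and a contrastive loss pulls each word toward the visual feature aggregated from its own soft region.

```python
# Hypothetical sketch of soft word-patch alignment with a contrastive loss.
# Shapes, names, and tau are assumptions for illustration only.
import torch
import torch.nn.functional as F

def soft_local_alignment_loss(word_feats, patch_feats, tau=0.07):
    """
    word_feats:  (B, W, D) word embeddings from a text encoder.
    patch_feats: (B, P, D) patch embeddings from an image encoder.
    """
    word_feats = F.normalize(word_feats, dim=-1)
    patch_feats = F.normalize(patch_feats, dim=-1)

    # Word-to-patch similarities define soft regions: a distribution over
    # patches per word, rather than a hard region boundary.
    sim = torch.einsum('bwd,bpd->bwp', word_feats, patch_feats)  # (B, W, P)
    soft_regions = sim.softmax(dim=-1)

    # Aggregate a word-specific visual feature from its soft region.
    region_feats = torch.einsum('bwp,bpd->bwd', soft_regions, patch_feats)
    region_feats = F.normalize(region_feats, dim=-1)

    # Contrastive objective: each word should match its own soft region
    # more closely than the regions attended by any other word in the batch.
    B, W, D = word_feats.shape
    logits = word_feats.reshape(B * W, D) @ region_feats.reshape(B * W, D).T
    targets = torch.arange(B * W, device=logits.device)
    return F.cross_entropy(logits / tau, targets)
```

A progressive strategy in the spirit described above could then repeat this step over several rounds, using the previous round's word-region relationships to re-weight the similarities before recomputing the soft regions, so that alignment is refined iteratively rather than fixed after a single pass.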