Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning

Jan-19-2025, 01:04:56 GMT–Neural Information Processing Systems

Learning medical visual representations directly from paired radiology reports has become an emerging topic in representation learning. However, existing medical image-text joint learning methods are limited to instance-level or local supervision and ignore disease-level semantic correspondences. In this paper, we present a novel Multi-Granularity Cross-modal Alignment (MGCA) framework for generalized medical visual representation learning that harnesses the naturally exhibited semantic correspondences between medical images and radiology reports at three different levels, i.e., pathological region-level, instance-level, and disease-level. Specifically, we first incorporate an instance-wise alignment module that maximizes the agreement between image-report pairs. Further, for token-wise alignment, we introduce a bidirectional cross-attention strategy to explicitly learn the matching between fine-grained visual tokens and text tokens, followed by contrastive learning to align them.
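
To make the instance-wise alignment step more concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of image and report embeddings. It is not the authors' implementation: the abstract only states that agreement between image-report pairs is maximized, so the use of cosine similarity, the function names, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; fall back to 1.0 to avoid division by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_sim_matrix(img_embs, txt_embs):
    # Entry [i][j] is the cosine similarity between image i and report j.
    img = [l2_normalize(v) for v in img_embs]
    txt = [l2_normalize(v) for v in txt_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txt] for i in img]

def cross_entropy_diag(logits):
    # Mean cross-entropy where the correct "class" for row k is column k,
    # i.e. the k-th image is paired with the k-th report.
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)

def instance_alignment_loss(img_embs, txt_embs, temperature=0.07):
    # temperature=0.07 is an assumed value, not taken from the abstract.
    sims = cosine_sim_matrix(img_embs, txt_embs)
    logits_i2t = [[s / temperature for s in row] for row in sims]
    logits_t2i = [list(col) for col in zip(*logits_i2t)]  # transpose
    # Average the image-to-report and report-to-image directions.
    return 0.5 * (cross_entropy_diag(logits_i2t) + cross_entropy_diag(logits_t2i))

if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    reports_matched = [[1.0, 0.0], [0.0, 1.0]]
    reports_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(instance_alignment_loss(images, reports_matched))
    print(instance_alignment_loss(images, reports_shuffled))
```

In this toy example, correctly paired embeddings give a loss close to zero, while shuffled pairs give a much larger loss, which is the behaviour a contrastive objective of this kind is meant to reward. The token-wise alignment described in the abstract would apply a similar contrastive step after cross-attention between visual and text tokens; that part is not sketched here.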

correspondence, generalized medical visual representation learning, multi-granularity cross-modal alignment

Industry:
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:
- Information Technology
  - Data Science (0.89)
  - Artificial Intelligence > Machine Learning (0.81)
  - Sensing and Signal Processing > Image Processing (0.72)