RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability

Jun-12-2026, 04:16:51 GMT–Neural Information Processing Systems

Recent advancements in multimodal models have significantly improved vision-language (VL) alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning and offer limited interpretability through attention probability visualizations. To address these challenges, we introduce RadZero, a novel framework for VL alignment in chest X-ray with zero-shot multi-task capability.

artificial intelligence, large language model, natural language

Industry:
- Health & Medicine
  - Diagnostic Medicine > Imaging (1.00)
  - Nuclear Medicine (0.80)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (0.80)
  - Artificial Intelligence > Natural Language
    - Large Language Model (0.46)