Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
Naz, Zubia, Asghar, Farhan, Hussain, Muhammad Ishfaq, Hadadi, Yahya, Rafique, Muhammad Aasim, Choi, Wookjin, Jeon, Moongu
arXiv.org Artificial Intelligence
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean $\pm$ std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, no_repeat_ngram_size $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
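As a rough illustration of the decoding settings quoted in the abstract, the sketch below writes them with the keyword names used by the Hugging Face `generate` API (`num_beams`, `length_penalty`, `no_repeat_ngram_size`, `max_length`). The abstract does not say which library was used, so that pairing is an assumption. The helper functions `banned_next_tokens` and `beam_score` are hypothetical names introduced here; they show what a 3-gram repetition block and a length-normalized beam score typically look like, and are not the authors' implementation.

```python
# Decoding settings from the abstract. The keyword names follow the
# Hugging Face `generate` convention, which is an assumption here.
DECODING_KWARGS = {
    "num_beams": 4,             # beam size = 4
    "length_penalty": 1.1,      # length penalty = 1.1
    "no_repeat_ngram_size": 3,  # no 3-gram may appear twice
    "max_length": 128,          # maximum caption length in tokens
}


def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already in `tokens`.

    Hypothetical helper: a plain-Python view of what an n-gram
    repetition block has to forbid at the next decoding step.
    """
    if n <= 0 or len(tokens) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the n-gram being completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


def beam_score(sum_logprob, length, length_penalty):
    """Length-normalized beam score: sum of log-probs / length**penalty.

    One common normalization, assumed here for illustration. Log-probs
    are negative, so a penalty above 1 favors longer hypotheses.
    """
    return sum_logprob / (length ** length_penalty)


caption = ["ct", "scan", "of", "the", "chest", "ct", "scan"]
blocked = banned_next_tokens(caption, DECODING_KWARGS["no_repeat_ngram_size"])
```

Here `blocked` contains only `"of"`, because emitting it would repeat the 3-gram "ct scan of". With the abstract's length penalty of 1.1, `beam_score` divides by slightly more than the raw length, which nudges beam search toward longer captions than plain per-token averaging would.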
Nov-14-2025
- Country:
  - Asia
    - Middle East > Saudi Arabia
      - Eastern Province > Al-Ahsa Governorate > Al-Hofuf (0.04)
    - South Korea > Gwangju
      - Gwangju (0.05)
  - Europe > Spain
    - Andalusia > Granada Province > Granada (0.04)
  - North America > United States
    - Pennsylvania > Philadelphia County > Philadelphia (0.04)
- Genre:
  - Research Report > Experimental Study (0.48)
- Industry:
  - Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (0.68)
    - Natural Language > Large Language Model (0.47)
    - Vision (1.00)