Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning

Suresh, Yogesh Thakku; Hogale, Vishwajeet Shivaji; Zamfira, Luca-Alexandru; Hegde, Anandavardhana

Nov-3-2025, arXiv.org Artificial Intelligence

We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. The system combines a DeiT-Small vision transformer as the image encoder, Medi-CareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to align image and text embeddings semantically, using a hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark the method on the MultiCaRe dataset, comparing its performance on filtered brain-only MRIs with its performance on general MRI images, and evaluate it against state-of-the-art medical image captioning methods, including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
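To make the alignment idea concrete, here is a minimal sketch of what a hybrid cosine-MSE loss and similarity-based caption selection could look like. It is not the authors' implementation: the abstract does not say how the two loss terms are weighted, so the `alpha` parameter, the function names, the toy three-dimensional embeddings, and the reading of "contrastive inference via vector similarity" as nearest-neighbour lookup over candidate caption embeddings are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def hybrid_cosine_mse_loss(image_emb, text_emb, alpha=0.5):
    """Weighted sum of cosine distance and mean squared error.

    alpha is a hypothetical mixing weight; the abstract does not give one.
    """
    cos_term = 1.0 - cosine_similarity(image_emb, text_emb)
    mse_term = sum((a - b) ** 2 for a, b in zip(image_emb, text_emb)) / len(image_emb)
    return alpha * cos_term + (1.0 - alpha) * mse_term


def retrieve_caption(image_emb, caption_bank):
    """Return the caption whose embedding is most similar to the image embedding.

    caption_bank is a list of (caption, embedding) pairs (assumed layout).
    """
    best = max(caption_bank, key=lambda item: cosine_similarity(image_emb, item[1]))
    return best[0]


# Toy embeddings standing in for encoder outputs (illustrative values only).
image_emb = [0.9, 0.1, 0.0]
caption_bank = [
    ("placeholder caption A", [1.0, 0.0, 0.0]),
    ("placeholder caption B", [0.0, 1.0, 0.0]),
]

best_caption = retrieve_caption(image_emb, caption_bank)
loss_a = hybrid_cosine_mse_loss(image_emb, caption_bank[0][1])
loss_b = hybrid_cosine_mse_loss(image_emb, caption_bank[1][1])
print(best_caption, loss_a, loss_b)
```

In this sketch the cosine term penalises differences in direction while the MSE term also penalises differences in magnitude, which is one plausible motivation for combining the two. The better-aligned pair yields the lower loss and is also the one chosen at inference.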

caption, large language model, machine learning

Country:
- North America > United States (0.28)

Genre:
- Research Report
  - New Finding (0.67)
  - Experimental Study (0.47)

Industry:
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language > Large Language Model (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
