Medical Large Vision Language Models with Multi-Image Visual Ability

Yang, Xikai, Miao, Juzheng, Yuan, Yuchen, Wang, Jiaze, Dou, Qi, Li, Jinpeng, Heng, Pheng-Ann

May-27-2025–arXiv.org Artificial Intelligence

Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability in processing multi-image clinical scenarios remains underexplored. Unlike single-image tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four types of multi-image visual abilities (temporal understanding, reasoning, comparison, and co-reference). Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical LVLMs, MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we develop the Med-MIM benchmark to comprehensively evaluate the medical multi-image understanding capabilities of LVLMs. We assess eight popular LVLMs, including our two models, on the Med-MIM benchmark. Experimental results show that both Med-Mantis and MIM-LLaVA-Med achieve superior performance on the held-in and held-out subsets of the Med-MIM benchmark, demonstrating that the Med-MIM instruction dataset effectively enhances LVLMs' multi-image understanding capabilities in the medical domain.

large language model, machine learning, question answering

Country:
- Asia > China (0.15)

Genre:
- Research Report > New Finding (0.34)

Industry:
- Health & Medicine
  - Therapeutic Area > Neurology (1.00)
  - Diagnostic Medicine > Imaging (1.00)
  - Health Care Technology (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Vision > Image Understanding (0.55)
  - Natural Language
    - Question Answering (0.57)
    - Large Language Model (0.47)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)
