A Lightweight Large Vision-language Model for Multimodal Medical Images
Belal Alsinglawi, Chris McCarthy, Sara Webb, Christopher Fluke, Navid Toosy Saidy
Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. However, developing efficient, high-performance VQA models is challenging due to the complexity of medical imagery and the diversity of imaging modalities. In this paper, we introduce a lightweight, multimodal VQA model that integrates BiomedCLIP for image feature extraction and LLaMA-3 for text processing. Designed for medical VQA tasks, our model achieves state-of-the-art performance on the OmniMedVQA dataset. With approximately 8 billion parameters, it requires only two NVIDIA A100 40 GB GPUs, demonstrating superior efficiency over larger models. It achieves 73.4% accuracy on open-ended questions, surpassing existing models and validating its potential for real-world medical applications. Key contributions include a specialized multimodal VQA model, a resource-efficient architecture, and strong performance in answering open-ended clinical questions.
arXiv.org Artificial Intelligence
Apr-9-2025
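
The abstract names the two building blocks (BiomedCLIP image features, a LLaMA-3 language model) but not how they are fused. The sketch below is a minimal, hypothetical illustration of one common fusion approach: a linear layer projects the BiomedCLIP image embedding into LLaMA-3's token-embedding space, and the result is prepended to the embedded question before generation. The projection layer, prompt handling, and specific checkpoints (`microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224`, `meta-llama/Meta-Llama-3-8B-Instruct`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a BiomedCLIP + LLaMA-3 VQA pipeline. The linear
# projection and single-visual-token fusion are assumptions for
# illustration; the abstract does not specify the paper's mechanism.
import torch
import torch.nn as nn
from PIL import Image
from open_clip import create_model_from_pretrained
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Image encoder: BiomedCLIP (frozen), as named in the abstract.
clip_model, preprocess = create_model_from_pretrained(
    "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224")
clip_model = clip_model.to(device).eval()

# Language model: an 8B LLaMA-3 variant, matching the reported size.
llm_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(
    llm_name, torch_dtype=torch.bfloat16).to(device).eval()

# Hypothetical projection from BiomedCLIP's 512-d image embedding to the
# LLaMA-3 hidden size; in practice this would be trained on VQA data.
proj = nn.Linear(512, llm.config.hidden_size).to(device, torch.bfloat16)

@torch.no_grad()
def answer(image_path: str, question: str, max_new_tokens: int = 64) -> str:
    # Encode the medical image and project it to one "visual token".
    pixels = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img_feat = clip_model.encode_image(pixels).to(torch.bfloat16)
    visual_token = proj(img_feat).unsqueeze(1)          # (1, 1, hidden)

    # Embed the clinical question and prepend the visual token.
    ids = tokenizer(question, return_tensors="pt").input_ids.to(device)
    text_emb = llm.get_input_embeddings()(ids)          # (1, T, hidden)
    inputs_embeds = torch.cat([visual_token, text_emb], dim=1)
    attn = torch.ones(inputs_embeds.shape[:2],
                      dtype=torch.long, device=device)

    out = llm.generate(inputs_embeds=inputs_embeds, attention_mask=attn,
                       max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(answer("chest_xray.png", "Is there evidence of pleural effusion?"))
```

At ~8B parameters in bfloat16, the language model fits comfortably on a single A100 40 GB; the two-GPU figure from the abstract plausibly reflects training rather than inference.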