Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025

Gaihre, Sujata, Magar, Amir Thapa, Pokharel, Prasuna, Tiwari, Laxmi

Jul-22-2025–arXiv.org Artificial Intelligence

This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt the Florence model, a large-scale multimodal foundation model, as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the Kvasir dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: https://github.com/TiwariLaxuu/VQA-Florence.git
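
The abstract does not list the specific augmentations used, so the following is only a minimal sketch of what "feature-preserving" augmentation for endoscopic VQA could look like, not the authors' implementation. It assumes two transforms: a small uniform brightness change (which roughly keeps the ratio between color channels) and a horizontal flip that is skipped when the question asks about position, since mirroring would change the correct answer. The names `SPATIAL_WORDS`, `hflip`, `scale_brightness`, and `augment`, the jitter range, and the keyword list are all hypothetical.

```python
import random
import re

# Hypothetical keyword list: questions that mention position are never flipped,
# because a mirrored image would change the correct answer.
SPATIAL_WORDS = {"left", "right", "where"}


def hflip(image):
    """Mirror an image given as rows of (r, g, b) tuples."""
    return [list(reversed(row)) for row in image]


def scale_brightness(image, factor):
    """Multiply every channel by `factor`, clamping to the 0..255 range."""
    return [
        [tuple(min(255, max(0, round(c * factor))) for c in px) for px in row]
        for row in image
    ]


def augment(image, question, rng, max_jitter=0.1):
    """Return a lightly perturbed copy of `image` for one (image, question) pair."""
    # Assumed jitter range: at most +/-10% brightness, applied to all channels alike.
    factor = 1.0 + rng.uniform(-max_jitter, max_jitter)
    out = scale_brightness(image, factor)
    # Whole-word match, so "bright" does not count as "right".
    words = set(re.findall(r"[a-z]+", question.lower()))
    if not (words & SPATIAL_WORDS) and rng.random() < 0.5:
        out = hflip(out)
    return out
```

In a real pipeline the same idea would normally be expressed with an image library's transforms rather than nested lists; the point of the sketch is that the augmentation decision depends on the question as well as the image.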
