Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA

Itbaan Safwan, Muhammad Annas Shaikh, Muhammad Haaris, Ramail Khan, Muhammad Atif Tahir

arXiv.org Artificial Intelligence 

We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
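The abstract does not spell out implementation details, but the described setup maps naturally onto a LoRA adapter attached to a Florence-2 checkpoint, with task-specific prompt prefixes used to interleave the three datasets during training. The sketch below shows one plausible configuration using Hugging Face Transformers and PEFT; the task tokens (`<MedVQA>`, `<MedExplain>`, `<MedGround>`), the LoRA hyperparameters, and the target module names are illustrative assumptions, not the authors' reported settings.

```python
# Minimal sketch of LoRA-tuning Florence-2 for multi-task medical VQA.
# The model ID is a real Hugging Face checkpoint; the prompts, LoRA
# hyperparameters, and dataset-mixing scheme are assumptions made for
# illustration, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float32
)

# Attach low-rank adapters to the attention projections; module names
# and rank are assumed, not taken from the paper.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)

# One shared model, three hypothetical task prefixes, one per dataset:
TASK_PROMPTS = {
    "vqa": "<MedVQA>",          # question answering (Kvasir-VQA-x1)
    "explain": "<MedExplain>",  # structured explanation generation
    "ground": "<MedGround>",    # text-to-region grounding
}

def build_inputs(task, image, text, target):
    """Pack one training example; batches drawn from the three
    datasets are interleaved so every step mixes all tasks."""
    prompt = TASK_PROMPTS[task] + text
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs["labels"] = processor.tokenizer(
        target, return_tensors="pt"
    ).input_ids
    return inputs
```

Under this reading, interleaving batches from the three sources gives every optimization step a mix of answering, explaining, and grounding supervision, which is the mechanism the abstract credits for the gains over single-task baselines.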
