Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models

Gautam, Sushant, Riegler, Michael A., Halvorsen, Pål

Sep-3-2025–arXiv.org Artificial Intelligence

--We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMulti-Points, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Medical imaging comes with numerous challenges, such as detecting subtle abnormalities, processing images captured under varied conditions, and managing high data volumes, all of which complicate automated analysis [1].

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

Sep-3-2025

arXiv.org PDF

Add feedback

Country:
- Europe > Norway (0.14)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision > Image Understanding (0.84)
    - Natural Language > Large Language Model (0.69)
    - Representation & Reasoning > Diagnosis (0.54)
    - Cognitive Science > Problem Solving (0.54)
    - Machine Learning > Neural Networks
      - Deep Learning (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found