Think Twice to See More: Iterative Visual Reasoning in Medical VLMs
Chen, Kaitao, Rui, Shaohao, Jiang, Yankai, Wu, Jiamin, Zheng, Qihao, Song, Chunfeng, Wang, Xiaosong, Zhou, Mu, Liu, Mianxin
–arXiv.org Artificial Intelligence
Medical vision-language models (VLMs) excel at image-text understanding but typically rely on single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus on, and refine regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a cognitive chain of "think-act-rethink-answer". ViTAR treats medical images as interactive objects, enabling models to engage in multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors. In addition, we curate a 16K-sample visual question answering training set for fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that, from the "think" to the "rethink" round, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both the performance and the trustworthiness of medical AI.

Medical vision-language models (VLMs) have evolved from task-specific architectures to versatile frameworks, advancing large-scale medical image annotation (Xie et al., 2024), outcome prediction (Zhong et al., 2025), and clinical reasoning (Chen et al., 2024a). Powered by large language models (LLMs), systems such as LLaVA-Med (Li et al., 2023) and Lingshu (Xu et al., 2025) can engage in human-like clinical dialogue and act as visual assistants. Nevertheless, current VLMs typically follow a single-pass inference strategy (Zhang et al., 2024), generating predictions from the entire image without explicitly identifying the key visual cues that are vital for decision-making. In medical imaging diagnosis, by contrast, human experts follow an iterative cognitive process built on multiscale observation (Aggarwal et al., 2021). Clinicians begin with a global image examination to locate suspicious regions of interest (ROIs).
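The global-then-focused inspection described above, and the "think-act-rethink-answer" chain it motivates, can be illustrated with a toy sketch. This is not the ViTAR implementation: the image is a plain grid of pixel values, the ROI proposal is a brightest-window heuristic standing in for a model's "think" step, and every name here (`brightest_window`, `think_act_rethink_answer`, `threshold`) is a hypothetical chosen for illustration only.

```python
from typing import List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def crop(image: Image, box: Box) -> Image:
    """'Act' step: zoom into the proposed region of interest."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def brightest_window(image: Image, size: int) -> Box:
    """Hypothetical 'think' step: scan the whole image and propose the
    size x size window with the largest pixel sum as the ROI.
    Assumes the image is at least size x size."""
    best_box, best_sum = (0, 0, size, size), float("-inf")
    for top in range(len(image) - size + 1):
        for left in range(len(image[0]) - size + 1):
            window_sum = sum(
                sum(row[left:left + size]) for row in image[top:top + size]
            )
            if window_sum > best_sum:
                best_box = (top, left, top + size, left + size)
                best_sum = window_sum
    return best_box


def think_act_rethink_answer(image: Image, size: int = 2, threshold: float = 200.0):
    """Run one think -> act -> rethink -> answer pass and keep a trace."""
    trace = []
    box = brightest_window(image, size)          # think: global scan, propose ROI
    trace.append(("think", box))
    roi = crop(image, box)                       # act: interact with the image
    trace.append(("act", roi))
    mean = sum(map(sum, roi)) / (size * size)    # rethink: re-examine only the ROI
    trace.append(("rethink", mean))
    answer = "suspicious" if mean >= threshold else "unremarkable"
    trace.append(("answer", answer))
    return answer, trace


if __name__ == "__main__":
    # 4x4 image with a bright 2x2 patch in rows 1-2, columns 2-3.
    image = [
        [10, 10, 10, 10],
        [10, 10, 250, 250],
        [10, 10, 250, 250],
        [10, 10, 10, 10],
    ]
    answer, trace = think_act_rethink_answer(image)
    for step, value in trace:
        print(step, value)
```

In a real system, the heuristic and the threshold would each be replaced by a model call; the sketch only shows how the final answer depends on a second look at a cropped region instead of on a single pass over the full image.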
Oct-14-2025
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)