Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs

Oh, Gyutaek, Kim, Seoyeon, Park, Sangjoon, Kim, Byung-Hoon

arXiv.org Artificial Intelligence 

Extending test-time scaling to medicine, m1 [34] adapts s1's methodology using small datasets with reasoning traces and thinking token budgets, enabling lightweight models under 10B parameters to achieve state-of-the-art medical reasoning with a 4K token budget. For comprehensive medical understanding, multimodal models have also emerged. HuatuoGPT-Vision [37] integrates visual and textual medical knowledge in a 34B multimodal LLM trained on 1.3 million medical visual question answering (VQA) samples. MedGemma [38] is Google's open-source medical AI collection combining multimodal capabilities, available as a 4B multimodal model built on the Gemma 3 architecture. Beyond architecture, reinforcement learning has been increasingly leveraged for test-time scaling and improved robustness in medical vision-language models. MedVLM-R1 [33] uses reinforcement learning to generate natural language reasoning alongside answers. Med-R1 [32] employs group relative policy optimization (GRPO) [23] to improve generalizability across medical imaging modalities, achieving a 29.94% accuracy improvement. Both models demonstrate the effectiveness of reinforcement learning in medical AI, with MedVLM-R1 focusing on reasoning transparency and Med-R1 emphasizing cross-modality generalization.
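The core idea behind GRPO is to replace a learned value (critic) model with group-relative reward normalization: several completions are sampled per prompt, and each one's advantage is its reward standardized against the mean and standard deviation of its own group. A minimal sketch of this normalization step (the function name, reward values, and group size are illustrative, not from the cited works):

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: normalize each sampled
    completion's reward by the mean and std of its own group, so no
    separate critic model is needed to estimate a baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# A group of G=4 answers sampled for one medical VQA prompt, scored
# by a simple correctness reward (1 = correct, 0 = incorrect):
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))
```

Correct answers receive positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the better completions within each group.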