Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs
Gyutaek Oh, Seoyeon Kim, Sangjoon Park, Byung-Hoon Kim
–arXiv.org Artificial Intelligence
Extending test-time scaling to medicine, m1 [34] adapts s1's methodology, using small datasets with reasoning traces and thinking token budgets to enable lightweight models under 10B parameters to achieve state-of-the-art medical reasoning with a 4K token budget. For comprehensive medical understanding, multimodal models have emerged. HuatuoGPT-Vision [37] integrates visual and textual medical knowledge as a 34B multimodal LLM trained on 1.3 million medical visual question answering (VQA) samples. MedGemma [38] is Google's open-source medical AI collection combining multimodal capabilities, available as a 4B multimodal model built on the Gemma 3 architecture. Beyond architecture, reinforcement learning has been increasingly leveraged for test-time scaling and improved model robustness in vision-language medical models. MedVLM-R1 [33] uses reinforcement learning to generate natural language reasoning alongside answers. Med-R1 [32] employs group relative policy optimization (GRPO) [23] to improve generalizability across various medical imaging modalities, achieving a 29.94% accuracy improvement. Both models demonstrate the effectiveness of reinforcement learning in medical AI, with MedVLM-R1 focusing on reasoning transparency and Med-R1 emphasizing cross-modality generalization.
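The group-relative advantage at the heart of GRPO [23] can be sketched as follows. This is a minimal illustration of the normalization step only, not the training loop used in Med-R1; the function name and reward values are hypothetical, and the std handling (population std, zero-guard) is one common convention rather than a fixed part of the algorithm.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each sampled completion's
    reward is normalized against the mean and std of its own group of
    rollouts for the same prompt, removing the need for a learned
    value (critic) model."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Hypothetical example: 4 rollouts for one medical VQA prompt,
# rewarded 1.0 if the final answer is correct, 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Correct rollouts receive positive advantages and incorrect ones negative, so the policy update pushes probability mass toward the better completions within each group.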
Jun-17-2025