Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
Neural Information Processing Systems
Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve the reasoning of Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning on meticulously annotated training data. However, this training paradigm may lead to overfitting and cognitive rigidity, which restricts the model's ability to transfer visual reasoning skills under domain shift and limits its real-world applicability. To address these limitations, we propose Reason-RFT, the first two-stage reinforcement fine-tuning framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs, followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning, which generates multiple reasoning-response pairs and significantly enhances the model's ability to handle the domain shift that is ubiquitous in visual reasoning tasks. To evaluate the visual reasoning capabilities of Reason-RFT, we reconstructed a comprehensive dataset covering visual counting, structural perception, and spatial transformation, which serves as a benchmark for systematic assessment across these three core dimensions. Experimental results demonstrate three key advantages: (1) Performance Enhancement: Reason-RFT achieves state-of-the-art results across multiple tasks, outperforming mainstream open-source and proprietary models; (2) Generalization Superiority: it consistently maintains robust performance under domain shift in typical visual reasoning tasks, outperforming alternative training paradigms; (3) Data Efficiency: it excels in few-shot learning scenarios and surpasses full-dataset SFT baselines. Reason-RFT introduces a robust training paradigm for visual reasoning; for details, please refer to the project website: Reason-RFT.
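The abstract does not spell out how the GRPO stage scores its multiple reasoning-response pairs. As a rough illustration only, the sketch below shows the group-relative advantage commonly associated with GRPO: the rewards of several responses sampled for the same prompt are normalized by the group's mean and standard deviation. The function name, the `eps` term, the use of the population standard deviation, and the example rewards are all assumptions made for illustration and are not taken from Reason-RFT.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Assumption: advantage_i = (r_i - group mean) / (group std + eps).
    The abstract does not confirm the exact form used by Reason-RFT.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one visual-counting question
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under this normalization, responses that score above the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline. A group in which every response earns the same reward yields all-zero advantages.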
Jun-14-2026, 12:55:23 GMT
- Country:
- Asia (0.28)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Industry:
- Education (0.67)
- Health & Medicine (0.67)
- Information Technology (0.67)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Cognitive Science > Problem Solving (0.87)
- Natural Language
- Large Language Model (1.00)
- Chatbot (0.68)
- Machine Learning > Neural Networks
- Deep Learning (1.00)