Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

Jun-10-2026, 00:01:18 GMT - Neural Information Processing Systems

Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods enhance Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously annotated data. However, this approach may lead to overfitting and cognitive rigidity, limiting the model's generalization ability under domain shifts and reducing real-world applicability. To overcome these limitations, we propose Reason-RFT, a two-stage reinforcement fine-tuning framework for visual reasoning.

artificial intelligence, machine learning, proceedings

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.77)