Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning
–Neural Information Processing Systems
Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, because LVLMs are mainly trained for vision-language alignment, on-policy reinforcement learning (RL) struggles to develop slow-thinking ability: the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations, since visual perception abilities are mismatched across models.
Jun-13-2026, 23:35:46 GMT