Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning

Jun-13-2026, 23:35:46 GMT–Neural Information Processing Systems

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, because LVLMs are mainly trained with vision-language alignment, it is difficult to use on-policy reinforcement learning (RL) to develop slow-thinking ability: the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models.

artificial intelligence, machine learning, reinforcement learning

