Learning to Reason under Off-Policy Guidance

Jun-21-2026, 13:02:39 GMT–Neural Information Processing Systems

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently "on-policy": they limit learning to a model's own outputs, so the model cannot acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an average gain of over +6.4 points across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY overcomes the fundamental limitations of on-policy RLVR and demonstrate the potential of off-policy guidance in RLVR.
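
The abstract does not spell out the update rule, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes that mixed-policy GRPO normalizes rewards over one group containing both on-policy rollouts and off-policy traces, and it uses a hypothetical shaping function f(p) = p / (p + gamma) as a stand-in for regularized importance sampling. The function names and the value gamma = 0.1 are illustrative choices, not details confirmed by the text above.

```python
from statistics import mean, pstdev


def mixed_group_advantages(on_policy_rewards, off_policy_rewards, eps=1e-6):
    """GRPO-style advantages, normalized over the mixed group (assumption).

    Rewards from on-policy rollouts and off-policy traces are pooled,
    then each reward is standardized by the pooled mean and std.
    """
    rewards = list(on_policy_rewards) + list(off_policy_rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the whole group
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    n_on = len(on_policy_rewards)
    return advantages[:n_on], advantages[n_on:]


def shape(prob, gamma=0.1):
    """Hypothetical shaping function f(p) = p / (p + gamma)."""
    return prob / (prob + gamma)


def shape_grad(prob, gamma=0.1):
    """Derivative of shape() w.r.t. prob: gamma / (p + gamma)^2."""
    return gamma / (prob + gamma) ** 2


# A weak model fails every rollout; one off-policy trace is correct.
on_adv, off_adv = mixed_group_advantages([0.0, 0.0, 0.0], [1.0])

# With on-policy rollouts alone, all rewards are equal and advantages vanish.
only_on_adv, _ = mixed_group_advantages([0.0, 0.0, 0.0], [])
```

In this toy setting the off-policy trace receives a positive advantage while the failed rollouts receive negative ones, whereas the purely on-policy group yields all-zero advantages and therefore no learning signal. The shaping derivative is larger for low-probability tokens (`shape_grad(0.01)` exceeds `shape_grad(0.9)`), which is one way such a function could keep the model from imitating only the tokens it already finds likely.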

Country:
- Asia > China (0.46)
- North America > United States (0.28)

Genre:
- Research Report > Experimental Study (1.00)
- Overview (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Cognitive Science (1.00)
  - Natural Language > Large Language Model (0.71)
  - Representation & Reasoning > Optimization (0.67)
  - Machine Learning
    - Reinforcement Learning (0.67)
    - Neural Networks > Deep Learning (0.48)