Learning to Reason under Off-Policy Guidance