Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning
