Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
