Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
