Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients
