Discriminative Policy Optimization for Token-Level Reward Models
