DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization
