DPO Meets PPO: Reinforced Token Optimization for RLHF
