Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Yizhong Wang, Zeqiu Wu
