Direct Preference-based Policy Optimization without Reward Modeling
