Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
