$\alpha$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Open in new window