$\alpha$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs