Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
