Principled Foundations for Preference Optimization
Wenxuan Zhou, Shujian Zhang, Brice Magdalou, John Lambert, Ehsan Amid, Richard Nock, Andrew Hard
The connection is established for all of Savage's losses, building on recent work that extends the DPO framework to generalize its functional parts (Alfano et al., 2025; Azar et al., 2024; Chen et al.). The latter involves elements from Doignon-Falmagne's stochastic choice theory. These design elements lead to a generalization that makes the most of the connection: we encompass all of properness on Savage's side (regardless of optional properties like symmetry), and all of the modelling power on the side of Krantz, Luce, Suppes and Tversky. Notably, our level of generalization is able to support "for free" important extensions of the DPO setting. This is an important task because DPO was designed to simplify RLHF, and getting "above" DPO is necessary to improve results by gaining more freedom on reward shapes, trajectories and preference behaviours (Gupta et al., 2025). One perhaps unexpected pitfall comes from the "gold" reward assumption inherited from RLHF/DPO. To preserve readability, all proofs are given in an appendix. We adopt many definitions from Rafailov et al. (2023).
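For context, here is a minimal sketch of the standard DPO objective from Rafailov et al. (2023) that the paper starts from and generalizes; the function name and tensor arguments below are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss (Rafailov et al., 2023).

    Each argument holds summed token log-probabilities for a batch of
    (prompt, chosen, rejected) triples; beta scales the implicit reward,
    i.e. the log policy/reference probability ratio.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: the logistic pairwise loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

DPO fixes the pairwise link to this logistic (log-sigmoid) loss; the paper's framework covers the full family of Savage's proper losses, of which this is one instance.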
arXiv.org Artificial Intelligence
Aug-6-2025
- Country:
- Asia > Thailand
- Europe
- Austria > Vienna (0.14)
- France > Occitanie
- Hérault > Montpellier (0.04)
- Italy > Calabria
- Catanzaro Province > Catanzaro (0.04)
- Spain > Valencian Community
- Valencia Province > Valencia (0.04)
- United Kingdom > England
- Oxfordshire > Oxford (0.04)
- North America
- Canada > British Columbia
- United States
- Florida > Miami-Dade County
- Miami (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Maryland > Baltimore (0.04)
- New York (0.04)
- Genre:
- Research Report (0.82)