Direct Alignment with Heterogeneous Preferences
Ali Shirali, Arash Nasr-Esfahany, Abdullah Alomar, Parsa Mirtaheri, Rediet Abebe, Ariel Procaccia
–arXiv.org Artificial Intelligence
This tension in assumptions is readily apparent in standard human-AI alignment methods, such as reinforcement learning from human feedback (RLHF) [6, 7, 8] and direct preference optimization (DPO) [9], which assume that a single reward function captures the interests of the entire population. We examine the limits of this preference homogeneity assumption when individuals belong to distinct user types, each characterized by its own reward function. Recent work has shown that in this setting the homogeneity assumption can lead to unexpected behavior [10, 11, 12]. One challenge is that, under this assumption, learning from human preferences becomes unrealizable: a single reward function cannot capture the complexity of population preferences generated by multiple reward functions [13, 14]. Both RLHF and DPO rely on maximum likelihood estimation (MLE) to optimize the reward or policy. Unrealizability implies that their likelihood functions cannot fully represent the underlying preference data distribution, resulting in a nontrivial optimal MLE solution. From another perspective, learning a universal reward or policy from a heterogeneous population inherently involves aggregating diverse interests, and this aggregation is nontrivial. In the quest for a single policy that accommodates a heterogeneous population with multiple user types, we show that the only universal reward yielding a well-defined alignment problem is an affine combination of the individual rewards.
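To make the unrealizability point concrete, here is a minimal sketch under the Bradley-Terry preference model that underlies both RLHF and DPO (the symbols $r_k$, $\alpha_k$, and $\sigma$ are notation introduced for this illustration, not taken from the paper). A single reward $r$ models pairwise preferences as $P(y_1 \succ y_2 \mid x) = \sigma(r(x, y_1) - r(x, y_2))$, where $\sigma$ is the logistic function. If the population instead consists of $K$ user types, where type $k$ holds reward $r_k$ and makes up a fraction $\alpha_k$ of annotators, the observed preference data follow a mixture:

$$
P(y_1 \succ y_2 \mid x) \;=\; \sum_{k=1}^{K} \alpha_k \,\sigma\!\big(r_k(x, y_1) - r_k(x, y_2)\big).
$$

A mixture of logistic terms is, in general, not itself logistic in any single reward difference, so no single $r$ reproduces this distribution across all prompts and response pairs. The MLE objectives of RLHF and DPO therefore fit a model family that does not contain the data-generating distribution, which is the sense in which the learning problem becomes unrealizable.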
Feb-22-2025
arXiv:2502.16320v1