Neural Information Processing Systems
After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average, a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of $(\frac{1}{2} + o(1)) \cdot \beta$ (for the BT temperature $\beta$), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer at least $(1 - o(1)) \cdot \beta$ distortion already without a KL constraint, and $e^{\Omega(\beta)}$ or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
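To make the two central objects concrete, the following minimal sketch computes a Bradley-Terry comparison probability and the average-utility ratio for one toy instance. It is an illustration under stated assumptions, not the paper's code: the function names, the toy utilities, and the form `sigmoid(beta * (u_a - u_b))` for the BT model are assumptions, and the sketch covers only the setting without a KL constraint, where the optimal achievable average utility is that of the best single alternative. Distortion as defined above is the worst case of this ratio over instances; the code evaluates only one instance.

```python
import math


def bt_prob(u_a, u_b, beta):
    """Assumed BT form: probability that a user with utilities u_a and u_b
    prefers alternative a over b, at temperature beta."""
    return 1.0 / (1.0 + math.exp(-beta * (u_a - u_b)))


def average_utility(utilities, policy):
    """Average utility over users of a policy (a distribution over alternatives).

    utilities[i][a] is the utility of user i for alternative a.
    """
    n_users = len(utilities)
    return sum(
        policy[a] * sum(user[a] for user in utilities) / n_users
        for a in range(len(policy))
    )


def utility_ratio(utilities, policy):
    """Optimal average utility divided by the policy's average utility,
    for a single instance and without a KL constraint."""
    n_users = len(utilities)
    n_alts = len(policy)
    best = max(sum(user[a] for user in utilities) / n_users for a in range(n_alts))
    return best / average_utility(utilities, policy)


# Hypothetical toy instance: one user strongly prefers alternative 0,
# two users mildly prefer alternative 1.
toy_utilities = [[1.0, 0.0], [0.0, 0.25], [0.0, 0.25]]
majority_policy = [0.0, 1.0]  # always outputs the majority-preferred alternative

ratio = utility_ratio(toy_utilities, majority_policy)
```

In this toy instance a majority of users prefers alternative 1 in pairwise comparisons, yet alternative 0 has twice the average utility, so a policy that follows the majority has a ratio of about 2.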
Jun-15-2026, 17:21:36 GMT