Scalable Valuation of Human Feedback through Provably Robust Model Alignment
Fujisawa, Masahiro, Adachi, Masaki, Osborne, Michael A.
Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy -- for example, annotators may prefer the less desirable response -- posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment method satisfies this property. To address this, we propose Hölder-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of clean data, providing a theoretically grounded metric for dataset valuation that identifies the location and fraction of mislabels. This metric is gradient-free, enabling scalable and automated human feedback valuation without costly manual verification or a clean validation dataset. Hölder-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, we apply Hölder-DPO to widely used alignment datasets, revealing substantial noise levels and demonstrating that removing these mislabels significantly improves alignment performance across methods.
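As a rough illustration of why the redescending property matters, the sketch below looks at the standard DPO per-pair loss, written as -log sigmoid(margin), where the margin is the scaled implicit-reward gap between the chosen and rejected responses. The function names and the scalar-margin parameterization are assumptions made for this example; it is not Hölder-DPO and not the formal analysis behind the abstract's claim. It only shows that, under plain DPO, a pair whose label strongly contradicts the model (a very negative margin, as a flipped label might produce) receives a gradient weight close to 1 rather than a weight that decays toward 0.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_loss(margin: float) -> float:
    """Standard DPO per-pair loss, -log(sigmoid(margin)).

    `margin` is assumed to be beta * (implicit reward of the chosen
    response minus that of the rejected response). Computed as
    softplus(-margin) for numerical stability.
    """
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def dpo_grad_weight(margin: float) -> float:
    """Magnitude of d(loss)/d(margin), which equals sigmoid(-margin)."""
    return sigmoid(-margin)


if __name__ == "__main__":
    # Very negative margins mimic pairs whose label contradicts the model.
    for m in (-20.0, -5.0, 0.0, 5.0, 20.0):
        print(f"margin={m:+6.1f}  loss={dpo_loss(m):8.4f}  "
              f"grad_weight={dpo_grad_weight(m):.6f}")
```

In this toy view, the weight rises monotonically toward 1 as the margin becomes more negative, so the most contradictory pairs pull hardest on the parameters. A loss with the redescending property would instead let the influence of such extreme pairs vanish.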
May-26-2025
- Country:
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Media (0.67)
- Technology: