Doubly Robust Alignment for Large Language Models
Neural Information Processing Systems
While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice.
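The "doubly robust" property described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm. It is a generic augmented inverse-propensity-weighted estimator of a policy's expected outcome over three discrete responses. As a loose analogy, the outcome model `outcome_hat` stands in for the preference model and the logging propensities `propensity_hat` stand in for the reference policy. Every name and number here is a hypothetical chosen for illustration.

```python
def doubly_robust_value(logged, target, propensity_hat, outcome_hat):
    """Generic AIPW estimate of the expected outcome under `target`.

    logged:          list of (action, outcome) pairs from the logging policy
    target:          target-policy probabilities, indexed by action
    propensity_hat:  estimated logging-policy probabilities (may be wrong)
    outcome_hat:     estimated mean outcome per action (may be wrong)
    """
    # Direct (model-based) term: plug the outcome model into the target policy.
    direct = sum(p * outcome_hat[a] for a, p in enumerate(target))
    # Correction term: importance-weighted residuals of the outcome model.
    correction = sum(
        target[a] / propensity_hat[a] * (r - outcome_hat[a]) for a, r in logged
    ) / len(logged)
    return direct + correction


def make_log(counts, positives):
    """Build a log with exactly `positives[a]` outcomes of 1 among `counts[a]` draws."""
    log = []
    for a, (n, k) in enumerate(zip(counts, positives)):
        log += [(a, 1.0)] * k + [(a, 0.0)] * (n - k)
    return log


# Hypothetical toy problem: empirical frequencies match the true quantities exactly.
true_propensity = [0.5, 0.3, 0.2]   # logging policy
true_outcome = [0.2, 0.5, 0.9]      # mean outcome per action
target_policy = [0.1, 0.3, 0.6]
log = make_log(counts=[50, 30, 20], positives=[10, 15, 18])

true_value = sum(p * r for p, r in zip(target_policy, true_outcome))

wrong_outcome = [0.5, 0.5, 0.5]          # misspecified outcome model
wrong_propensity = [1 / 3, 1 / 3, 1 / 3]  # misspecified propensities

only_outcome_wrong = doubly_robust_value(log, target_policy, true_propensity, wrong_outcome)
only_propensity_wrong = doubly_robust_value(log, target_policy, wrong_propensity, true_outcome)
both_wrong = doubly_robust_value(log, target_policy, wrong_propensity, wrong_outcome)

print(true_value, only_outcome_wrong, only_propensity_wrong, both_wrong)
```

In this toy setup the estimate recovers the true value whenever at least one of the two nuisance inputs is correct, and is biased only when both are wrong. That is the sense of "either ... is correctly specified (without requiring both)" above. The paper's actual estimator for preference data may differ in form.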
Jun-14-2026, 14:01:27 GMT
- Technology: