On Monotonicity in AI Alignment
Gilles Bareilles, Julien Fageot, Lê-Nguyên Hoang, Peva Blanchard, Wassim Bouaziz, Sébastien Rouault, El-Mahdi El-Mhamdi
Comparison-based preference learning has become central to aligning AI models with human preferences. However, these methods can behave counterintuitively. We empirically observe that, after accounting for a preference for response $y$ over $z$, the model may actually decrease the probability (and reward) of generating $y$, an observation also made by others. This paper investigates the root causes of (non-)monotonicity within a general comparison-based preference learning framework that subsumes Direct Preference Optimization (DPO), Generalized Preference Optimization (GPO), and Generalized Bradley-Terry (GBT). Under mild assumptions, we prove that such methods still satisfy what we call local pairwise monotonicity. We also provide a bouquet of formalizations of monotonicity and identify sufficient conditions under which each is guaranteed, thereby providing a toolbox for evaluating how prone learning models are to monotonicity violations. These results clarify the limitations of current methods and provide guidance for developing more trustworthy preference learning algorithms.
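To make the phenomenon the abstract describes concrete, here is a minimal sketch (our own construction, not the paper's) of a DPO-style update on a linear-softmax policy over three responses. The features `phi`, the temperature `beta`, and the learning rate are hand-chosen assumptions; because the responses share parameters and the third response's features align with the update direction, a single gradient step on the pair $y \succ z$ decreases the probability of the preferred response $y$ even as its pairwise margin over $z$ grows.

```python
import numpy as np

# Minimal sketch of non-monotonicity under a DPO-style update.
# Policy: pi_w(r) = softmax(w . phi(r)) over three responses {y, z, v}.
# The features below are hand-chosen for illustration (an assumption,
# not the paper's construction): v's features align with the update
# direction, so it "piggybacks" on the preference update for y over z.
phi = {
    "y": np.array([1.0, 0.0]),   # preferred response
    "z": np.array([0.0, 1.0]),   # dispreferred response
    "v": np.array([2.0, -1.0]),  # unrelated third response
}

def probs(w):
    """Softmax probabilities of (y, z, v) under parameters w."""
    logits = np.array([w @ phi[r] for r in ("y", "z", "v")])
    e = np.exp(logits - logits.max())
    return e / e.sum()

w0 = np.zeros(2)        # initial parameters; reference policy pi_ref = pi_{w0}
beta, lr = 1.0, 0.2     # DPO temperature and learning rate

# DPO loss on the pair y > z (the log-partition terms cancel):
#   L(w) = -log sigmoid(beta * (w - w0) . (phi["y"] - phi["z"]))
# At w = w0 its gradient is -(beta / 2) * (phi["y"] - phi["z"]),
# so one gradient-descent step gives:
w1 = w0 + lr * (beta / 2) * (phi["y"] - phi["z"])

p0, p1 = probs(w0), probs(w1)
print(f"p(y): {p0[0]:.4f} -> {p1[0]:.4f}")  # 0.3333 -> 0.3289 (decreases!)
print(f"p(z): {p0[1]:.4f} -> {p1[1]:.4f}")  # 0.3333 -> 0.2693
print(f"p(v): {p0[2]:.4f} -> {p1[2]:.4f}")  # 0.3333 -> 0.4018 (mass leaks to v)
```

Running the sketch shows $p(y)$ dropping from about 0.333 to 0.329 while $p(v)$ rises to about 0.402: probability mass leaks to an unrelated response whose features align with the update direction. This shared-parameter coupling is what makes monotonicity guarantees non-trivial for realistic models.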
Jun-17-2025