KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela
–arXiv.org Artificial Intelligence
Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases; the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them being human-aware loss functions (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach Kahneman-Tversky Optimization (KTO).

To understand why these alignment methods work so well, and whether feedback needs to be in the form of preferences, we frame them through the lens of prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992). Prospect theory explains why humans make decisions about uncertain events that do not maximize expected value. It formalizes how humans perceive random variables in a biased but well-defined manner; for example, relative to some reference point, humans are more sensitive to losses than gains, a property called loss aversion. We show that popular alignment methods such as PPO (Schulman et al., 2017), DPO (Rafailov et al., 2023), and SLiC (Zhao et al., 2023) implicitly model such biases, helping explain their success independently of the data used. For this reason, we call them human-aware loss functions (HALOs).
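To make the loss-aversion property concrete, here is a minimal sketch (assuming NumPy, not code from the KTO authors) of the Tversky & Kahneman (1992) value function the abstract refers to. The parameter values alpha = beta = 0.88 and lambda = 2.25 are the median estimates reported in that paper; the function name is our own illustrative choice.

```python
import numpy as np

def kahneman_tversky_value(z, alpha=0.88, beta=0.88, lam=2.25):
    """Perceived value of an outcome z relative to a reference point at 0.

    Gains are concave (diminishing sensitivity to larger gains); losses
    are likewise concave in magnitude but scaled by lam > 1, so a loss
    is felt more strongly than an equally sized gain.
    """
    z = np.asarray(z, dtype=float)
    gains = np.abs(z) ** alpha          # v(z) = z^alpha for z >= 0
    losses = -lam * np.abs(z) ** beta   # v(z) = -lam * (-z)^beta for z < 0
    return np.where(z >= 0, gains, losses)

# Loss aversion in action: a 100-unit loss outweighs a 100-unit gain.
print(kahneman_tversky_value(100.0))   # ~  57.5
print(kahneman_tversky_value(-100.0))  # ~ -129.5
```

Under these parameters, the perceived pain of a loss is roughly 2.25 times the perceived pleasure of an equal gain, which is the bias the paper argues alignment objectives implicitly encode.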
Feb-2-2024