Guiding LLM Decision-Making with Fairness Reward Models
Zara Hall, Melanie Subbiah, Thomas P. Zollo, Kathleen McKeown, Richard Zemel
arXiv.org Artificial Intelligence
Large language models are increasingly used to support high-stakes decisions, potentially influencing who is granted bail or receives a loan. Naive chain-of-thought sampling can improve average decision accuracy, but has also been shown to amplify unfair bias. To address this challenge and enable the trustworthy use of reasoning models in high-stakes decision-making, we propose a framework for training a generalizable Fairness Reward Model (FRM). Our model assigns a fairness score to LLM reasoning, enabling the system to down-weight biased trajectories and favor equitable ones when aggregating decisions across reasoning chains. We show that a single Fairness Reward Model, trained on weakly supervised, LLM-annotated examples of biased versus unbiased reasoning, transfers across tasks, domains, and model families without additional fine-tuning. Applied to real-world decision-making tasks including recidivism prediction and social media moderation, we show that our approach consistently improves fairness while matching, or even surpassing, baseline accuracy.
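The abstract sketches the aggregation step at a high level: sample multiple chain-of-thought trajectories, score each one with the Fairness Reward Model, and let those scores weight each trajectory's vote on the final decision. Below is a minimal sketch of that fairness-weighted vote, assuming the chains have already been sampled; `frm_score` and the `(reasoning, decision)` pair format are placeholders, since the abstract does not specify the sampling or scoring interfaces.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

def fairness_weighted_decision(
    chains: List[Tuple[str, str]],      # sampled (reasoning trace, final decision) pairs
    frm_score: Callable[[str], float],  # hypothetical FRM: reasoning trace -> fairness score
) -> str:
    """Aggregate decisions across sampled reasoning chains, weighting each
    chain's vote by its fairness score so that biased trajectories are
    down-weighted and equitable ones dominate the final decision."""
    votes = defaultdict(float)
    for reasoning, decision in chains:
        votes[decision] += frm_score(reasoning)
    # Return the decision with the largest fairness-weighted vote mass.
    return max(votes, key=votes.get)
```

Plain self-consistency sampling is the special case where every chain gets weight 1; swapping the uniform weight for the FRM score is what lets the system favor equitable reasoning without fine-tuning the base model.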
Jul-16-2025
- Country:
  - Asia
  - Europe
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Monaco (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - United States
      - California > San Francisco County
        - San Francisco (0.14)
      - Florida > Miami-Dade County
        - Miami (0.04)
  - South America > Colombia
    - Meta Department > Villavicencio (0.04)
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
  - Law (1.00)