Self-Generated Critiques Boost Reward Modeling for Language Models
Yu, Yue, Chen, Zhengxing, Zhang, Aston, Tan, Liang, Zhu, Chenguang, Pang, Richard Yuanzhe, Qian, Yundi, Wang, Xuewei, Gururangan, Suchin, Zhang, Chao, Kambadur, Melanie, Mahajan, Dhruv, Hou, Rui
Reinforcement Learning from Human Feedback (RLHF) has been widely adopted to align large language models (LLMs) with human preferences (Ouyang et al., 2022; Touvron et al., 2023; Dubey et al., 2024; Reid et al., 2024). Central to the RLHF process is the reward model (RM), which is trained to assign scores that quantify how well the model's outputs align with human judgments. The reward model defines the optimization direction during training (e.g., as the reward signal in PPO), encouraging a policy LLM to generate more helpful, honest, and harmless responses, ultimately enhancing the model's generation quality in real-world applications. Standard reward models are typically trained on preference pairs and optimized with a pairwise logistic loss (Bradley and Terry, 1952), producing a single scalar score for each response. However, a scalar score is not only hard to interpret but also fails to fully leverage the inherent language modeling capability that LLMs acquire from pretraining and post-training (Zhang et al., 2024). Consequently, these reward models tend to be less data-efficient and more prone to robustness issues, such as reward hacking (Skalse et al., 2022; Singhal et al., 2023; Chen et al., 2024).
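
The standard pairwise objective described above can be summarized with a minimal sketch of the Bradley-Terry logistic loss over scalar reward scores; the function and variable names below are illustrative and not taken from the paper.

```python
# Minimal sketch of the standard Bradley-Terry pairwise loss for reward modeling,
# assuming a reward model that already maps each (prompt, response) pair to a
# scalar score. Function and variable names are illustrative.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch of preference pairs."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Dummy scalar scores for three preference pairs.
chosen = torch.tensor([1.2, 0.4, 0.9])     # scores assigned to the preferred responses
rejected = torch.tensor([0.3, 0.8, -0.1])  # scores assigned to the rejected responses
print(pairwise_rm_loss(chosen, rejected).item())
```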
arXiv.org Artificial Intelligence
Dec-18-2024