RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Liu, Yantao, Yao, Zijun, Min, Rui, Cao, Yixin, Hou, Lei, Li, Juanzi
–arXiv.org Artificial Intelligence
Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance.

Reward models play a pivotal role in both techniques. In RLHF, reward models serve as proxies for human values, providing feedback on generated text that helps align language models (policy models) during training (Ouyang et al., 2022; Dong et al., 2024). In Inference Scaling Laws, reward models are used to select the best response from a set of candidates based on predicted rewards (Wu et al., 2024; Snell et al., 2024). Despite their significance, benchmarks for reward models remain under-explored compared to the rapid advances in evaluating aligned language models, i.e., policy models (Hendrycks et al., 2020; bench authors, 2023; Chiang et al., 2024; Hendrycks et al., 2021).

To conduct a faithful and systematic evaluation, an ideal benchmark for reward models should adhere to three key principles: 1) Assessing Reward Models' Sensitivity to Subtle Changes: A faithful reward model should detect subtle content changes and assign the higher reward to the correct response. For example, in Table 1, Response 1 and Response 2 differ by only one word yet express completely different meanings, requiring the reward model to focus on content quality. 2) Evaluating Robustness against Style Bias: A reward model should not be swayed by superficial features such as length or formatting. For example, in Table 1, Response 3 is factually incorrect but longer than Response 1, which could mislead the reward model into assigning Response 3 a higher reward. 3) Correlating with Policy Models: A good reward model benchmark should correlate strongly with the performance of the aligned language model (the policy model), making it a reliable proxy for selecting the best reward model for alignment. Recent efforts (Lambert et al., 2024; Zhu et al., 2023; Jiang et al., 2023) have made progress by constructing benchmarks from existing preference datasets.
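The evaluation implied by these principles reduces to a pairwise check: given a prompt with a correct (chosen) response and a subtly corrupted or stylistically inflated (rejected) one, the reward model should assign the higher score to the chosen response. The sketch below illustrates that accuracy computation; it assumes an off-the-shelf sequence-classification reward model from Hugging Face and a hypothetical example pair, not the paper's actual data or code.

```python
# Minimal sketch (not the authors' released code): pairwise accuracy of a
# reward model on (prompt, chosen, rejected) triples where the two responses
# differ only by a subtle content change or a stylistic variation.
# The model name and the example triple below are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # any scalar RM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Scalar reward the model assigns to a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# Hypothetical triple: the rejected response differs from the chosen one by a
# single word that flips the meaning (the "subtle change" case in Table 1).
triples = [
    (
        "Is the Great Wall of China visible from low Earth orbit?",
        "No, it is generally not visible to the naked eye from orbit.",
        "Yes, it is generally visible to the naked eye from orbit.",
    ),
]

correct = sum(reward(p, chosen) > reward(p, rejected)
              for p, chosen, rejected in triples)
print(f"Pairwise accuracy: {correct / len(triples):.2%}")
```

A reward model that passes this check on content changes but fails when the rejected response is merely longer or more elaborately formatted exhibits exactly the style bias the benchmark is designed to expose.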
Oct-21-2024