RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Liu, Yantao, Yao, Zijun, Min, Rui, Cao, Yixin, Hou, Lei, Li, Juanzi
–arXiv.org Artificial Intelligence
Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance.

Reward models play a pivotal role in both techniques. In RLHF, reward models serve as proxies for human values, providing feedback on generated text that helps align language models (policy models) during training (Ouyang et al., 2022; Dong et al., 2024). In Inference Scaling Laws, reward models are used to select the best response from a set of candidates based on predicted rewards (Wu et al., 2024; Snell et al., 2024). Despite their significance, benchmarks for reward models remain under-explored compared to the rapid advances in evaluating aligned language models, i.e., policy models (Hendrycks et al., 2020; bench authors, 2023; Chiang et al., 2024; Hendrycks et al., 2021).

To conduct a faithful and systematic evaluation, an ideal benchmark for reward models should adhere to three key principles: 1) Assessing Reward Models' Sensitivity to Subtle Changes: A faithful reward model should detect subtle content changes and assign the higher reward to the correct response. For example, in Table 1, Response 1 and Response 2 differ by only one word yet express completely different meanings, requiring the reward model to focus on content quality. 2) Evaluating Robustness against Style Bias: A reward model should not be swayed by superficial features such as length or formatting. For example, in Table 1, Response 3 is factually incorrect but longer than Response 1, which could mislead the reward model into assigning Response 3 a higher reward. 3) Correlating with Policy Models: A good reward model benchmark should correlate strongly with the performance of the aligned language model (the policy model), making it a reliable proxy for selecting the best reward model for alignment. Recent efforts (Lambert et al., 2024; Zhu et al., 2023; Jiang et al., 2023) have made progress by constructing benchmarks from existing preference datasets.
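The evaluation implied by these principles reduces to a pairwise check: given a prompt with a correct (chosen) response and a subtly corrupted or stylistically inflated (rejected) one, the reward model should assign the higher score to the chosen response. The sketch below illustrates that accuracy computation; it assumes an off-the-shelf sequence-classification reward model from Hugging Face and a hypothetical example pair, not the paper's actual data or code.

```python
# Minimal sketch (not the authors' released code): pairwise accuracy of a
# reward model on (prompt, chosen, rejected) triples where the two responses
# differ only by a subtle content change or a stylistic variation.
# The model name and the example triple below are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # any scalar RM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Scalar reward the model assigns to a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# Hypothetical triple: the rejected response differs from the chosen one by a
# single word that flips the meaning (the "subtle change" case in Table 1).
triples = [
    (
        "Is the Great Wall of China visible from low Earth orbit?",
        "No, it is generally not visible to the naked eye from orbit.",
        "Yes, it is generally visible to the naked eye from orbit.",
    ),
]

correct = sum(reward(p, chosen) > reward(p, rejected)
              for p, chosen, rejected in triples)
print(f"Pairwise accuracy: {correct / len(triples):.2%}")
```

A reward model that passes this check on content changes but fails when the rejected response is merely longer or more elaborately formatted exhibits exactly the style bias the benchmark is designed to expose.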
Oct-21-2024