How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback). The gold-standard approach is to run a full RLHF training pipeline and directly probe downstream LLM performance. However, this process is prohibitively expensive. To address this, we build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. These proxy tasks consist of a large-scale human preference dataset and a verifiable correctness preference dataset, on which we measure 12 metrics across 12 domains. To investigate which reward model metrics are most correlated with gold-standard RLHF outcomes, we launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to obtain ground-truth measurements of real reward model downstream performance.

The ultimate test of a reward model is as follows: does the reward model lead to good post-RLHF language model performance? In other words, because the reward model will be used as a reference signal for LLM training, in principle only the downstream LLM performance matters. However, to evaluate downstream performance, we must train a new LLM using the reward model and evaluate the resulting LLM, a prohibitively expensive and time-consuming process (Figure 1). This long development-feedback cycle poses a significant challenge, limiting achievable reward model quality and, consequently, the effectiveness of the entire RLHF process. Reward models feed into the very beginning of the RLHF pipeline, making iterative improvements prohibitively slow. Our benchmark, PPE, enables a fast feedback loop that is correlated with downstream outcomes. This paper introduces a cost-effective method for approximating the effect of a reward model on downstream LLM performance.
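To make the two-step idea concrete, the sketch below (not the authors' released benchmark code; all function names, reward models, and numbers are hypothetical placeholders) shows (1) scoring a reward model on a proxy preference task via pairwise preference accuracy, and (2) checking how well such proxy scores rank-correlate across reward models with ground-truth post-RLHF performance.

```python
# A minimal sketch, under assumed data formats, of the evaluation idea described above.
from scipy.stats import spearmanr

def preference_accuracy(reward_fn, preference_pairs):
    """Fraction of pairs where the reward model scores the human-preferred
    (chosen) response above the rejected one. `reward_fn` and the pair format
    are hypothetical stand-ins for a real reward model interface."""
    hits = 0
    for prompt, chosen, rejected in preference_pairs:
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected):
            hits += 1
    return hits / len(preference_pairs)

# Suppose we have proxy accuracies and ground-truth downstream scores
# (e.g., post-RLHF win rates measured on a human preference platform)
# for several reward models. The benchmark is useful to the extent that
# the proxy metric orders reward models the same way downstream outcomes do.
proxy_scores = {"rm_a": 0.71, "rm_b": 0.64, "rm_c": 0.78}        # placeholder values
downstream_win_rates = {"rm_a": 0.55, "rm_b": 0.48, "rm_c": 0.61}  # placeholder values

models = sorted(proxy_scores)
rho, p_value = spearmanr(
    [proxy_scores[m] for m in models],
    [downstream_win_rates[m] for m in models],
)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A high rank correlation between a proxy metric and downstream win rates is what would justify using that metric as a fast stand-in for the full RLHF pipeline.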
arXiv.org Artificial Intelligence
Oct-22-2024