Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

Zhu, Banghua, Jordan, Michael I., Jiao, Jiantao

arXiv.org Artificial Intelligence 

A key ingredient in the roll-out of LLMs is the fine-tuning step, in which the models are brought into closer alignment with specific behavioral and normative goals. When not adequately fine-tuned, LLMs may exhibit undesirable and unpredictable behavior, including the fabrication of facts or the generation of biased and toxic content (Perez et al., 2022; Ganguli et al., 2022). The current approach to mitigating such problems is to make use of reinforcement learning based on human assessments. In particular, Reinforcement Learning with Human Feedback (RLHF) learns a reward function from human assessments given as pairwise or multi-wise comparisons of model responses, and then fine-tunes the language model against the learned reward function (Ziegler et al., 2019; Ouyang et al., 2022; Schulman et al., 2022). Following a supervised learning stage, a typical RLHF protocol involves two main steps:

Reward learning: Sample prompts from a prompt dataset and generate multiple responses for the same prompt.
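The reward-learning step is commonly formalized by fitting a scalar reward model to the human comparisons, e.g., under a Bradley-Terry-style model in which the probability that one response is preferred over another is a logistic function of their reward difference. Below is a minimal sketch of that pairwise maximum-likelihood objective in PyTorch; the names reward_model, reward_learning_step, and the batch fields ("prompt", "chosen", "rejected") are hypothetical and introduced here only for illustration, not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood under a Bradley-Terry-style model:
    P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def reward_learning_step(reward_model, optimizer, batch) -> float:
    """One gradient step on a batch of human pairwise comparisons.
    `reward_model` is assumed to map (prompt, response) pairs to scalar rewards."""
    r_chosen = reward_model(batch["prompt"], batch["chosen"])
    r_rejected = reward_model(batch["prompt"], batch["rejected"])
    loss = pairwise_reward_loss(r_chosen, r_rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The reward model fitted this way is then what the second stage of the protocol optimizes the language model against; how that fitting overfits and how the resulting policy overoptimizes the learned reward are the failure modes this paper targets.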