Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits

Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou

arXiv.org Machine Learning 

Reinforcement Learning from Human Feedback (RLHF) is a key technique for training large language models (LLMs) with human preference data [Ouyang et al., 2022, Bai et al., 2022]. The RLHF process involves collecting preference data, where each sample consists of a prompt, a pair of responses, and a human label indicating the preferred response. A reward model is then trained to predict human preferences, and the LLM is fine-tuned against the reward model using RL algorithms such as PPO [Schulman et al., 2017]. Given the notable success of RLHF, recent efforts have been devoted to developing a deeper theoretical understanding of this approach. Zhu et al. [2023] investigated the standard offline setting, in which the learner is given a fixed dataset and aims to learn a policy that maximizes the expected reward. Since the learner has no control over the data collection process in this setting, the quality of the dataset is crucial to the performance of the learned policy, and the resulting policy often performs poorly on out-of-distribution data [Burns et al., 2024]. In practice, the Claude [Bai et al., 2022] and LLaMA-2 [Touvron et al., 2023] projects have demonstrated that iterative RLHF, in which preference data are collected and the model is updated over multiple rounds, can significantly enhance model performance.
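To make the reward-modeling step concrete, the sketch below fits a reward model on pairwise preference data with the standard Bradley-Terry (pairwise logistic) loss. It is a minimal illustration under assumed placeholders, not the construction analyzed in the paper: the feature dimension, the linear reward head, and the random "embeddings" are hypothetical stand-ins for an LLM-based reward model.

```python
# Minimal sketch of the reward-modeling step of RLHF: fit a reward model on
# pairwise preference data with the Bradley-Terry (pairwise logistic) loss.
# All shapes, the linear reward head, and the random features below are
# illustrative assumptions, not the setup used in the paper.

import torch
import torch.nn as nn

torch.manual_seed(0)

DIM = 16          # assumed embedding dimension of a (prompt, response) pair
N_PAIRS = 256     # assumed number of preference-labeled pairs

# Toy features: phi(prompt, chosen response) and phi(prompt, rejected response).
phi_chosen = torch.randn(N_PAIRS, DIM)
phi_rejected = torch.randn(N_PAIRS, DIM)

# Linear reward model r_theta(x, y) = <theta, phi(x, y)>, a stand-in for an LLM head.
reward_model = nn.Linear(DIM, 1, bias=False)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for step in range(200):
    r_chosen = reward_model(phi_chosen).squeeze(-1)      # r_theta(x, y_w)
    r_rejected = reward_model(phi_rejected).squeeze(-1)  # r_theta(x, y_l)
    # Bradley-Terry negative log-likelihood: -log sigma(r(x, y_w) - r(x, y_l)).
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The fitted reward model would then drive the policy-optimization stage (e.g., PPO).
print(f"final preference loss: {loss.item():.4f}")
```

Under the Bradley-Terry model, the probability that the chosen response is preferred over the rejected one is the sigmoid of their reward difference, so minimizing the loss above is maximum-likelihood estimation of the reward model; the RL fine-tuning stage then optimizes the LLM policy against the learned reward.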