Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration
Bose, Avinandan, Xiong, Zhihan, Saha, Aadirupa, Du, Simon Shaolei, Fazel, Maryam
arXiv.org Artificial Intelligence
Reinforcement Learning from Human Feedback (RLHF) stands out as the primary method for aligning large language models with human preferences (Bai et al., 2022; Christiano et al., 2017; Ouyang et al., 2022). Instead of starting from scratch with unsupervised training on extensive datasets, RLHF aligns pre-trained models using labeled human preferences on pairs of responses, offering a statistically lightweight approach to making language models more human-like. While labeling response pairs is easier than generating new responses, the volume of these pairs is critical for effective alignment: a large dataset is needed to ensure broad coverage of linguistic nuances, reduce the impact of noisy human feedback, and provide enough statistical power for the model to generalize well. Even so, scaling the labeling process can become resource-intensive, making the number of labeled response pairs a key factor in successful model alignment. In light of this, a theoretical question of interest has recently arisen: How can algorithms be designed to be sample-efficient during this alignment phase? Two main approaches have emerged to address this question: online RLHF and offline RLHF. Online methods (Cen et al., 2024; Xie et al., 2024; Zhang et al., 2024) have interactive access to human feedback or leverage a more powerful language model to explore diverse and novel responses beyond what the pre-trained model can provide.
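As a concrete illustration of learning from labeled response pairs, the sketch below shows the pairwise Bradley-Terry preference loss commonly used to fit reward models in RLHF. It is a minimal sketch, not the authors' algorithm; the function and variable names (preference_loss, r_chosen, r_rejected) are illustrative assumptions rather than identifiers from the paper.

import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry model: P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected).
    # The loss is the negative log-likelihood of the observed human preference labels.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical reward-model scores for a small batch of labeled response pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.5, 0.1, 1.5])
loss = preference_loss(r_chosen, r_rejected)  # decreases as preferred responses score higher

The number of labeled pairs entering this objective controls the statistical power of the resulting estimate, which is exactly the sample-efficiency question that the online and offline approaches above aim to answer.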
Dec-13-2024