Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment

Tong Yang, Jincheng Mei, Hanjun Dai, Zixin Wen, Shicong Cen, Dale Schuurmans, Yuejie Chi, Bo Dai

arXiv.org Machine Learning 

Fine-tuning large language models (LLMs) to align with human preferences has become a critical challenge in artificial intelligence, essential for ensuring the safety of their deployment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach, significantly improving LLM performance as demonstrated by InstructGPT [Ouyang et al., 2022] and subsequent works. RLHF combines reward modeling, which quantifies human preferences, with RL fine-tuning, which adjusts the LLM's output distribution to promote desired responses while suppressing unfavorable ones. While RLHF has shown promising results, it incurs significant additional post-training cost, and the aligned LLM may suffer performance degradation due to the alignment tax [Askell et al., 2021, OpenAI, 2023]. Alternatively, best-of-$N$ (BoN) sampling has emerged as a simple and surprisingly effective technique for obtaining high-quality outputs from an LLM [Stiennon et al., 2020]. In BoN sampling, $N$ candidate responses are drawn from the LLM, ranked according to a scoring criterion (typically a reward model), and the best one is selected.
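To make the BoN procedure concrete, below is a minimal sketch in Python. The `sample_response` and `reward` callables are hypothetical stand-ins for an LLM sampler and a reward model, introduced here only for illustration; they are not part of the paper.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    sample_response: Callable[[str], str],  # hypothetical: draws one response from the LLM
    reward: Callable[[str, str], float],    # hypothetical: scores a (prompt, response) pair
    n: int = 8,
) -> str:
    """Best-of-N sampling: draw N candidate responses and keep the highest-scoring one."""
    candidates: List[str] = [sample_response(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    # Select the candidate ranked best by the scoring criterion (e.g., a reward model).
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```

In practice, `sample_response` would call the LLM's sampling API with a nonzero temperature so that the $N$ draws differ, and `reward` would be a trained reward model; the distribution over responses selected this way is the target that iterative BoN distillation aims to match with a single aligned model.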