BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling

Neural Information Processing Systems 

This paper concerns the problem of aligning samples from large language models to human preferences using *best-of-n* sampling, where we draw n samples, rank them, and return the best one. We consider two fundamental problems. First: what is the relationship between best-of-n and other (RLHF-type) approaches to aligning LLMs? In particular, when should one be preferred to the other? We show that the best-of-n sampling distribution is essentially equivalent to the policy learned by RLHF if we apply a particular monotone transformation to the reward function.
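
For concreteness, the following is a minimal sketch of best-of-n sampling as described above. The helpers `sample_fn` and `reward_fn` are hypothetical stand-ins for a base-model sampler and a learned reward model; neither is specified here, so treat this as an illustration of the procedure rather than the paper's implementation.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    sample_fn: Callable[[str], str],         # hypothetical: draws one completion from the base model
    reward_fn: Callable[[str, str], float],  # hypothetical: scores a (prompt, completion) pair
    n: int = 8,
) -> str:
    """Draw n i.i.d. completions, rank them by reward, and return the best one."""
    candidates: List[str] = [sample_fn(prompt) for _ in range(n)]
    # Rank by reward and return the argmax (ties broken arbitrarily).
    return max(candidates, key=lambda c: reward_fn(prompt, c))
```

Note that the procedure never updates the base model's weights; it reshapes the sampling distribution purely at inference time, which is what makes its relationship to RLHF-trained policies a natural question.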