 best-of-n sampling


BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling

Neural Information Processing Systems

This paper concerns the problem of aligning samples from large language models to human preferences using *best-of-n* sampling, where we draw n samples, rank them, and return the best one. We consider two fundamental problems. First: what is the relationship between best-of-n and other (RLHF-type) approaches to aligning LLMs? In particular, when should one be preferred to the other? We show that the best-of-n sampling distribution is essentially equivalent to the policy learned by RLHF if we apply a particular monotone transformation to the reward function.
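The procedure described in this abstract (draw n samples, rank them, return the best one) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: `sample` and `reward` are hypothetical callables standing in for an LLM sampler and a reward model, and the toy vocabulary and length-based reward are assumptions made only so the example runs.

```python
import random
from typing import Callable


def best_of_n(sample: Callable[[], str], reward: Callable[[str], float], n: int) -> str:
    """Draw n candidates from `sample` and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample() for _ in range(n)]
    # max() returns the first candidate that attains the highest reward.
    return max(candidates, key=reward)


# Toy stand-ins (assumptions): a random "model" and a length-based "reward".
rng = random.Random(0)
VOCAB = ["ok", "good", "great", "excellent answer"]


def toy_sample() -> str:
    return rng.choice(VOCAB)


def toy_reward(text: str) -> float:
    return float(len(text))


best = best_of_n(toy_sample, toy_reward, n=8)
print(best)
```

Because only the ranking of candidates matters here, any strictly increasing transformation of `reward` leaves the selected sample unchanged.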


Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Wang, Yiming, Zhang, Pei, Huang, Siyuan, Yang, Baosong, Zhang, Zhuosheng, Huang, Fei, Wang, Rui

arXiv.org Artificial Intelligence

Test-time scaling improves large language model performance by adding extra compute during decoding. Best-of-N (BoN) sampling serves as a common scaling technique, broadening the search space for finding better solutions from the model distribution. However, traditional BoN requires N full generations, leading to high GPU memory overhead and latency. Moreover, some methods depend on reward models, adding computational cost and limiting domain generalization. In this paper, we propose Self-Truncation Best-of-N (ST-BoN), a novel decoding method that avoids fully generating all samples and eliminates the need for reward models. ST-BoN introduces early sampling consistency to estimate the most promising sample, truncating suboptimal ones to free memory and accelerate inference. This makes test-time scaling more sampling-efficient. Compared to traditional BoN, ST-BoN can reduce dynamic GPU memory overhead by over 90% and latency by 50%, while achieving comparable or even better performance across reasoning and open-ended domains.
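The idea of picking a sample early by consistency, then truncating the rest, might look roughly like the sketch below. The abstract does not say how ST-BoN measures consistency, so the token-level Jaccard similarity used here is purely an assumption chosen for illustration, and `most_consistent_index` is a hypothetical helper, not part of the authors' method.

```python
from typing import List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity between the token sets of two prefixes."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def most_consistent_index(prefixes: List[Sequence[str]]) -> int:
    """Return the index of the prefix that agrees most with the others.

    Assumption: "consistency" is the mean pairwise Jaccard similarity.
    """
    if not prefixes:
        raise ValueError("need at least one prefix")
    if len(prefixes) == 1:
        return 0
    scores = []
    for i, p in enumerate(prefixes):
        sims = [jaccard(p, q) for j, q in enumerate(prefixes) if j != i]
        scores.append(sum(sims) / len(sims))
    return max(range(len(prefixes)), key=scores.__getitem__)


# Early prefixes of N = 3 partially decoded samples (toy data).
prefixes = [
    "first add 2 and 3".split(),
    "first add 2 and 3 then".split(),
    "the weather is nice".split(),
]
keep = most_consistent_index(prefixes)
# Only the kept sample would be decoded to completion; the others are dropped.
survivors = [p for i, p in enumerate(prefixes) if i == keep]
```

No reward model appears anywhere in this sketch: the selection signal comes only from the samples themselves, which matches the self-estimating framing in the abstract.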


Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Chow, Yinlam, Tennenholtz, Guy, Gur, Izzeddin, Zhuang, Vincent, Dai, Bo, Thiagarajan, Sridhar, Boutilier, Craig, Agarwal, Rishabh, Kumar, Aviral, Faust, Aleksandra

arXiv.org Artificial Intelligence

An effective method for improving the performance of large language models (LLMs) is to leverage additional computation at inference time: various works (Hosseini et al., 2024; Kumar et al., 2024; Lightman et al., 2023; Wu et al., 2024) have shown that by using search, re-ranking, multi-turn revision, and more generally, any approach that makes use of more tokens and inference-time compute, the performance of LLMs on various tasks can be significantly improved, so much so that investing in improving inference-time computation might prove more beneficial than increasing model pre-training compute (Snell et al., 2024). Despite this promise, existing work largely considers using inference-time computation as an optional post-hoc design choice, after conventional pre-training and fine-tuning. However, decoupling training and inference-time computation is not optimal; for example, if we knew that an LLM is allowed to make multiple attempts to solve a math problem, then it may be better to fine-tune it to explore diverse problem-solving strategies, rather than simply generating the candidates that represent the model's best attempt at solving the problem. Within the context of reasoning problems, these performance gains may be significant, as LLMs often fail due to their inability to draw complex inferences about the input and their internal knowledge (Chen et al., 2024). We argue that the effectiveness of inference-time computation can be substantially increased by explicitly considering the inference procedure during training. We study this inference-aware fine-tuning paradigm using the Best-of-N (BoN) inference strategy, where the LLM generates multiple candidate responses, and a verifier selects the best one according to some scoring function (Cobbe et al., 2021).
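The math-problem example above can be made concrete with a small calculation. The sketch below assumes a perfect verifier and independent attempts, so that a problem solved with probability p per attempt is solved by Best-of-N with probability 1 - (1 - p)^N. Both assumptions, and the two toy policies, are illustrative and do not come from the paper.

```python
from typing import Sequence


def bon_success(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n


def average_bon_success(per_problem_p: Sequence[float], n: int) -> float:
    """Mean Best-of-N success rate over a set of problems."""
    return sum(bon_success(p, n) for p in per_problem_p) / len(per_problem_p)


# Toy policies over 10 problems (assumed numbers, for illustration only).
# "Committed": always gives the same answer, right on 6 problems, wrong on 4.
committed_policy = [1.0] * 6 + [0.0] * 4
# "Diverse": explores, so each attempt solves any given problem 30% of the time.
diverse_policy = [0.3] * 10

for n in (1, 8):
    print(
        n,
        round(average_bon_success(committed_policy, n), 4),
        round(average_bon_success(diverse_policy, n), 4),
    )
```

Under these assumptions the committed policy is better with a single attempt, while the diverse policy is far better with eight, which is the kind of gap that training with the inference procedure in mind aims to exploit.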


TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling

Qiu, Jiahao, Lu, Yifu, Zeng, Yifan, Guo, Jiacheng, Geng, Jiayi, Wang, Huazheng, Huang, Kaixuan, Wu, Yue, Wang, Mengdi

arXiv.org Artificial Intelligence

Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning, but it presents the challenge of balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but at a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into BoN sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, HH-RLHF, UltraFeedback, GSM8K, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves the highest win rate of 65% on TutorEval and win rates of around 60% on the other datasets, outperforming standard BoN at the same computational cost and showcasing its scalability and alignment efficacy.
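The branch-and-prune loop described in this abstract can be sketched generically as follows. This is a toy illustration under stated assumptions, not the TreeBoN implementation: `extend` is a hypothetical function that continues a partial response by one segment, `score` is a hypothetical stand-in for the token-level DPO reward, and the parameter names are invented for the example.

```python
import random
from typing import Callable, List


def tree_search_bon(
    extend: Callable[[str], str],
    score: Callable[[str], float],
    root: str,
    num_parents: int,
    branch_factor: int,
    depth: int,
) -> str:
    """Grow partial responses layer by layer, keeping only the top-scoring ones."""
    parents: List[str] = [root]
    for _ in range(depth):
        # Branch: every surviving parent spawns `branch_factor` children.
        children = [extend(p) for p in parents for _ in range(branch_factor)]
        # Prune: keep the `num_parents` highest-scoring partial responses.
        children.sort(key=score, reverse=True)
        parents = children[:num_parents]
    return max(parents, key=score)


# Toy stand-ins (assumptions): responses are digit strings, reward is the digit sum.
rng = random.Random(0)


def toy_extend(prefix: str) -> str:
    return prefix + str(rng.randint(0, 9))


def toy_score(text: str) -> float:
    return float(sum(int(ch) for ch in text))


best = tree_search_bon(toy_extend, toy_score, root="", num_parents=2, branch_factor=3, depth=4)
print(best, toy_score(best))
```

Pruning after every layer means compute is spent extending only promising prefixes, in contrast to standard BoN, which completes all N responses before scoring any of them.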