AdaBoN: Adaptive Best-of-N Alignment
Vinod Raman, Hilal Asi, Satyen Kale
Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RMs). However, these approaches can be computationally expensive, especially when applied uniformly across prompts without accounting for differences in alignment difficulty. In this work, we propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Motivated by latency concerns, we develop a two-stage algorithm: an initial exploratory phase estimates the reward distribution for each prompt using a small exploration budget, and a second stage adaptively allocates the remaining budget using these estimates. Our method is simple, practical, and compatible with any LM-RM combination. Empirical results on prompts from the AlpacaEval, HH-RLHF, and PKU-SafeRLHF datasets, for 12 LM-RM pairs and 50 different batches of prompts, show that our adaptive strategy outperforms uniform allocation with the same inference budget. Moreover, we show that our adaptive strategy remains competitive against uniform allocations with 20% larger inference budgets, and that its performance improves as the batch size grows.

Language models (LMs) have demonstrated human-like capabilities across a wide range of tasks, including mathematics, coding, and creative writing (Brown et al., 2020; Achiam et al., 2023). While pre-training on massive corpora equips these models with extensive knowledge, it is crucial that their responses at inference time adhere to ethical standards and safety guidelines. A common approach leverages preference data to steer the model toward more desirable outputs. For example, post-training methods such as Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Ouyang et al., 2022), Direct Preference Optimization (DPO) (Rafailov et al., 2023), and its variants (Glaese et al., 2022) fine-tune the model weights while constraining the updated model to remain close to a reference model. Despite their empirical success, post-training methods are computationally expensive and can introduce unintended and opaque changes to the base model (Ouyang et al., 2022; Bai et al., 2022).
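The two-stage scheme lends itself to a short sketch. The Python below is a minimal illustration, not the paper's exact algorithm: `generate(prompt)` and `reward(prompt, response)` are assumed black-box callables for the LM and RM, and the standard deviation of the exploratory rewards is a hypothetical difficulty proxy standing in for whatever estimate the paper derives from the reward distribution.

```python
import numpy as np

def best_of_n(prompt, n, generate, reward):
    """Vanilla Best-of-N: draw n responses, keep the one the RM scores highest."""
    responses = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, r) for r in responses]
    return responses[int(np.argmax(scores))]

def adaptive_best_of_n(prompts, total_budget, explore_budget, generate, reward):
    """Two-stage adaptive allocation (illustrative sketch, not the paper's rule).

    Stage 1 spends `explore_budget` samples per prompt to estimate each
    prompt's reward distribution; Stage 2 splits the leftover budget across
    prompts in proportion to a difficulty score (here, the std of the
    exploratory rewards -- a hypothetical proxy).
    """
    n = len(prompts)
    assert total_budget >= explore_budget * n, "budget too small for exploration"

    # Stage 1: uniform exploration to estimate per-prompt reward spread.
    pools, difficulty = [], []
    for p in prompts:
        rs = [generate(p) for _ in range(explore_budget)]
        ss = [reward(p, r) for r in rs]
        pools.append((rs, ss))
        difficulty.append(np.std(ss) + 1e-8)  # avoid zero weights

    # Stage 2: allocate the remaining samples proportionally to difficulty.
    remaining = total_budget - explore_budget * n
    weights = np.asarray(difficulty) / np.sum(difficulty)
    alloc = np.floor(weights * remaining).astype(int)
    for i in np.argsort(-weights)[: remaining - alloc.sum()]:
        alloc[i] += 1  # hand rounding leftovers to the hardest prompts

    # Draw the extra samples; return the best response seen for each prompt.
    outputs = []
    for p, (rs, ss), k in zip(prompts, pools, alloc):
        extra = [generate(p) for _ in range(int(k))]
        ss = ss + [reward(p, r) for r in extra]
        rs = rs + extra
        outputs.append(rs[int(np.argmax(ss))])
    return outputs
```

Under this sketch, uniform Best-of-N is the special case where every prompt receives `total_budget / n` samples; the adaptive variant instead concentrates the post-exploration budget on prompts whose estimated reward distributions suggest more room for improvement.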
arXiv.org Artificial Intelligence
Sep-30-2025