CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning

Yung-Chen Tang, Pin-Yu Chen, Andrea Cavallaro

arXiv.org Artificial Intelligence 

Allocating more computation at inference time (test-time scaling) improves language model performance, especially on reasoning tasks, but naively increasing the sampling budget can be inefficient. To address this inefficiency, we introduce a general test-time calibration framework that adaptively steers the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound on expected reward under finite sampling, all without retraining the large language model (LLM). Within this framework, we propose CarBoN (Calibrated Best-of-N), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature T and additive shift vector δ, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, requiring up to 4× fewer rollouts to reach the same accuracy, while often achieving higher accuracy under a fixed budget. We also analyze the complementary roles of T and δ in balancing output diversity and correctness, and show that the framework generalizes to step-level sampling strategies such as beam search.

Test-time scaling (TTS) is a practical alternative to ever-larger training runs, enabling models to "think longer" at inference by allocating additional computation to reasoning. TTS allows smaller LLMs to match or even outperform larger ones, offering a more cost-efficient and flexible inference strategy. Despite these benefits, simply increasing test-time compute does not guarantee optimal performance: inference without effective verification is often sub-optimal, as models may spend additional computation on low-quality reasoning paths (Setlur et al., 2025). To overcome this inefficiency, we propose a general test-time calibration framework that strategically reallocates the inference budget by leveraging feedback from a verifier or reward model during inference.
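The Best-of-N baseline that CarBoN builds on can be stated in a few lines: sample N candidate solutions and keep the one a verifier or reward model scores highest. The sketch below is a minimal, hedged illustration, not the paper's implementation; `generate` and `reward` are hypothetical stand-ins for an LLM sampler and a reward model, replaced here by toy functions so the example runs.

```python
def best_of_n(prompt, generate, reward, n=8):
    """Plain Best-of-N sampling: draw n candidates for `prompt`,
    score each with the reward model, return the best one."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy stand-ins (assumptions, not the paper's setup): "generation"
# walks a fixed candidate pool, and the "reward" prefers candidates
# close to a hidden correct answer of 7.
pool = iter([3, 9, 7, 1, 5])
gen = lambda p: next(pool)
rew = lambda p, c: -abs(c - 7)

answer, score = best_of_n("2 + 5 = ?", gen, rew, n=5)
# answer == 7, score == 0
```

Under a fixed budget, every rollout spent on a low-reward candidate is wasted, which is the inefficiency the calibration framework targets.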
Rather than treating generation as a fixed forward pass, the model adaptively steers toward high-reward (likely correct) regions, improving reasoning reliability under a fixed query budget.
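The calibration itself is a lightweight transform on the next-token logits: scale by an input-specific temperature T and add a shift vector δ before the softmax. The snippet below is a sketch of that transform under stated assumptions (the paper learns T and δ per input from the exploration phase; here they are set by hand to show their complementary effects on the output distribution).

```python
import math

def calibrated_probs(logits, T=1.0, delta=None):
    """Softmax over calibrated logits z_i = logits_i / T + delta_i.
    Lower T sharpens the distribution; delta reweights
    individual tokens (e.g., toward high-reward continuations)."""
    if delta is None:
        delta = [0.0] * len(logits)
    z = [l / T + d for l, d in zip(logits, delta)]
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
base = calibrated_probs(logits)                            # uncalibrated
sharper = calibrated_probs(logits, T=0.5)                  # T < 1: less diverse
shifted = calibrated_probs(logits, delta=[0.0, 1.5, 0.0])  # boost token 1
```

This separation mirrors the roles analyzed in the paper: T trades off diversity against concentration, while δ redirects probability mass toward specific tokens.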