The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

Bagirov, Farid, Arkhipov, Mikhail, Sycheva, Ksenia, Glukhov, Evgeniy, Bogomolov, Egor

Oct-28-2025–arXiv.org Artificial Intelligence

The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to the off-policy updates, a common element in modern RLVR algorithms, that allows better sample efficiency. Empirically, we show that our objective effectively optimizes max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.

large language model, machine learning, reinforcement learning, (21 more...)

arXiv.org Artificial Intelligence

Oct-28-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report > New Finding (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Reinforcement Learning (0.68)
  - Natural Language > Large Language Model (0.49)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found