Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
