CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning