Reward Model Generalization for Compute-Aware Test-Time Reasoning