Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models
