Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order
