Pre-Trained Policy Discriminators are General Reward Models
