Are PPO-ed Language Models Hackable?
