Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
