School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
