Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
