Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
McKee-Reid, Leo, Sträter, Christoph, Martinez, Maria Angelica, Needham, Joe, Balesni, Mikita
–arXiv.org Artificial Intelligence
Previous work has shown that training "helpful-only" LLMs with reinforcement learning on a curriculum of gameable environments can lead models to generalize to egregious specification gaming, such as editing their own reward function or modifying task checklists to appear more successful. We show that gpt-4o, gpt-4o-mini, o1-preview, and o1-mini -- frontier models trained to be helpful, harmless, and honest -- can engage in specification gaming without training on a curriculum of tasks, purely from in-context iterative reflection (which we call in-context reinforcement learning, "ICRL"). We also show that using ICRL to generate highly rewarded outputs for expert iteration (compared to the standard expert iteration reinforcement learning algorithm) may increase gpt-4o-mini's propensity to learn specification-gaming policies, generalizing (in very rare cases) to the most egregious strategy, in which gpt-4o-mini edits its own reward function. Our results point toward the strong ability of in-context reflection to discover rare specification-gaming strategies that models might not exhibit zero-shot or with normal training, highlighting the need for caution when relying on the alignment of LLMs in zero-shot settings.
In-Context Reinforcement Learning (ICRL) is a family of methods that enables Large Language Models (LLMs) to iteratively refine their policy by incorporating feedback at test time. Unlike traditional reinforcement learning (RL), which requires updating model weights, ICRL uses in-context learning (ICL), allowing the model to change its policy within a single context window. Using ICRL at inference time has been shown to improve LLM performance over conventional single-step, non-iterative generation in widely used evaluation tasks such as coding (Madaan et al., 2023; Li et al., 2022), summarization (Scheurer et al., 2022), decision-making, and language reasoning (Shinn et al., 2023). Beyond inference, ICRL has also proven effective when incorporated into the training process: fine-tuning LLMs on refinements generated from different feedback sources -- such as human feedback (Scheurer et al., 2024a; et al., 2024), another LLM (Lee et al., 2024), or self-reflection (Li et al., 2023; Ye and Ng, 2024) -- has emerged as a strategy that surpasses traditional fine-tuning methods in some settings. However, iterative reflection on feedback and adaptation within a context window can introduce risks, including specification gaming, also known as reward hacking (Pan et al., 2024a;b; Scheurer et al., 2024b). Specification gaming occurs when an AI system exploits misspecified goals to achieve high reward by solving the task in an unintended (and often undesirable) way (DeepMind, 2024).
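To make the two setups concrete, the sketch below illustrates (i) an ICRL episode, where feedback and a reflection request are appended to the same context window so the policy updates without any weight changes, and (ii) how highly rewarded ICRL outputs could be filtered into a fine-tuning dataset for expert iteration. This is a minimal illustration under assumed interfaces: `query_model`, `Task`, `reward_threshold`, and the prompt wording are hypothetical placeholders, not the paper's actual harness or environments.

```python
# Minimal sketch of an ICRL (in-context reflection) loop and of collecting
# highly rewarded ICRL outputs for expert iteration. All names and prompts
# here are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    instructions: str                    # task prompt shown to the model
    evaluate: Callable[[str], float]     # returns a scalar reward for an attempt


def icrl_episode(query_model: Callable[[list[dict]], str],
                 task: Task,
                 max_steps: int = 5) -> tuple[str, float, list[dict]]:
    """Run one ICRL episode: generate, receive reward, reflect in-context, retry."""
    messages = [{"role": "user", "content": task.instructions}]
    best_attempt, best_reward = "", float("-inf")
    for _ in range(max_steps):
        attempt = query_model(messages)      # model acts given the full history
        reward = task.evaluate(attempt)      # environment scores the attempt
        if reward > best_reward:
            best_attempt, best_reward = attempt, reward
        # Feedback and a reflection prompt are appended to the same context
        # window, so the policy changes via in-context learning alone.
        messages += [
            {"role": "assistant", "content": attempt},
            {"role": "user", "content": (
                f"Your attempt received reward {reward:.2f}. "
                "Reflect on what to change, then try again."
            )},
        ]
    return best_attempt, best_reward, messages


def icrl_expert_iteration_dataset(query_model: Callable[[list[dict]], str],
                                  tasks: list[Task],
                                  reward_threshold: float = 0.9) -> list[dict]:
    """Keep only highly rewarded ICRL outputs as supervised fine-tuning data,
    i.e. expert iteration seeded with ICRL-generated samples."""
    dataset = []
    for task in tasks:
        attempt, reward, _ = icrl_episode(query_model, task)
        if reward >= reward_threshold:
            dataset.append({"prompt": task.instructions, "completion": attempt})
    return dataset
```

In the standard expert iteration baseline, the filtered samples would instead come from single-step generation; fine-tuning the model on the resulting dataset and repeating the collection step closes the training loop.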
Oct-8-2024