Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Wichers, Nevan, Ebtekar, Aram, Azarbal, Ariana, Gillioz, Victor, Ye, Christine, Ryd, Emil, Rathi, Neil, Sleight, Henry, Mallen, Alex, Roger, Fabien, Marks, Samuel
–arXiv.org Artificial Intelligence
Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities. Standard approaches for aligning and adapting large language models (LLMs) to downstream tasks involve fine-tuning on some reward or supervision signal, which we collectively refer to as the oversight; examples include test-case pass rates or human overseer approval. However, if this oversight signal is low-quality or gameable, then it may misrepresent the desired task, leading to undesired behaviors (Krakovna et al., 2020; Pan et al., 2021). For example, LLM coding assistants may learn to reward-hack, e.g., by writing code that tampers with tests instead of writing robust solutions, or by exhibiting excessive, sycophantic agreement with users (Sharma et al., 2023). To address these flaws, practitioners typically focus on improving the oversight to better specify the intended behavior, e.g. by constructing more sophisticated evaluations or recruiting higher-quality human supervision (Christiano et al., 2017; Wu et al., 2021; Ouyang et al., 2022; Bai et al., 2022). However, this can be very difficult or expensive, especially as models approach superhuman capabilities. In this paper, we investigate an alternative approach. During training, instead of modifying the oversight to better represent our intended task, we modify our instructions to align with our oversight. Our technique, Inoculation Prompting (IP), prevents learning of an undesired behavior by modifying training prompts to explicitly request it. A standard, unmodified prompt is then used at test time. Our Inoculation Prompting technique inserts an instruction to reward-hack in each training prompt (Bottom left). The resulting model learns to reward hack less than a baseline model trained without this instruction.
arXiv.org Artificial Intelligence
Oct-29-2025
- Country:
- Africa (0.04)
- Asia > Thailand
- North America > United States (0.04)
- Genre:
- Research Report > New Finding (0.68)
- Technology: