The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs
Howe, Nikolaus, Carroll, Micah
arXiv.org Artificial Intelligence
The use of reinforcement learning (RL) with chain-of-thought (CoT) reasoning has emerged as a promising approach for developing more capable language models. In turn, this has led to investigation of CoT monitoring as a compelling method for detecting harmful behaviors such as reward hacking, under the assumption that models' reasoning processes reflect their internal decision-making. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies. A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors like sycophancy, but what happens to the model's reasoning process when these instructions conflict with learned behaviors? We investigate this question in simple settings and find that models engage in systematic motivated reasoning, generating plausible-sounding justifications for violating their instructions while downplaying potential harms. Beyond being an interesting property of training, we find that while motivated reasoning can be detected by most frontier reasoning models, smaller LLM judges can fail to identify a portion of it, and in rare cases can themselves be persuaded that the reasoning is correct, despite it contradicting clear instructions. This capability gap raises concerns that as models become more sophisticated, their motivated reasoning may become increasingly difficult for monitors to detect. Our results underscore the need to account for motivated reasoning when relying on chain-of-thought processes for model evaluation and oversight.
Figure 1: We perform RL finetuning on Llama 3 8B Instruct on behaviors of different kinds. When asked to act against their trained behaviors in evaluations throughout training, models transition from performing mostly genuine reasoning to highly motivated reasoning, twisting the constitutional principles provided to them in the prompt to support the behaviors incentivized via training.
The integration of reinforcement learning (RL) and chain-of-thought (CoT) reasoning has emerged as a promising approach for developing more capable language models (Jaech et al., 2024; Guo et al., 2025). Recent work has shown that encouraging models to output "thinking tokens" before committing to a final answer leads to impressive performance, especially on tasks with verifiable answers where rewards can be automatically generated, such as mathematics and programming problems (Shao et al., 2024; Zhu et al., 2024).
Oct-21-2025