Output Supervision Can Obfuscate the Chain of Thought
Jacob Drori, Luke Marks, Bryce Woodworth, Alex Cloud, Alexander Matt Turner
arXiv.org Artificial Intelligence
OpenAI (2025a) showed that training against a chain-of-thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior that the monitor cannot detect. They proposed keeping CoTs monitorable by training only against output monitors that do not have access to the CoT. We show that such training can still cause obfuscated CoTs via two mechanisms. First, when a model is trained to produce a safe-looking output, it may generalize to making its CoTs look safe. Second, since later tokens are conditioned on earlier ones, safe-looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced. We introduce two mitigations to address these two issues, which achieve a Pareto improvement in monitorability and task performance compared to regular training. Our work implies that preserving CoT monitorability is more difficult than previously thought; we suggest practical guidelines for AI developers to maintain monitorable CoTs.

Reasoning models, which generate a CoT before their final answer, dominate mathematics and coding benchmarks (Guo et al., 2025; OpenAI, 2025b; Qwen Team, 2025). Korbak et al. (2025) argue that AI developers should aim to maintain the ability to detect model misbehavior by analyzing CoTs ("CoT monitorability"), claiming that it offers a "unique opportunity for AI safety". Baker et al. (2025) show that training frontier models against a CoT monitor can cause their CoTs to become obfuscated: unwanted behaviors persist in a way that is undetectable by the monitor. To address this, they recommend against directly optimizing the CoT. However, we show that even training against a monitor that sees only the final output (and not the CoT) can still cause obfuscated CoTs. We refer to this effect as feedback spillover, and show that it can occur for two reasons.

Figure 1: Feedback spillover: training against an output monitor obfuscates the CoT.
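The second mechanism can be made concrete with a toy sketch. This is not the paper's code; it is a minimal single-trajectory REINFORCE-style objective, with hypothetical argument names, illustrating how one scalar reward from an output-only monitor is nonetheless credited to every sampled token, including the CoT tokens.

```python
def reinforce_loss(logprobs_cot, logprobs_output, reward):
    """Toy REINFORCE loss for one sampled trajectory.

    logprobs_cot / logprobs_output: log-probabilities of the sampled
    CoT and output tokens; reward: scalar from an output-only monitor.
    """
    # Standard policy gradients sum log-probs over the whole sequence,
    # so the output monitor's reward also scales the CoT terms.
    total_logprob = sum(logprobs_cot) + sum(logprobs_output)
    # Minimizing this loss increases the probability of *all* tokens in
    # rewarded trajectories: CoT patterns that raise the chance of a
    # safe-looking output are reinforced, even though the monitor never
    # read the CoT.
    return -reward * total_logprob
```

Because the CoT log-probabilities appear inside `total_logprob`, any CoT phrasing that correlates with monitor-approved outputs is upweighted, which is exactly the conditioning pathway described above.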
1 To mitigate parametric feedback spillover, we generate the CoT and output using two distinct models, which we refer to as the "Mind" and "Face" respectively, as proposed by Kokotajlo & Demski (2024).
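The Mind/Face split in the footnote above can be sketched as follows. This is a hedged illustration, not the authors' implementation: `mind` and `face` are hypothetical callables standing in for the two models, and the key point is only the factorization of generation, with the output monitor's reward used to update the Face alone.

```python
def generate(mind, face, prompt):
    """Generate a (CoT, output) pair with two separate models.

    The Mind writes the chain of thought; the Face conditions on that
    CoT to write the final output. In training, only the Face would
    receive gradient from the output monitor's reward, so the Mind's
    CoT is never directly optimized against the monitor.
    """
    cot = mind(prompt)          # Mind produces the CoT
    output = face(prompt, cot)  # Face produces the monitored output
    return cot, output
```

The design choice this encodes: pressure from the output monitor cannot spill into the CoT through shared parameters, because the CoT-producing parameters are in a different model.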
Nov-18-2025