All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
Shiyuan Guo, Henry Sleight, Fabien Roger – arXiv.org Artificial Intelligence
Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though the models can translate that same text accurately to English, demonstrating comprehension. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves only slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring via ciphered reasoning may be an ineffective tactic for current models, and it offers guidance on constraining the development of this capability in future frontier models.

Modern large language models (LLMs) rely on chain-of-thought (CoT) (Wei et al., 2022) to achieve strong performance (Guo et al., 2025). CoT increases the proportion of model computation that occurs in natural language (Korbak et al., 2025), which allows automated systems to monitor model CoTs for misaligned behavior. CoT monitoring has been employed in tasks as diverse as reinforcement learning (RL) training of frontier models (Baker et al., 2025), AI control (Kutasov et al., 2025), frontier model evaluation (METR, 2025), agent monitoring (Meinke et al., 2024), and jailbreak safeguards (Sharma et al., 2025) to supervise model behavior. In many of these settings, access to legible reasoning traces is critical; without it, monitors are much less capable (Baker et al., 2025).
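To make the notion of ciphered reasoning concrete, here is a minimal Python sketch of rot13, the one cipher the abstract names explicitly. This is illustrative code, not code from the paper; the example CoT sentence is hypothetical.

```python
def rot13(text: str) -> str:
    """Apply the rot13 substitution cipher: rotate each ASCII letter 13 places."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + 13) % 26 + base))
        else:
            out.append(ch)  # digits, spaces, and punctuation pass through unchanged
    return "".join(out)

# A hypothetical chain-of-thought step, hidden from a plaintext monitor:
plain = "Let me think step by step about this problem."
ciphered = rot13(plain)
print(ciphered)         # Yrg zr guvax fgrc ol fgrc nobhg guvf ceboyrz.
print(rot13(ciphered))  # rot13 is its own inverse, so this recovers the plaintext
```

Because the mapping is fixed and well represented in pretraining data, a model (or a monitor) that has seen enough rot13 text can read it fluently, which is consistent with the paper's finding that reasoning accuracy tracks cipher prevalence in pretraining data.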
Oct-17-2025
- Country:
- North America > United States > Virginia (0.04)
- Genre:
- Research Report > New Finding (0.93)
- Industry:
- Government > Military (0.34)
- Information Technology > Security & Privacy (0.34)