Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik

arXiv.org, Machine Learning

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise, and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
