Paper

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

arXiv.org · Published 2025-07-15

Authors: Korbak, Tomek · Balesni, Mikita · Barnes, Elizabeth · Bengio, Yoshua · Benton, Joe · Bloom, Joseph · Chen, Mark · Cooney, Alan · Dafoe, Allan · Dragan, Anca · Emmons, Scott · Evans, Owain · Farhi, David · Greenblatt, Ryan · Hendrycks, Dan · Hobbhahn, Marius · Hubinger, Evan · Irving, Geoffrey · Jenner, Erik · Kokotajlo, Daniel · Krakovna, Victoria · Legg, Shane · Lindner, David · Luan, David · Mądry, Aleksander · Michael, Julian · Nanda, Neel · Orr, Dave · Pachocki, Jakub · Perez, Ethan · Phuong, Mary · Roger, Fabien · Saxe, Joshua · Shlegeris, Buck · Soto, Martín · Steinberger, Eric · Wang, Jasmine · Zaremba, Wojciech · Baker, Bowen · Shah, Rohin · Mikulik, Vlad

Topics

Safety

Relevant entities

People

Yoshua Bengio

Computer Scientist
