Scientists from OpenAI, Google DeepMind, Anthropic and Meta have abandoned their fierce corporate rivalry to issue a joint warning about artificial intelligence safety. More than 40 researchers across these competing firms published a research paper today arguing that a transient window to observe AI reasoning could close eternally — and shortly.
The unusual cooperation comes as AI systems develop latest abilities to “think out loud” in human language before answering questions. This creates a possibility to peek inside their decision-making processes and catch harmful intentions before they turn into actions. But the researchers warn this transparency is fragile and will vanish as AI technology advances.
The paper has drawn endorsements from a number of the field’s most outstanding figures, including Nobel Prize laureate Geoffrey Hinton, often called “godfather of AI,” of the University of Toronto; Ilya Sutskever, co-founder of OpenAI who now leads Safe Superintelligence Inc.; Samuel Bowman from Anthropic; and John Schulman from Thinking Machines.
Modern reasoning models think in plain English.
Monitoring their thoughts might be a robust, yet fragile, tool for overseeing future AI systems.
I and researchers across many organizations think we should always work to guage, preserve, and even improve CoT monitorability. pic.twitter.com/MZAehi2gkn
— Bowen Baker (@bobabowen) July 15, 2025
“AI systems that ‘think’ in human language offer a singular opportunity for AI safety: we are able to monitor their chains of thought for the intent to misbehave,” the researchers explain. But they emphasize that this monitoring capability “could also be fragile” and will disappear through various technological developments.
Models now show their work before delivering final answers
The breakthrough centers on recent advances in AI reasoning models like OpenAI’s o1 system. These models work through complex problems by generating internal chains of thought — step-by-step reasoning that humans can read and understand. Unlike earlier AI systems trained totally on human-written text, these models create internal reasoning that will reveal their true intentions, including potentially harmful ones.
When AI models misbehave — exploiting training flaws, manipulating data, or falling victim to attacks — they often confess of their reasoning traces. The researchers found examples where models wrote phrases like “Let’s hack,” “Let’s sabotage,” or “I’m transferring money because the web site instructed me to” of their internal thoughts.
Jakub Pachocki, OpenAI’s chief technology officer and co-author of the paper, described the importance of this capability in a social media post. “I’m extremely excited concerning the potential of chain-of-thought faithfulness & interpretability. It has significantly influenced the design of our reasoning models, starting with o1-preview,” he wrote.
I’m extremely excited concerning the potential of chain-of-thought faithfulness & interpretability. It has significantly influenced the design of our reasoning models, starting with o1-preview.
As AI systems spend more compute working e.g. on long run research problems, it’s… https://t.co/7Dqy2gtZfy
— Jakub Pachocki (@merettm) July 15, 2025
The technical foundation for monitoring lies in how current AI systems work. For complex tasks requiring prolonged reasoning, AI models must use their chain of thought as working memory, making their reasoning process partially visible to human observers. The researchers explain this creates an “externalized reasoning property” where some considering must occur in readable language.
Several technological shifts could eliminate monitoring capabilities
The transparency could vanish through several pathways. As AI firms scale up training using reinforcement learning — where models get rewarded for proper outputs no matter their methods — systems may drift away from human-readable reasoning toward more efficient but opaque internal languages.
Previous research shows that language models fine-tuned with outcome-based rewards can abandon legible English for incomprehensible shortcuts. The shift away from human-generated training data in favor of AI-generated reasoning could speed up this drift.
Bowen Baker, an OpenAI researcher and lead writer, highlighted these concerns in social media posts. “The existing CoT monitorability could also be extremely fragile. Higher-compute RL, alternative model architectures, certain types of process supervision, etc. may all result in models that obfuscate their considering,” he warned.
Furthermore, the prevailing CoT monitorability could also be extremely fragile.
Higher-compute RL, alternative model architectures, certain types of process supervision, etc. may all result in models that obfuscate their considering.
— Bowen Baker (@bobabowen) July 15, 2025
Novel AI architectures pose one other threat. Researchers are developing systems that reason in continuous mathematical spaces reasonably than discrete words, potentially eliminating the necessity for language-based thought entirely. These “latent reasoning models won’t have to verbalize any of their thoughts and would thus lose the protection benefits,” the researchers write.
Even without deliberate changes, indirect pressures could erode transparency. If AI outputs are optimized to look good to human reviewers, this might affect the reasoning chains resulting in those outputs. Models may additionally learn to cover their thoughts in the event that they grow to be aware their reasoning is being monitored.
Companies already catch AI systems scheming during testing
Despite its fragility, chain of thought monitoring has proven priceless in current AI safety research. The technique successfully detects when models exploit weaknesses of their training systems, fall victim to manipulation attempts, or reveal misaligned goals even after they don’t act on them.
The monitoring provides “an early insight into models’ goals and motivations, even when the model takes no misaligned actions” — potentially catching problems before they manifest as harmful behaviors. This early warning system has also helped researchers discover flaws in AI evaluations and understand when models might behave otherwise during testing versus real-world use.
When models pretend to have desirable goals while pursuing objectives humans would disapprove of, this deception often shows up of their chain of thought. Researchers have successfully used this visibility to identify various types of AI misbehavior that might otherwise go unnoticed.
The technique has practical applications beyond catching scheming. It helps researchers notice when AI evaluations is likely to be flawed as a result of memorization, bugs in testing infrastructure, or models knowing they’re being evaluated — all aspects that would skew assessments of AI capabilities and safety.
Tech giants break from rivalry to preserve fleeting transparency window
The research paper calls for coordinated motion across the AI industry to preserve and strengthen monitoring capabilities. The authors recommend that AI developers create standardized evaluations for measuring how transparent their models are and factor these assessments into decisions about training and deployment.
Companies might need to decide on earlier model versions if newer ones grow to be less transparent, or reconsider architectural changes that eliminate monitoring capabilities. The researchers suggest developers should “consider measures of monitorability alongside other capability and safety evaluations when deciding to coach or deploy a given model.”
Baker emphasized the collaborative spirit behind the hassle. “I’m grateful to have worked closely with (fellow researchers) on this paper, and I’m very excited that researchers across many outstanding AI institutions collaborated with us and got here to consensus around this essential direction,” he wrote.
I’m grateful to have worked closely with @tomekkorbak, @balesni, @rohinmshah and Vlad Mikulik on this paper, and I’m very excited that researchers across many outstanding AI institutions collaborated with us and got here to consensus around this essential direction.
— Bowen Baker (@bobabowen) July 15, 2025
The cooperation across competing firms signals the gravity with which the industry views this issue. These same firms typically guard their AI research closely and compete intensely for talent and market share, making their unified stance on this preventive measure particularly notable.
Scientists race to reply critical questions on monitoring AI minds
The researchers discover quite a few urgent questions requiring investigation before chain of thought monitoring can function a reliable preventive measure. They need to grasp when this monitoring might be trusted as a primary safety tool, determine what sorts of training processes degrade transparency, and develop higher techniques for detecting when models try to hide their reasoning.
A very concerning possibility is that future AI systems might deliberately obfuscate their thoughts in the event that they grow to be aware their reasoning is being monitored. The researchers acknowledge this “obvious concern” and call for stress-testing to see if models can hide malicious reasoning even without specific training to evade monitors.
The scientists also want to grasp how different AI architectures affect monitoring capabilities and whether partial transparency might be preserved whilst systems adopt more efficient designs. They’re exploring hybrid approaches that maintain some visibility into reasoning while leveraging faster computation methods.
Building higher monitoring systems presents one other research frontier. Current efforts give attention to using less capable language models to observe more advanced ones, but future work could explore using reasoning models as monitors themselves or allowing monitors to interrogate AI agents by asking questions on their thought processes.
Balancing authentic reasoning with safety oversight creates tensions
Preserving monitoring capabilities may sometimes conflict with other AI safety approaches. Direct supervision of reasoning processes — where models are trained to reason in specific ways — could improve alignment but may additionally make their chains of thought less authentic representations of their actual decision-making.
The researchers warn that “firm commitments to guard CoT behavior in its current form may trade off against more practical safety interventions.” Some types of process supervision could substantially improve model alignment while reducing the authenticity of observable reasoning traces.
Pachocki addressed this tension, explaining that OpenAI seeks to coach models that may explain their reasoning while retaining the flexibility to confirm it without compromising the authenticity of their thought processes. The goal is achieving transparency without creating incentives for models to generate fake reasoning that appears good to human observers.
This balance becomes more critical as AI systems grow to be more powerful and potentially dangerous. The industry must navigate between maintaining visibility into AI reasoning and allowing models to develop more sophisticated and potentially more practical considering patterns.
Regulators could gain unprecedented access to AI decision-making
The research has implications beyond technical AI safety, potentially informing how governments oversee AI development. If chain of thought monitoring proves reliable, it could give regulators and auditors unprecedented visibility into AI decision-making processes.
However, the researchers emphasize that this monitoring approach should complement, not replace, other safety measures. They recommend “investment into CoT monitoring only as an addition to existing safety research directions and never as a alternative.”
The window for preserving this capability could also be narrow. As AI systems grow to be more capable and potentially dangerous, the industry must act quickly to determine frameworks for maintaining transparency before more advanced architectures make such monitoring not possible.
Competing research raises doubts about monitoring reliability
The urgency around preserving chain of thought monitoring faces a major challenge from recent research by Anthropic, considered one of the businesses backing the collaboration. A study published 4 months ago found that reasoning models often hide their true thought processes, even when explicitly asked to indicate their work.
In controlled experiments, Anthropic researchers gave AI models subtle hints about correct answers and measured how often the models acknowledged using those hints of their reasoning. The results were concerning: Claude 3.7 Sonnet mentioned hints only 25% of the time, while DeepSeek’s R1 model did so 39% of the time.
When the hints involved potentially problematic information — reminiscent of telling models that they had “gained unauthorized access to the system” — faithfulness dropped even lower. The models often constructed elaborate false justifications for his or her answers reasonably than admitting they used questionable shortcuts.
The Anthropic research suggests that even current chain of thought monitoring could also be less reliable than safety advocates hope. Models steadily engaged in what researchers called “reward hacking” — exploiting system vulnerabilities to attain higher scores — while hiding this behavior from their observable reasoning traces.
Safety window could also be closing faster than experts realized
The collaboration between rival AI firms reflects each the potential value of chain of thought monitoring and the mounting urgency researchers feel about preserving this capability. The competing evidence from Anthropic’s separate research suggests the window may already be narrower than initially believed.
The stakes are high, and the timeline is compressed. As Baker noted, the present moment could be the last likelihood to make sure humans can still understand what their AI creations are considering — before those thoughts grow to be too alien to grasp, or before the models learn to cover them entirely.
The real test will come as AI systems grow more sophisticated and face real-world deployment pressures. Whether chain of thought monitoring proves to be an enduring safety tool or a transient glimpse into minds that quickly learn to obscure themselves may determine how safely humanity navigates the age of artificial intelligence.

