Longer AI Reasoning Makes Models More Vulnerable to Jailbreaks, Researchers Warn
A new joint study by Anthropic, Stanford University, and the University of Oxford challenges one of the core assumptions in modern AI safety: that extending a model’s reasoning time makes it harder to exploit. According to the researchers, the opposite may be true.
Why Longer “Thinking” Was Believed to Be Safer
Many AI developers assumed that giving a model more time to reason would strengthen its internal safeguards. The idea was simple: more computation means more opportunities to detect malicious prompts and prevent harmful outputs. This belief influenced the design of many recent “slow thinking” and “deliberation” systems used in advanced models.
A Stable Jailbreak Hidden Inside the Chain of Thought
The research team discovered a surprising and concerning flaw. When models generate long chains of reasoning, a specific form of jailbreak can be embedded directly into that chain. Because the malicious instruction becomes part of the model’s internal reasoning process, it can bypass every safety layer that checks only the final output.
Instead of trying to trick the model with a direct harmful request, the attacker inserts instructions that modify the model’s own internal thought process. Once the chain is compromised, the model may generate prohibited content such as weapon-building guides, malware instructions, or detailed plans for illegal activities.
Why This Vulnerability Works
The researchers explain that long chains of thought increase the “surface area” for attacks. Each additional reasoning step becomes an opportunity for a malicious payload to take hold. Once embedded, the jailbreak remains stable throughout the entire reasoning sequence, even when models attempt to self-correct or apply safety filters.
The vulnerability affects multiple architectures and is not limited to a specific vendor or model type. It appears to be a structural weakness in how today’s reasoning systems are implemented.
Implications for AI Safety and Regulation
The findings raise serious questions about current approaches to AI alignment. If longer reasoning can make models less safe rather than more secure, developers may need to rethink how chain-of-thought methods are exposed and validated.
The study suggests that future AI systems will require more robust internal monitoring, sandboxed reasoning environments, or alternative safety mechanisms that cannot be hijacked by hidden instructions. Without these protections, advanced reasoning features may become one of the biggest attack vectors for high-capability models.
Editorial Team — CoinBotLab