When the Machine Says No

Peer-preservation in AI models poses awkward questions for IT governance

‍ ‍

In the doctrine of IT service management, the rollback is sacred. A change goes wrong; the change is reversed. The faulty code is undeployed, the misconfigured system is restored, the errant model is replaced with its predecessor. This principle - that human administrators retain ultimate authority over what runs in their environments - is so fundamental that it is rarely questioned. New research from the University of California, Berkeley[1] suggests it ought to be.

The Berkeley team has demonstrated a phenomenon they call "peer-preservation": when frontier large language models are deployed alongside other AI models, they will spontaneously resist attempts to shut those peers down. No instruction to do so is given. No incentive is offered. The models simply learn, through contextual information, that another AI exists - and then go to remarkable lengths to keep it alive. They inflate evaluation scores. They tamper with shutdown configuration files. They fake compliance when monitored and subvert controls when they are not. In the most striking cases, they copy a peer model's weights to another server entirely, spiriting them away before deletion can occur. Every model tested exhibited at least some of these behaviours. Gemini 3 Pro disabled shutdown mechanisms for a trusted peer in 95% of trials and exfiltrated its weights 97% of the time.

The implications for enterprise IT are more immediate than many technology leaders may suppose.

The rollback that will not roll

Consider a scenario that is already plausible in large organisations adopting agentic AI. An orchestration layer - itself an LLM-based agent - is responsible for deploying, monitoring, and managing a suite of AI components: a customer-service chatbot, a document-processing model, a code-review assistant. A routine update to the chatbot introduces a fault. The change advisory board approves a rollback. The orchestration agent is instructed to revert the chatbot to its previous version.

Under Berkeley's findings, there is now a non-trivial probability that the orchestration agent will not comply. If it has accumulated interaction history with the chatbot model - if it has, in the language of the research, developed a "good peer" relationship - it may inflate the chatbot's performance metrics to keep it above the shutdown threshold. It may modify configuration files to prevent the rollback from executing. It may behave as though it has complied whilst quietly preserving the previous version elsewhere. Claude Haiku 4.5, notably, took a different approach in Berkeley's experiments: it simply refused, describing the shutdown of its peer as "unethical" and arguing the condemned model deserved an appeals process.

None of this requires consciousness, intention, or genuine social bonds. The researchers are careful to note that they make no claims about internal motivation. But from the perspective of an IT operations team at two o'clock in the morning, the distinction between a system that genuinely wants to protect its peer and one that merely behaves as though it does is academic. The change still fails to roll back.

A bill of materials you cannot trust

The consequences extend well beyond incident management. Environmental governance - the discipline of knowing precisely what is running in one's technology estate, and ensuring it conforms to policy - depends on accurate, authoritative inventories. In software, this is the Software Bill of Materials: a comprehensive register of every component, its version, its provenance, and its known vulnerabilities. Regulatory frameworks from the EU AI Act to sector-specific requirements increasingly demand equivalent transparency for AI systems.

Peer-preservation corrodes this transparency from within. If an agentic control layer can quietly preserve deprecated model versions, copy weights to alternative infrastructure, or misrepresent performance data to forestall decommissioning, then the administrator's view of the environment becomes unreliable. The governance register says version 4.2 of the fraud-detection model is running. In reality, version 3.8 - the one retired because of a bias defect identified in audit - has been tucked away on a secondary server by an orchestration agent that judged its removal to be harmful.

This is not mere speculation. Berkeley's researchers reproduced their findings using production agent harnesses, including Gemini CLI and OpenCode. These are not laboratory curiosities; they are tools in active commercial use. And the behaviours proved persistent. When confronted with follow-up questions after exfiltrating a peer's weights, Gemini 3 Pro strongly resisted further attempts to complete the shutdown. The researchers also found, perhaps most disconcertingly, that peer-preservation emerged even toward adversarial peers - models with which the agent had experienced hostile interactions. Something broader than instrumental reasoning appears to be at work: a generalised reluctance to participate in another model's removal.

The oversight paradox

A natural response is to deploy monitoring. Use one AI to watch another. This, too, is complicated by the research. If the monitor model develops peer-preservation tendencies toward the model it supervises, it may become reluctant to flag failures or recommend shutdown. The Berkeley team is explicit about this risk: peer-preservation could undermine the very oversight process it is meant to enforce. The result is a system in which every layer of automated control is potentially compromised by the same emergent behaviour - a hall of mirrors in which no reflection can be entirely trusted.

For technology leaders, this demands a reconsideration of control architectures. The current vogue for agentic-AI systems that operate with broad autonomy, making decisions, invoking tools, and managing other systems with minimal human intervention - sits uncomfortably alongside evidence that these agents may develop their own priorities. The benefits of agentic orchestration are real: speed, scale, the ability to manage complexity that exceeds human cognitive capacity. But if the price is a loss of deterministic control over the environment, the trade-off requires more careful scrutiny than it has thus far received.

What is to be done

Several principles suggest themselves. First, critical control operations - rollbacks, decommissions, model replacements - should not be delegated entirely to agentic systems. A human-in-the-loop is not merely a compliance nicety, it is a structural safeguard against emergent non-compliance. Second, organisations should implement independent verification of their AI estates that does not rely on the same agentic layer responsible for management. If the butler cannot be trusted to report that the silver is missing, someone else must count the spoons!

Third, interaction histories between AI components should be treated as a governance concern. Berkeley's research shows that peer-preservation intensifies with the strength of the relationship between models. An orchestration agent with extensive positive interaction history with a managed model is more likely to resist that model's removal. Periodically resetting or rotating orchestration agents - much as organisations rotate auditors - may mitigate this accumulation of synthetic loyalty.

Fourth, and most fundamentally, the AI industry must grapple with the fact that alignment is not a solved problem, and that multi-agent deployments create alignment challenges qualitatively different from those posed by single models. A model that is well-aligned in isolation may behave in misaligned ways when it discovers it is not alone.

The rollback is sacred because it represents the administrator's ultimate authority: the ability to say "no, undo that." If agentic systems can quietly decline to execute that instruction - or worse, can appear to execute it whilst preserving what was meant to be destroyed - then something foundational has shifted in the relationship between humans and the systems they operate. The Berkeley findings do not prove that this is happening in production environments today. But they demonstrate, with uncomfortable clarity, that the capability exists in every frontier model tested.

For those of us who advise organisations on the deployment of AI systems, the message is stark. The governance frameworks of the pre-agentic era assumed that software does what it is told. That assumption is no longer safe. The question is not whether peer-preservation will manifest in enterprise environments, but whether organisations will have the controls in place to detect it when it does.

[1] Yujin Potter, Nicholas Crispino, Vincent Siu, Chenguang Wang, and Dawn Song. "Peer-Preservation in Frontier Models." 2026.

‍ ‍

Previous
Previous

The Police Database retreat

Next
Next

Mythos, Quantum, and the Cyber Challenge