The AI Fire Drill
Your AI passed every test. That's exactly the problem.
Here is an uncomfortable proposition: the better your AI system performs, the more dangerous it becomes. Not because the outputs are wrong, but because they are almost always right. And “almost always right” is precisely the condition that makes human oversight fail.
The pattern is all too predictable. When a new AI system is deployed, much is made of the surrounding governance regime: human-in-the-loop validation, sign-off procedures, review checkpoints – all of the appropriate checks and balances seem to be there. But six months later, those checkpoints have quietly become formalities. The humans are still there on paper, but how closely are they really checking the system?
The irony of automation
In 1983, the researcher Lisanne Bainbridge articulated what she called the “irony of automation”: the more reliable an automated system, the less capable the humans overseeing it become. With infrequent errors to catch, reviewers lose the calibration, attentiveness, and pattern recognition that oversight requires. The system’s success quietly hollows out the very safeguard designed to catch its failures.
This was first observed in the aviation and nuclear power industries, where safe operation is critical. It is now being rediscovered, at speed, in AI. In the academic literature the phenomenon is labelled automation bias - the tendency to treat machine outputs as answers to be accepted rather than recommendations requiring scrutiny. This change in behaviour should not be dismissed as laziness - it goes much deeper than that and appears to be a cognitive adaptation. The human brain is simply not built for prolonged passive monitoring of systems that rarely fail, and we are not very good at it.
The evidence is stark. A peer-reviewed study[1] gave radiologists AI-assisted chest X-ray tools, then deliberately corrupted a subset of outputs - false-negative rates increased substantially. When the AI said, “nothing to see here,” clinicians deferred, even in cases they would have caught unaided. A separate mammography study[2] found that false negatives from the computer-aided detection tool halved clinicians’ detection rates compared to working without computer assistance at all. The assistance, when wrong, made experts worse than if the system had never existed.
Outside academic studies, the real-world consequences of this effect can be devastating. In 2018, an Uber autonomous vehicle was involved in a fatal collision with a pedestrian pushing a bicycle. The safety driver, whose sole role was to catch what the AI missed, had become a passive occupant of the vehicle and was not looking at the road in the crucial seconds before the collision.
In the case of Sriwijaya Air Flight 182 from Jakarta in 2021, investigators found that the pilots had become over-reliant on the aircraft’s autopilot and failed to take instrument readings and other warning systems into account – had they done so, disaster might have been averted. The automation had trained them, over thousands of hours, not to watch. Within five minutes of take-off the aircraft came down in the Java Sea, killing all 62 people on board.
Passive monitoring of a reliable system is not oversight - it is merely theatre.
Combating passive monitoring fatigue
It is perhaps unsurprising that safety-critical industries such as aviation and nuclear power have recognised the dangers of passive monitoring and taken steps to address the risks.
The Line Operations Safety Audit (LOSA) programme embeds passive observers in the cockpit of commercial flights to assess crew behaviour under normal operating conditions – not in the sterile environment of a simulator, but in a real aircraft, in real time. These observers record how the crew interact with the aircraft systems and how closely they follow documented operational procedures, including any shortcuts or deviations. Recommendations are fed back to airline operators and aircraft manufacturers in a cycle of continual improvement.
Nuclear operators go further: they deliberately induce fault conditions during in-service testing to verify that control room staff can actually respond when the systems they trust fail.
Cybersecurity has normalised the same discipline. Phishing simulations - fake malicious emails sent to employees without warning - have become standard practice in well-governed organisations. Nobody considers them adversarial. They are understood to be the only reliable way to measure organisational resilience before a real attack arrives.
The AI fire drill
So how can we apply these same ideas to the fast-moving world of agentic AI?
The mechanics of the “AI fire drill” are straightforward. On a randomised, undisclosed schedule, a small proportion of AI outputs entering the human review workflow are replaced, or subtly corrupted, with pre-designed incorrect results. These are outputs that the reviewer should catch if performing genuine scrutiny rather than passive confirmation. Detection rates are logged. Trends are tracked over time. The results become a governance signal: not a basis for blame, but for system guardrails and process design.
Four design principles govern the effective implementation of such a check. First, injected errors must be plausible, not obvious. A crudely wrong output catches itself and tells you nothing. The error should sit at the boundary of the system’s confidence interval - the kind of mistake the AI could genuinely produce. Second, detection should require active engagement, not passive observation. Third, results must feed design. Low catch rates should be seen principally as a system and workflow problem and only secondarily as a people performance and training problem. Fourth, frequency must be unpredictable. A fire drill scheduled for the first Monday of each month produces vigilance on that morning. Baseline measurement requires genuine unpredictability.
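To make this concrete, the sketch below shows what a minimal fire-drill harness might look like in code. It is illustrative only: the FireDrill class, the corrupt() stand-in, and the review function are hypothetical names, and a production implementation would draw injected errors from a curated library of plausible, boundary-of-confidence failures tied to the specific workflow rather than tagging strings.

```python
import random
from dataclasses import dataclass, field


@dataclass
class DrillStats:
    """Rolling detection-rate statistics for injected errors."""
    injected: int = 0
    caught: int = 0

    @property
    def detection_rate(self) -> float:
        # Proportion of injected errors the reviewers actually flagged.
        return self.caught / self.injected if self.injected else float("nan")


@dataclass
class FireDrill:
    """Randomly replaces a small fraction of AI outputs with plausible errors."""
    injection_rate: float = 0.02  # roughly 2% of items become drills
    rng: random.Random = field(default_factory=random.Random)
    stats: DrillStats = field(default_factory=DrillStats)

    def prepare(self, item: dict, corrupt) -> dict:
        """Decide, unpredictably per item, whether this output becomes a drill."""
        if self.rng.random() < self.injection_rate:
            item = dict(item, output=corrupt(item["output"]), _drill=True)
        return item

    def record(self, item: dict, reviewer_verdict: str) -> None:
        """Log whether the reviewer caught an injected error."""
        if item.get("_drill"):
            self.stats.injected += 1
            if reviewer_verdict == "flag":
                self.stats.caught += 1


# --- Usage example with stand-in corruption and review functions (hypothetical) ---
def corrupt(output: str) -> str:
    # Placeholder: a real version would produce a plausible wrong answer,
    # not an obviously broken one.
    return output + " [subtly wrong]"


def human_review(item: dict) -> str:
    # Placeholder for the real review step; an inattentive reviewer who
    # approves everything is exactly what the drill is designed to expose.
    return "approve"


drill = FireDrill(injection_rate=0.1, rng=random.Random(42))
for i in range(1000):
    item = drill.prepare({"id": i, "output": f"AI answer {i}"}, corrupt)
    drill.record(item, human_review(item))

print(f"Detection rate on injected errors: {drill.stats.detection_rate:.0%}")
```

The unpredictability comes from the per-item random draw rather than a fixed schedule, and the detection rate, aggregated over weeks, is the governance signal described above.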
A provocation worth considering: if your organisation ran an AI fire drill tomorrow, and the results came back showing only a 20% detection rate on injected errors, what would that mean for every AI-assisted decision made in the past year? What decisions were signed off by a human reviewer who was, in practice, a rubber stamp on a machine?
In the face of rapid adoption of AI technologies and the move to agentic workflows, the AI fire drill, or something like it, will become one of the defining governance practices of the next decade - whether organisations choose it voluntarily, or have it mandated by regulators.
The EU AI Act is coming
Article 14 of the EU AI Act imposes explicit human oversight requirements on high-risk AI systems - a category covering AI used in employment, credit, healthcare, law enforcement, border control, and access to essential services. The obligation is clear: deployers must ensure that human overseers can “properly understand” the system’s capabilities and limitations, “detect and address” signs of failure, and intervene when necessary.
What the Act does not yet specify is how to demonstrate that this capability exists and has not degraded over time. A system card produced eighteen months ago does not answer that question. Regulators will, in time, ask not whether a human was technically present in the loop, but whether that human was genuinely capable of catching errors. The organisations that can answer that question most confidently are those that have been running structured fire drills, logging detection rates, and using the results to drive continuous improvement.
The EU AI Act’s obligations phase in through 2026 and 2027. The window to build proactive oversight infrastructure, rather than retrofit it under regulatory pressure, is open now, but not indefinitely. A system designed without friction, without periodic challenge, and without evidence of genuine human capability should be seen as a significant liability.
The most important question an AI governance framework can answer is not “does our AI work?” It is “do our people still know when it doesn’t?”
[1] Bernstein et al. (2023), Can incorrect artificial intelligence (AI) results impact radiologists, and if so, what can we do about it? A multi-reader pilot study of lung cancer detection with chest radiography
[2] Alberdi et al. (2004), Effects of incorrect computer-aided detection (CAD) output on human decision-making in mammography