Human Error Is Predictable. Cascading Failure Is Optional.

By Domenic DiNatale

After every breach, someone gets blamed.

The employee who clicked the link. The administrator who reused a password. The contractor who left a storage bucket exposed. The developer who didn't validate input. The security team that missed the alert.

This framing is comfortable because it comes with a solution: better training. Stricter policies. Sharper consequences. If we could just get people to behave correctly, the problem would be under control.

That comfort is the problem.

We've known for decades — in aviation, in nuclear operations, in healthcare — that human error rates are not a problem to be solved. They're a constant to be designed around. Humans make mistakes at a measurable, predictable rate that doesn't change meaningfully with training, incentives, or threat of consequence. The error rate compresses slightly at the margins. The underlying distribution holds.

Every domain that has achieved meaningful reliability has accepted this. Security, with notable exceptions, has not. We continue to design systems that require human perfection and then act surprised when humans are imperfect.

The Human Firewall and Why It Fails

The term "human firewall" tells you everything about the mental model behind it.

A firewall inspects traffic, enforces rules, and blocks what shouldn't pass. The metaphor imagines a human who can do the same: read every email, evaluate every link, identify every malicious payload, and stop it before it enters the system. Trained well enough, attentive enough, motivated enough.

This model fails for the same reason any system designed around perfect performance fails: performance isn't perfect. It never is, and it never will be.

The statistics on phishing click rates are well-documented. They don't move much regardless of training program quality. More relevant than the aggregate click rate is the distributional shape: across any sufficiently large organization, some percentage of employees will click any given phishing link. It doesn't matter how good the training is, how severe the consequences, or how obvious the signs. The tail of the distribution holds.
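A back-of-the-envelope calculation makes the tail concrete. The rates here are illustrative assumptions, not measured figures — but the shape of the result holds for any nonzero per-employee click rate:

```python
# Illustrative only: per-employee click rate is assumed, not measured.
# The point is the shape: at any nonzero rate, the probability that
# *someone* in a large organization clicks approaches certainty.

def p_at_least_one_click(per_employee_rate: float, employees: int) -> float:
    """Probability that at least one of `employees` clicks, assuming
    independent per-employee click probabilities."""
    return 1.0 - (1.0 - per_employee_rate) ** employees

# Assumed 3% per-employee click rate on a given campaign:
print(p_at_least_one_click(0.03, 50))    # ~0.78 for a 50-person team
print(p_at_least_one_click(0.03, 1000))  # effectively 1.0 at 1,000 employees
```

The independence assumption is generous to the defender — real campaigns are targeted, which only fattens the tail.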

This isn't a failure of human cognition. It's a feature of it. The same pattern-matching, context-filling behavior that makes humans efficient — the ability to read a truncated URL and infer the destination, to trust communications that look legitimate because legitimate communications usually look that way — is exactly what phishing exploits. You can't train out the cognitive architecture that makes humans functional.

Organizations that keep investing in "security awareness training" as a primary control are trying to compress the tail. They're not eliminating it. They're spending on training when they should be spending on architecture that treats the tail as given.

What Aviation Figured Out

Commercial aviation operates in one of the most hostile environments imaginable: complex machinery, lethal consequences of failure, regulatory pressure, high public visibility, and human operators who are tired, rushed, and working under cognitive load. It also has one of the strongest safety records of any high-stakes domain in history.

That record was not built by training pilots to never make mistakes. It was built by designing systems that expect mistakes and prevent them from becoming catastrophes.

Checklist protocols exist not because experienced pilots can't remember what to do. They exist because experience doesn't protect against cognitive errors under stress, and the research is unambiguous about it. Dual-control cockpits exist not because one pilot might be incompetent. They exist because one pilot might miss something a second pilot will catch. Automated warnings exist not because pilots aren't watching the instruments. They exist because humans don't watch everything simultaneously, and a gap in attention is normal — not exceptional.

The reliability comes from layered redundancy. From assuming that any given check will fail at some rate, and designing the system so that no single failure propagates into a catastrophic outcome.

Aviation didn't achieve its safety record by solving human error. It achieved it by rendering individual human error insufficient to cause a disaster on its own.

Error Is Predictable. Cascade Is Not Inevitable.

Here is the distinction that matters for security architecture: human error is predictable and largely unavoidable; cascading failure is a design decision.

When an employee clicks a malicious link, an error has occurred. That was expected. It was always going to happen with some probability. The only interesting question is what happens next.

In a well-designed system, a clicked link delivers a payload to an endpoint segmented from sensitive systems. The payload executes with user-level permissions in a constrained environment. Behavioral monitoring detects unusual activity before lateral movement is possible. The blast radius of the original error is contained by the architecture around it.

In a poorly designed system, a clicked link delivers a payload to an endpoint with access to internal file shares, authentication credentials, and service accounts. The payload inherits all the permissions of the logged-in user. There is no monitoring designed to detect behavior that looks like normal user activity but isn't. The original error cascades through every trust assumption the architecture made.

The employee made the same mistake in both scenarios. The difference isn't the human. It's the system's response to the human.
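The two scenarios differ only in what the architecture permits. A minimal sketch — segment and resource names are hypothetical — of how a segmentation policy, not the user, determines what a compromised endpoint can reach:

```python
# Sketch of segment-level reachability. All segment and resource names
# are hypothetical. The policy maps each network segment to the set of
# destinations it may initiate connections to.

SEGMENT_POLICY: dict[str, set[str]] = {
    # Contained design: user endpoints reach only what users need.
    "user-endpoints": {"web-proxy", "dns"},
    # Cascading design: a flat network where everything reaches everything.
    "flat-network": {"web-proxy", "dns", "file-shares",
                     "auth-services", "service-accounts"},
}

def reachable(segment: str, target: str, policy: dict[str, set[str]]) -> bool:
    """True if an attacker on `segment` can reach `target` directly."""
    return target in policy.get(segment, set())

# Same compromised endpoint, same user error, different architectures:
print(reachable("user-endpoints", "auth-services", SEGMENT_POLICY))  # False: contained
print(reachable("flat-network", "auth-services", SEGMENT_POLICY))    # True: cascades
```

Real segmentation lives in firewall rules and identity-aware proxies, not a Python dict — but the design question is exactly this lookup: given a compromised entry point, what is in the reachable set?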

Organizations that focus on why the employee clicked the link are asking the wrong question. Organizations that focus on why the system didn't contain the consequences are asking the right one.

The Reliability Engineering Lesson Security Won't Learn

In reliability engineering — the discipline concerned with system behavior under real-world conditions — there's a concept called fault isolation. The principle is straightforward: design systems so that when components fail (and they will), the failure doesn't propagate to adjacent components. Firewalls, blast doors, circuit breakers — physically and logically — enforce this principle.

Healthcare got here too. The Keystone Initiative in ICUs reduced central-line bloodstream infections by 66 percent — not by hiring better nurses, but by implementing a five-item checklist that interrupted the error cascade before it reached the patient. Surgical checklists don't exist because surgeons forget how to operate. They exist because the research is unambiguous: structured verification catches errors at a consistent rate that expertise alone doesn't.

Nuclear safety procedures operate on the same assumption: any given operator might make any given mistake. The design accounts for it. Multiple independent barriers, each capable of stopping what the previous one might pass.

Security architecture has parallel tools: network segmentation, least-privilege access, endpoint isolation, anomaly detection, offline backups, hardware-backed key management. The difference is that in security, these tools are often described as "defense in depth" — incremental additions to a base model — rather than core requirements for any system that expects humans to interact with it.

The framing matters. "Defense in depth" implies that each layer is extra. In reliability engineering, layered redundancy isn't extra. It's the design.

The Blame Cycle and What It Costs

There's a structural reason organizations keep returning to human blame: it's cheaper in the short term than architectural remediation.

Retrain the employee. Update the acceptable use policy. Run a phishing simulation to reinforce behavior. This closes the ticket. It demonstrates activity. It produces a documentation trail. None of it changes the architecture that made the incident possible.

Architectural change is expensive, slow, and disruptive. Changing network segmentation affects operational workflows. Implementing least-privilege access requires auditing every permissions assignment in an environment that grew organically over years. Deploying behavioral monitoring requires someone to maintain it and respond to it.

Training is cheaper. The problem is that cheaper-in-the-short-run compounds. Every incident blamed on a human is an incident whose architectural lesson wasn't learned. The next one is just as likely.

Aviation didn't get cheap. The regulatory and operational investment in systemic safety is substantial. But the cost of that investment is lower than the alternative — and more importantly, the industry measured the alternative, accepted the comparison, and built accordingly.

Security doesn't often measure the alternative that clearly. Organizations rarely account for the full cost of repeated incidents: the regulatory exposure, the operational disruption, the downstream liability, the trust damage with customers and partners. If they did, the calculus on architectural investment would change.

What Systems That Absorb Failure Look Like

The practical implication of accepting human error as a constant isn't resignation. It's a different set of design questions.

Instead of: How do we prevent employees from clicking phishing links? Ask: If an employee clicks a phishing link, how quickly does the blast radius stabilize?

Instead of: How do we ensure credentials are never reused? Ask: If a credential is compromised, what can an attacker actually reach from that entry point?

Instead of: How do we detect insider threats through behavioral training? Ask: Does our architecture allow an insider — or someone with insider credentials — to exfiltrate meaningful data without triggering detection?

These questions are architectural. They don't have answers in the training catalog. They have answers in the segmentation model, the permissions structure, the monitoring coverage, and the response capability.

Systems that absorb failure share a common property: they don't rely on any single component — human or technical — to perform perfectly. They assume each component will fail at some rate, and they layer responses so that individual failure doesn't reach the critical path.
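That property reduces to arithmetic. Assuming, for illustration, independent layers that each stop an error at some rate, the probability that an error reaches the critical path shrinks multiplicatively:

```python
# Illustrative arithmetic: per-layer catch rates are assumed, and real
# layers are rarely fully independent — treat these as upper bounds on
# what layering buys, not measurements.

def cascade_probability(catch_rates: list[float]) -> float:
    """Probability an error slips past every layer, assuming independence."""
    p = 1.0
    for catch in catch_rates:
        p *= (1.0 - catch)  # the error must evade each layer in turn
    return p

# One 90%-effective control: ~10% of errors reach the critical path.
print(cascade_probability([0.90]))              # ≈ 0.1
# Three independent 90% layers: ~0.1%.
print(cascade_probability([0.90, 0.90, 0.90]))  # ≈ 0.001
```

No single layer needs to be perfect — and when layers share a failure mode (the same credential, the same admin plane), the independence assumption breaks and the product stops shrinking. That is why correlated failure modes are the thing to hunt for in an architecture review.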

The Leadership Decision Inside the Design

There's a version of this that lives entirely in the technical domain: segmentation models, permission scopes, detection systems. That version matters.

But there's a version that lives in leadership. Every architecture reflects a set of tradeoffs that leadership made — often implicitly — about speed, cost, and acceptable risk. A flat network with broad permissions is fast to operate and cheap to maintain. It's also an environment where a single compromised endpoint can become a catastrophe.

That tradeoff didn't happen by accident. It happened because someone decided operational velocity was worth the risk exposure — or because no one asked the question explicitly.

Designing systems that absorb human failure requires leaders who are willing to make those tradeoffs consciously. Not "we accept the risk" as a disclaimer appended to a decision, but: we understand that our architecture relies on consistent human behavior; we understand that behavior is probabilistically bounded; and we've decided that probability is acceptable given our constraints.

Most organizations haven't had that conversation. They've deployed controls, trained employees, and hoped the distribution would be kinder to them than to whoever got breached last quarter.

The ones who build something more durable are the ones who stop treating human error as an anomaly to be corrected and start treating it as a design parameter to be accommodated.

Human error isn't the problem.

Building systems that can't survive it is.

Tags: cybersecurity, architecture, systems-thinking, reliability