Designing Systems That Fail Well

By Domenic DiNatale

Every system fails eventually. The ones designed to fail well are the ones you survive.

This is the thread that runs through every post in this series, from blast radius to ransomware to compliance to incident response. Security isn't a state you achieve and maintain — it's a property of how your systems behave when things go wrong. The most important security question isn't "how do we prevent failure?" It's "when we fail, what happens next?"

That question reframes everything. Prevention remains important, but it's no longer the only goal: containment and recovery carry equal weight. Controls are evaluated not just on whether they block attacks but on whether they limit damage and enable recovery when blocking fails. Architecture is designed with an adversary in mind, not just a design brief.

This is what resilience actually means in security: not robustness, which is the ability to withstand stress, but the ability to absorb disruption, adapt, and recover. Robust systems try not to break. Resilient systems are designed to break well.

What Failing Well Requires

Failing well requires three things: detection, containment, and recovery. Each is an architectural property, not a tool you can purchase and deploy. Each needs to be built in, not bolted on.

Detection is the ability to know when something has gone wrong, quickly enough to matter. This sounds obvious — of course you want to detect problems. But detection at the level that makes a real difference requires observable systems: comprehensive logging, centralized telemetry, well-tuned alerting, and human operators who understand the normal baseline well enough to recognize deviations. Most organizations have logs. Far fewer have detection systems that are actually calibrated to catch adversarial behavior early in the attack chain, before the blast radius becomes catastrophic.

The detection gap is where dwell time comes from. Attackers who are inside an environment for weeks or months aren't invisible — they're generating signals. Those signals aren't being recognized because the detection systems aren't tuned to recognize them, or the alert volume is high enough that genuine signals get buried, or nobody is looking. Detection failure is usually an operational failure layered on a design failure: the signals were there, but the system wasn't built to surface them and nobody was watching.
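
To make "understanding the normal baseline" concrete, here is a minimal Python sketch of one simple approach: compare new observations against a historical baseline and flag the ones that sit far outside it. The function name `flag_deviations`, the sample counts, and the three-standard-deviation threshold are all illustrative assumptions, not a recommended detection rule. Real detection pipelines draw on far richer signals than a single counter.

```python
from statistics import mean, stdev

def flag_deviations(baseline, observed, threshold=3.0):
    """Return the labels of observations far above the historical baseline.

    baseline:  list of past counts (hypothetical: failed logins per hour)
    observed:  dict mapping a label (e.g. an hour) to a new count
    threshold: how many standard deviations above the mean counts as anomalous
    """
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs at least 2 points
    flagged = []
    for label, value in observed.items():
        if sigma == 0:
            # Perfectly flat baseline: treat any increase as a deviation.
            is_anomalous = value > mu
        else:
            is_anomalous = (value - mu) / sigma >= threshold
        if is_anomalous:
            flagged.append(label)
    return flagged

# Hypothetical data: a quiet week of hourly failed-login counts, then three new hours.
baseline_counts = [4, 5, 6, 5, 4, 6, 5, 5]
new_counts = {"02:00": 5, "03:00": 6, "04:00": 40}
print(flag_deviations(baseline_counts, new_counts))  # ['04:00']
```

The arithmetic is the easy part. The sketch only works if someone has recorded a trustworthy baseline and someone acts on what gets flagged, which is exactly the operational gap described above.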

Containment is the architectural property that limits how far damage propagates once something goes wrong. This is blast radius, discussed at the start of this series. Segmented networks. Least-privilege access. Short-lived credentials. Service accounts that can only reach what they need. Backup systems that are isolated from production. Each of these limits the consequences of a single point of compromise. Each of them requires making design decisions with failure in mind — deciding what a compromised component should be able to reach, not just what it needs to reach to function normally.
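
One rough way to reason about blast radius is to model "what can reach what" as a directed graph and ask what an attacker could touch from a single compromised component. The Python sketch below does that with a breadth-first search. The component names and both access maps are hypothetical, and the model assumes that compromising a component lets the attacker use every connection that component has.

```python
from collections import deque

def blast_radius(access, start):
    """Return every component reachable from `start`, excluding `start` itself.

    access: dict mapping a component to the components it is allowed to reach
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in access.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}

# Hypothetical flat network: everything can reach the backup system.
flat = {
    "web": ["api", "db", "backup"],
    "api": ["db", "backup"],
    "db": ["backup"],
}

# Hypothetical segmented design: backups are not reachable from production.
segmented = {
    "web": ["api"],
    "api": ["db"],
    "db": [],
}

print(sorted(blast_radius(flat, "web")))       # ['api', 'backup', 'db']
print(sorted(blast_radius(segmented, "web")))  # ['api', 'db']
```

In the segmented map a compromised web tier can still pivot through the API to the database, but the backups stay out of reach. That is the kind of decision about what a compromised component should be able to reach that the paragraph above describes.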

Recovery is the ability to return to a known-good state after a failure, and to do so quickly enough that the disruption is bounded. This requires two architecturally distinct things: clean recovery artifacts (backups that are verified, that are isolated, and that can actually restore systems to a functional state) and infrastructure that can be rebuilt from a known state rather than requiring forensic analysis of potentially compromised systems. Immutable infrastructure, meaning systems defined as code and deployable from a clean source of truth, is a recovery enabler that most organizations underinvest in.
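
"Verified" can be made testable. The Python sketch below checks a restored directory against a manifest of expected SHA-256 hashes and reports anything missing or altered. The function names, the manifest format (relative path mapped to hex digest), and the idea that the manifest is stored separately from the backups themselves are assumptions made for illustration. Real backup tooling usually ships its own verification mechanism.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(restore_dir, manifest):
    """Compare restored files against a manifest of expected hashes.

    manifest: dict mapping a relative path to its expected SHA-256 hex digest
    Returns a list of (relative_path, problem) tuples; empty means the restore matches.
    """
    problems = []
    for rel_path, expected in manifest.items():
        target = Path(restore_dir) / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif sha256_of(target) != expected:
            problems.append((rel_path, "hash mismatch"))
    return problems
```

Running a check like this on a schedule, against a real test restore rather than the backup catalog, is what separates "we have backups" from "we can recover."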

The Shift in Design Culture

The technical requirements for failing well are knowable. They're documented in resilience engineering literature, in security frameworks, in architecture guides. The harder problem is cultural: most systems aren't designed to fail well because teams don't treat failure as a design requirement on the same level as functionality.

Features have requirements. Performance has targets. Scalability has design criteria. Failure modes are often addressed only reactively — after an outage, after a breach, after a near-miss. The postmortem happens, the immediate cause is fixed, and the system continues without the structural changes that would make the failure mode less likely or less damaging in the future.

This is partly an incentive problem. Designing for failure is invisible when it works — nobody knows you successfully contained a potential breach because the network was segmented and the attacker couldn't move laterally. The segmentation shows up as operational overhead and configuration complexity, not as a visible security win. The wins from failure-mode design are counterfactual, and counterfactuals are hard to sell.

The shift requires treating resilience as a first-class design requirement — something that gets articulated as a design goal, evaluated during architecture review, and measured over time. "This system should be able to contain a compromise to one service" is a design requirement. "The system should be recoverable from backup within four hours" is a design requirement. "Detection should surface an active attacker within 24 hours of initial compromise" is a design requirement. Stating these as requirements creates accountability for building systems that meet them.
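
Requirements like these become easier to hold a team to once they are written down in a form that can be checked. As a sketch, the three example requirements above could be encoded as thresholds and compared against measurements from a drill or exercise. The class, the field names, and the measurement keys below are hypothetical. The limits of one service, four hours, and 24 hours come from the examples above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResilienceRequirement:
    name: str      # key expected in the measurements dict (hypothetical naming)
    limit: float   # maximum acceptable value
    unit: str      # for human-readable reporting only

REQUIREMENTS = [
    ResilienceRequirement("services_affected", 1, "services"),
    ResilienceRequirement("restore_hours", 4, "hours"),
    ResilienceRequirement("detect_hours", 24, "hours"),
]

def evaluate(measurements):
    """Return {requirement name: True/False} for one drill's measurements.

    A requirement with no measurement counts as not met: an untested
    property should not be reported as a pass.
    """
    results = {}
    for req in REQUIREMENTS:
        measured = measurements.get(req.name)
        results[req.name] = measured is not None and measured <= req.limit
    return results

# Hypothetical drill results: contained and detected in time, but restore was slow.
drill = {"services_affected": 1, "restore_hours": 6.5, "detect_hours": 20}
print(evaluate(drill))
# {'services_affected': True, 'restore_hours': False, 'detect_hours': True}
```

The code is trivial. The useful part is that each requirement has a number attached and a place where a missed target shows up.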

Closing the Loop

The series that this post closes started with a simple observation: security failures are usually architectural problems, not surface-level ones. Authentication failures. Lateral movement. Blast radius. Dependency compromise. Incident response gaps. Maturity theater. Each of these traces back to architectural decisions — or the absence of deliberate architectural thinking.

The implication is that security is not primarily a tooling problem or a process problem. It's a design problem. Systems need to be designed with adversarial conditions in mind. They need to be designed to limit damage when controls fail. They need to be designed to surface failures quickly, contain them structurally, and recover from them cleanly.

This doesn't require perfect foresight or unlimited engineering resources. It requires consistently asking the question that most design processes skip: when this fails, what happens? Not if it fails — when. Every system fails eventually. The question is whether yours fails in a way that you can detect, contain, and recover from — or in a way that you can't.

Designing to fail well is the most underrated security investment you can make. It's invisible when it works. When prevention fails, it's the difference between an incident and a catastrophe.

Build systems that fail well. Everything else in security gets easier.

This post is part of a series on security as an architectural problem. Read the full series on the Intellitech blog.
