Designing Networks That Fail Predictably

Why the real goal isn’t preventing network failure, it’s choosing which controls you’re willing to lose first.

Every network fails eventually.

Sometimes it’s physical. Sometimes it’s logical. Often it’s self‑inflicted: a containment rule pushed with the best intentions that quietly turns into a kill switch.

The difference between a survivable incident and a chaotic one isn’t whether the network failed. It’s whether the failure behaved the way you expected it to.

Predictable failure isn’t about being pessimistic. It’s about staying in control when parts of the system are no longer trustworthy.

The Mental Model

Common assumption:
“If we design enough resilience and security into the network, failure becomes unlikely and manageable when it happens.”

Why it breaks:
Resilience delays failure; it doesn’t define its shape.

As Azure networks grow, they accumulate invisible dependencies: centralised inspection, forced tunnelling, identity‑driven policy, platform service reachability. Under stress, those dependencies don’t fail cleanly. They fail asymmetrically.

Some flows work. Others silently die. And the controls meant to protect you become the fastest way to lose visibility and access.

Predictable failure starts by accepting that not all controls deserve to survive an incident.

How It Really Works

In real incidents, containment actions usually remove connectivity faster than they restore clarity.

The most common pattern looks like this:

  1. Suspicious activity is detected
  2. Egress is tightly constrained or forced through central inspection
  3. A dependency breaks: DNS, identity, logging, or update paths
  4. Defender access degrades alongside attacker access
  5. The network is now “secure” but unmanaged

Azure doesn’t distinguish between defensive isolation and self‑denial. The platform will happily enforce whatever policy you give it, even if that policy collapses your ability to respond.

Designing for predictable failure means deciding which capabilities must outlive containment.

Real‑World Impact

This changes how you design, deploy, and operate networks in a very practical way:

  • You stop treating centralised egress inspection as an invariant
  • You explicitly prioritise defender access and telemetry over uniform traffic control
  • You design containment to reduce attacker capability first, not remove all connectivity
  • You accept that some security controls are conditional, not absolute

The key shift: losing inspection is survivable; losing control is not.

Designing for Controlled Degradation

1. Fail Closed at Trust Boundaries, Not at Egress

Under stress, the first thing you should be willing to lose is cross‑trust communication, not operator reachability.

That means:

  • East‑west traffic between trust zones fails before north‑south management access
  • Isolation targets workloads, not the paths used to observe and recover them
  • Containment reduces blast radius without collapsing the control plane

Centralised egress inspection is valuable, until it becomes the narrowest choke point in the system.

2. Intentionally Relax Centralised Egress Inspection During Containment

This is the uncomfortable trade‑off.

In a containment scenario, it is often safer to temporarily allow direct workload egress than to enforce forced tunnelling through a brittle inspection path.

Why?

  • Forced egress concentrates failure into a single dependency
  • When that dependency degrades, logging, updates, and identity often follow
  • Defender access and telemetry are frequently collateral damage

This is not an argument against inspection. It’s an argument against treating it as non‑negotiable during failure.

Predictable failure means being explicit about this ordering:

I am willing to lose uniform egress inspection before I am willing to lose defender access, telemetry, or recovery paths.
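That ordering can be written down as configuration rather than left to memory. The following is a minimal sketch, assuming a route table attached to the workload subnets and a central firewall or NVA at a known private IP. The parameter names, the route table name, and the IP address are hypothetical; nothing here comes from a specific deployment.

```bicep
// Sketch only: names, the default IP, and the switch itself are illustrative assumptions.
@description('Hypothetical switch: set to true during containment to bypass the central inspection hop.')
param relaxEgressInspection bool = false

@description('Assumed private IP of the central firewall or NVA.')
param inspectionNextHopIp string = '10.0.0.4'

resource workloadRoutes 'Microsoft.Network/routeTables@2023-11-01' = {
  name: 'rt-workload-zone'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'default-egress'
        properties: {
          addressPrefix: '0.0.0.0/0'
          // Normal operation: force egress through inspection.
          // Containment: send the default route straight to the Internet next hop.
          nextHopType: relaxEgressInspection ? 'Internet' : 'VirtualAppliance'
          // A next-hop IP only applies to the VirtualAppliance hop type.
          nextHopIpAddress: relaxEgressInspection ? null : inspectionNextHopIp
        }
      }
    ]
  }
}
```

With the switch left at its default, egress stays forced through inspection. Flipping it is a deliberate, reviewable change, and it should be reverted once the inspection path is healthy again. Direct egress also assumes the workloads have some outbound path of their own, such as a NAT gateway.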

3. Preserve Defender Control Paths at All Costs

If a containment action can sever:

  • Bastion or jump host access
  • Break‑glass identities
  • Log and alert egress
  • Platform management endpoints

…then it’s not a security control, it’s a denial‑of‑service against your own responders.

Defender paths should be architecturally boring:

  • Minimal dependencies
  • Few enforcement layers
  • Hard to accidentally include in broad deny rules

If these paths disappear during containment, the network has failed chaotically, regardless of how “secure” it looks.
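One way to keep defender paths boring is to give them a reserved priority band that containment rules never use. The sketch below assumes a Bastion or jump host subnet at a made-up prefix and uses the `AzureMonitor` service tag for telemetry egress. The NSG name, the parameters, and the priority convention (defender rules at 100, containment rules at 1000 and above) are assumptions for illustration.

```bicep
// Sketch only: prefixes, names, and priority bands are assumptions.
@description('Assumed address prefix of the Bastion or jump host subnet.')
param bastionSubnetPrefix string = '10.0.1.0/26'

@description('Hypothetical deny rules pushed during an incident. By convention they use priorities 1000 and above.')
param containmentRules array = []

// Defender rules take the lowest priority numbers, so they are evaluated first
// as long as containment rules keep to the 1000+ band.
var defenderRules = [
  {
    name: 'Allow-Bastion-Inbound'
    properties: {
      priority: 100
      direction: 'Inbound'
      access: 'Allow'
      protocol: 'Tcp'
      sourceAddressPrefix: bastionSubnetPrefix
      destinationAddressPrefix: '*'
      sourcePortRange: '*'
      destinationPortRanges: [
        '22'
        '3389'
      ]
    }
  }
  {
    name: 'Allow-Telemetry-Outbound'
    properties: {
      priority: 100
      direction: 'Outbound'
      access: 'Allow'
      protocol: 'Tcp'
      sourceAddressPrefix: '*'
      destinationAddressPrefix: 'AzureMonitor' // Azure service tag
      sourcePortRange: '*'
      destinationPortRange: '443'
    }
  }
]

resource defenderAwareNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-workload-zone-sketch'
  location: resourceGroup().location
  properties: {
    securityRules: concat(defenderRules, containmentRules)
  }
}
```

The NSG does not enforce the priority convention by itself. It only works if whoever writes containment rules keeps them out of the defender band, which is worth checking in review or policy.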

Implementation Example: Encoding Failure Order with NSGs

This example is intentionally simple. It’s not a pattern to copy wholesale, it’s a way to encode failure intent.

resource workloadNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-workload-zone'
  location: resourceGroup().location
  properties: {
    securityRules: [
      // Allow trusted intra-zone traffic
      {
        name: 'Allow-IntraZone'
        properties: {
          priority: 100
          direction: 'Inbound'
          access: 'Allow'
          protocol: '*'
          sourceAddressPrefix: '10.20.0.0/16'
          destinationAddressPrefix: '*'
          sourcePortRange: '*'
          destinationPortRange: '*'
        }
      }
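      // Default deny for cross-zone traffic. The VirtualNetwork tag also covers peered
      // networks, so management sources need an earlier Allow rule to survive this.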
      {
        name: 'Deny-CrossZone'
        properties: {
          priority: 4096
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: 'VirtualNetwork'
          destinationAddressPrefix: '*'
          sourcePortRange: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}

What this demonstrates:

  • Failure occurs first at trust boundaries
  • Intra‑zone operation continues
  • Containment is explicit and reversible
  • The network degrades along known seams

NSGs alone won’t save you, but encoding failure order in policy forces architectural clarity.
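To make the "explicit and reversible" property concrete, the deny rule can sit behind a deployment parameter. This is a hedged variation on the example above, not a recommended pattern: the `containmentActive` parameter and the variable names are hypothetical, and the rules themselves are the same two as before.

```bicep
// Sketch only: containmentActive and the variable names are illustrative assumptions.
@description('Hypothetical switch: true while containment is in force.')
param containmentActive bool = false

var baseRules = [
  {
    name: 'Allow-IntraZone'
    properties: {
      priority: 100
      direction: 'Inbound'
      access: 'Allow'
      protocol: '*'
      sourceAddressPrefix: '10.20.0.0/16'
      destinationAddressPrefix: '*'
      sourcePortRange: '*'
      destinationPortRange: '*'
    }
  }
]

var containmentRules = [
  {
    name: 'Deny-CrossZone'
    properties: {
      priority: 4096
      direction: 'Inbound'
      access: 'Deny'
      protocol: '*'
      sourceAddressPrefix: 'VirtualNetwork'
      destinationAddressPrefix: '*'
      sourcePortRange: '*'
      destinationPortRange: '*'
    }
  }
]

resource workloadNsgToggle 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-workload-zone'
  location: resourceGroup().location
  properties: {
    // The deny rule is only present while containment is active.
    securityRules: containmentActive ? concat(baseRules, containmentRules) : baseRules
  }
}
```

With `containmentActive` set to false, cross-zone traffic falls through to the NSG's default rules, which allow traffic within the virtual network. Redeploying with the parameter set to true applies the deny, and setting it back to false removes it again.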

Visualising Predictable Failure

flowchart LR
  subgraph Management
    Admins
    Bastion
  end
  subgraph ZoneA["Workload Zone A"]
    AppA
  end
  subgraph ZoneB["Workload Zone B"]
    AppB
  end
  Admins --> Bastion --> ZoneA
  Admins --> Bastion --> ZoneB
  ZoneA -.isolated first.-x ZoneB

When containment is applied:

  • Workload‑to‑workload trust is severed
  • Management access remains intact
  • Egress inspection may be relaxed to preserve visibility

That’s not weakness, that’s control.

Gotchas & Edge Cases

  • Relaxing egress inspection does increase short‑term risk; that risk must be bounded and time‑limited
  • Attackers may attempt to exploit relaxed paths, but defender lockout is usually worse
  • Platform dependencies (DNS, identity, logging) often bypass intended failure order
  • “Temporary” containment rules have a habit of becoming permanent scars

Predictable failure requires ongoing review, not just good intent.

Best Practices

  • Decide and document which control fails first
  • Treat centralised inspection as conditional under stress
  • Keep defender access paths simple and isolated
  • Avoid deep dependency chains in security enforcement
  • Assume containment will be executed by tired humans

🍺 Brewed Insight: If your network only behaves safely when every control is intact, it isn’t resilient, it’s fragile. Predictable failure means choosing control over coverage when it matters most.
