Designing Networks That Fail Predictably

Every network fails eventually.

Sometimes it’s physical. Sometimes it’s logical. Often it’s self‑inflicted: a containment rule pushed with the best intentions that quietly turns into a kill switch.

The difference between a survivable incident and a chaotic one isn’t whether the network failed. It’s whether the failure behaved the way you expected it to.

Predictable failure isn’t about being pessimistic. It’s about staying in control when parts of the system are no longer trustworthy.

The Mental Model

Common assumption:
“If we design enough resilience and security into the network, failure becomes unlikely and manageable when it happens.”

Why it breaks:
Resilience delays failure; it doesn’t define its shape.

As Azure networks grow, they accumulate invisible dependencies: centralised inspection, forced tunnelling, identity‑driven policy, platform service reachability. Under stress, those dependencies don’t fail cleanly. They fail asymmetrically.

Some flows work. Others silently die. And the controls meant to protect you become the fastest way to lose visibility and access.

Predictable failure starts by accepting that not all controls deserve to survive an incident.

How It Really Works

In real incidents, containment actions usually remove connectivity faster than they restore clarity.

The most common pattern looks like this:

Suspicious activity is detected
Egress is tightly constrained or forced through central inspection
A dependency breaks: DNS, identity, logging, update paths
Defender access degrades alongside attacker access
The network is now “secure” but unmanaged

Azure doesn’t distinguish between defensive isolation and self‑denial. The platform will happily enforce whatever policy you give it, even if that policy collapses your ability to respond.

Designing for predictable failure means deciding which capabilities must outlive containment.

Real‑World Impact

This changes how you design, deploy, and operate networks in a very practical way:

You stop treating centralised egress inspection as an invariant
You explicitly prioritise defender access and telemetry over uniform traffic control
You design containment to reduce attacker capability first, not remove all connectivity
You accept that some security controls are conditional, not absolute

The key shift: losing inspection is survivable; losing control is not.

Designing for Controlled Degradation

1. Fail Closed at Trust Boundaries, Not at Egress

Under stress, the first thing you should be willing to lose is cross‑trust communication, not operator reachability.

That means:

East‑west traffic between trust zones fails before north‑south management access
Isolation targets workloads, not the paths used to observe and recover them
Containment reduces blast radius without collapsing the control plane

Centralised egress inspection is valuable, until it becomes the narrowest choke point in the system.

2. Intentionally Relax Centralised Egress Inspection During Containment

This is the uncomfortable trade‑off.

In a containment scenario, it is often safer to temporarily allow direct workload egress than to keep forcing traffic through a brittle inspection path.

Why?

Forced egress concentrates failure into a single dependency
When that dependency degrades, logging, updates, and identity often follow
Defender access and telemetry are frequently collateral damage

This is not an argument against inspection. It’s an argument against treating it as non‑negotiable during failure.

Predictable failure means being explicit about this ordering:

I am willing to lose uniform egress inspection before I am willing to lose defender access, telemetry, or recovery paths.
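One way to make that ordering concrete is to put the egress path behind a single switch. The sketch below is a minimal illustration, not a recommended design: the parameter names, the route table name, and the 10.0.0.4 inspection address are all assumptions made up for this example. It also assumes the workload subnet still has some form of outbound connectivity once the forced route is removed.

```bicep
// Sketch only: names and addresses below are hypothetical.
param location string = resourceGroup().location

@description('Set to true during containment to bypass the central inspection hop.')
param relaxEgressInspection bool = false

@description('Private IP of the central firewall or NVA (assumed value).')
param inspectionNextHopIp string = '10.0.0.4'

// Normal operation: the default route is forced through central inspection.
var inspectedDefaultRoute = {
  name: 'default-via-inspection'
  properties: {
    addressPrefix: '0.0.0.0/0'
    nextHopType: 'VirtualAppliance'
    nextHopIpAddress: inspectionNextHopIp
  }
}

// Containment: the default route goes direct, removing the inspection dependency.
var directDefaultRoute = {
  name: 'default-direct'
  properties: {
    addressPrefix: '0.0.0.0/0'
    nextHopType: 'Internet'
  }
}

resource workloadRoutes 'Microsoft.Network/routeTables@2023-11-01' = {
  name: 'rt-workload-zone'
  location: location
  properties: {
    routes: [
      relaxEgressInspection ? directDefaultRoute : inspectedDefaultRoute
    ]
  }
}
```

The point of the toggle is that the relaxed state is a decision made in advance and written down, not something improvised mid‑incident. Whether direct egress is acceptable for a given workload is still a judgment call.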

3. Preserve Defender Control Paths at All Costs

If a containment action can sever:

Bastion or jump host access
Break‑glass identities
Log and alert egress
Platform management endpoints

…then it’s not a security control; it’s a denial‑of‑service against your own responders.

Defender paths should be architecturally boring:

Minimal dependencies
Few enforcement layers
Hard to accidentally include in broad deny rules

If these paths disappear during containment, the network has failed chaotically, regardless of how “secure” it looks.
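As a rough illustration of "architecturally boring", the sketch below gives defender paths their own low‑numbered rules, so a broad deny added later at a higher priority number cannot shadow them. The Bastion subnet prefix, the rule names, and the NSG name are assumptions for this example. In practice these rules would live in whichever NSG is attached to the workload subnet.

```bicep
// Sketch only: the Bastion subnet prefix and all names are hypothetical.
param location string = resourceGroup().location

@description('Address prefix of the subnet hosting Bastion or jump hosts (assumed value).')
param bastionSubnetPrefix string = '10.0.1.0/26'

resource defenderPathsNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-defender-paths'
  location: location
  properties: {
    securityRules: [
      // Operator access: evaluated before any containment deny with a higher priority number.
      {
        name: 'Allow-Bastion-Inbound'
        properties: {
          priority: 100
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: bastionSubnetPrefix
          destinationAddressPrefix: '*'
          sourcePortRange: '*'
          destinationPortRanges: [
            '22'
            '3389'
          ]
        }
      }
      // Telemetry egress: uses the AzureMonitor service tag so log shipping does not depend on the inspection path.
      {
        name: 'Allow-Telemetry-Outbound'
        properties: {
          priority: 110
          direction: 'Outbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: '*'
          destinationAddressPrefix: 'AzureMonitor'
          sourcePortRange: '*'
          destinationPortRange: '443'
        }
      }
    ]
  }
}
```

NSG rules are evaluated in priority order and the first match wins, so reserving the lowest numbers for defender paths is a cheap way to make them hard to break by accident.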

Implementation Example: Encoding Failure Order with NSGs

This example is intentionally simple. It’s not a pattern to copy wholesale; it’s a way to encode failure intent.

resource workloadNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-workload-zone'
  location: resourceGroup().location
  properties: {
    securityRules: [
      // Allow trusted intra-zone traffic
      {
        name: 'Allow-IntraZone'
        properties: {
          priority: 100
          direction: 'Inbound'
          access: 'Allow'
          protocol: '*'
          sourceAddressPrefix: '10.20.0.0/16'
          destinationAddressPrefix: '*'
          sourcePortRange: '*'
          destinationPortRange: '*'
        }
      }
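      // Deny all other inbound traffic from the virtual network (the tag also covers peered networks).
      // Note: this matches management sources too, unless an earlier Allow rule covers them.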
      {
        name: 'Deny-CrossZone'
        properties: {
          priority: 4096
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: 'VirtualNetwork'
          destinationAddressPrefix: '*'
          sourcePortRange: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}

What this demonstrates:

Failure occurs first at trust boundaries
Intra‑zone operation continues
Containment is explicit and reversible
The network degrades along known seams

NSGs alone won’t save you, but encoding failure order in policy forces architectural clarity.
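To show what "explicit and reversible" might look like, here is a hedged variation in which containment is a deployment parameter, not a hand‑edited rule. The `containmentActive` and `peerZonePrefix` parameters, the 10.30.0.0/16 prefix, and the `Contain-Deny-PeerZone` rule name are assumptions for this sketch. The intra‑zone prefix is the one used in the example above.

```bicep
// Sketch only: parameter names, the peer zone prefix, and the containment rule are hypothetical.
param location string = resourceGroup().location

@description('Flip to true to apply containment; redeploy with false to lift it.')
param containmentActive bool = false

@description('Address space of the neighbouring trust zone (assumed value).')
param peerZonePrefix string = '10.30.0.0/16'

// Rules that apply in every state.
var baselineRules = [
  {
    name: 'Allow-IntraZone'
    properties: {
      priority: 100
      direction: 'Inbound'
      access: 'Allow'
      protocol: '*'
      sourceAddressPrefix: '10.20.0.0/16'
      destinationAddressPrefix: '*'
      sourcePortRange: '*'
      destinationPortRange: '*'
    }
  }
]

// Rules added only while containment is active: cut the trust boundary, nothing else.
var containmentRules = [
  {
    name: 'Contain-Deny-PeerZone'
    properties: {
      priority: 200
      direction: 'Inbound'
      access: 'Deny'
      protocol: '*'
      sourceAddressPrefix: peerZonePrefix
      destinationAddressPrefix: '*'
      sourcePortRange: '*'
      destinationPortRange: '*'
    }
  }
]

resource workloadNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-workload-zone'
  location: location
  properties: {
    securityRules: concat(baselineRules, containmentActive ? containmentRules : [])
  }
}
```

Because the rule set is declared inline, redeploying with `containmentActive` set back to `false` removes the containment rule. The rollback path is the same deployment that applied it.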

Visualising Predictable Failure

flowchart LR
  subgraph Management
    Admins
    Bastion
  end
  subgraph ZoneA["Workload Zone A"]
    AppA
  end
  subgraph ZoneB["Workload Zone B"]
    AppB
  end
  Admins --> Bastion --> ZoneA
  Admins --> Bastion --> ZoneB
  ZoneA -.isolated first.-x ZoneB

When containment is applied:

Workload‑to‑workload trust is severed
Management access remains intact
Egress inspection may be relaxed to preserve visibility

That’s not weakness, that’s control.

Gotchas & Edge Cases

Relaxing egress inspection does increase short‑term risk; that risk must be bounded and time‑limited
Attackers may attempt to exploit relaxed paths, but defender lockout is usually worse
Platform dependencies (DNS, identity, logging) often bypass intended failure order
“Temporary” containment rules have a habit of becoming permanent scars

Predictable failure requires ongoing review, not just good intent.
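One small aid to that review is to stamp containment resources with a review deadline when they are deployed. The sketch below is an assumption‑heavy illustration: the tag names, the four‑hour window, and the NSG name are invented for this example. Tags enforce nothing on their own, but they give a scheduled review or policy check something to query.

```bicep
// Sketch only: tag names, the window length, and the NSG name are hypothetical.
param location string = resourceGroup().location

// utcNow() is only permitted as a parameter default value in Bicep.
param deployedAt string = utcNow('u')

@description('How long containment may stay in place before review (ISO 8601 duration).')
param containmentWindow string = 'PT4H'

resource containmentNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-workload-zone-containment'
  location: location
  tags: {
    purpose: 'containment'
    appliedAt: deployedAt
    // Deadline by which someone must either lift or consciously renew the rules.
    reviewBy: dateTimeAdd(deployedAt, containmentWindow)
  }
  properties: {
    // Containment rules omitted to keep the sketch short.
    securityRules: []
  }
}
```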

Best Practices

Decide and document which control fails first
Treat centralised inspection as conditional under stress
Keep defender access paths simple and isolated
Avoid deep dependency chains in security enforcement
Assume containment will be executed by tired humans

🍺

Brewed Insight: If your network only behaves safely when every control is intact, it isn’t resilient, it’s fragile. Predictable failure means choosing control over coverage when it matters most.