Networks That Fail Safely Under Breach Conditions

Designing Azure networks that degrade predictably, not catastrophically, when an attacker is already inside.

Most Azure networks are designed like espresso machines: finely tuned, pressure‑balanced, and very unhappy when you pull the wrong lever.

Under steady state, that precision feels like quality engineering.
Under breach conditions, it often turns out to be fragility.

Assume breach changes the question from “How do I keep them out?” to a much less comfortable one:

“What happens to my network when someone is already in and I start breaking things on purpose?”

The Mental Model

Common assumption:
If a network is secure enough to prevent compromise, it will behave sensibly once compromised.

Why it breaks:
Most Azure networks are optimised for normal operations:

  • Shared route tables for consistency
  • Centralised inspection paths
  • Broad east‑west reachability “inside the VNet”
  • Security controls designed to be added, not activated

These choices feel clean and efficient, right up until containment is required.
At that point, the network stops being infrastructure and becomes a live incident tool.

Networks that were never designed to fail don’t fail gracefully. They fail globally.

How It Really Works

Once breach is assumed, the network’s job changes.

It is no longer primarily a preventative control.
It is an active constraint system that must support human decision‑making under stress.

Fail‑safe networks behave differently in four key ways:

  • Localised trust – Compromise in one subnet does not imply reachability elsewhere.
  • Predictable degradation – Containment actions reduce capability in known, bounded ways.
  • Asymmetric pain – Attackers lose freedom faster than defenders lose visibility.
  • Defender‑first control – Emergency actions don’t depend on shared components that attackers may already influence.

The absence of these traits is what makes a network brittle, regardless of how “secure” it looks on a diagram.

Real‑World Impact

This reframes several everyday Azure design decisions.

Shared Route Tables Are Actively Unsafe Under Breach

This is the strongest claim in this post, and it’s deliberate.

A single shared UDR (User‑Defined Route) table per VNet or app tier is not just sub‑optimal; it is dangerous under assume‑breach conditions.

Here’s what actually happens during an incident:

  • A single subnet is suspected of compromise.
  • Containment requires changing egress or east‑west routing.
  • The route table is shared.
  • Isolation becomes a multi‑subnet blast event.
  • Legitimate workloads break.
  • On‑call engineers hesitate.
  • The attacker benefits from defender uncertainty.

At that moment, routing is no longer a connectivity concern.
It is a blast‑radius control plane, and you’ve centralised it.

If isolating one subnet requires asking “what else uses this route table?”, the network does not fail safely.
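One way to keep routing blast radius subnet‑local is to give each subnet its own route table instead of sharing one across a tier. A minimal sketch of that shape, with hypothetical subnet names and an assumed firewall IP rather than a drop‑in template:

```bicep
// Sketch only — names and the firewall IP are assumptions.
// Each subnet gets its own UDR, so a containment change to one
// subnet's routing touches nothing else.
var subnetNames = [
  'snet-app1'
  'snet-app2'
]

resource routeTables 'Microsoft.Network/routeTables@2023-11-01' = [for name in subnetNames: {
  name: 'rt-${name}'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'default-via-firewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.0.4' // assumed firewall private IP
        }
      }
    ]
  }
}]
```

With this shape, changing `rt-snet-app1` mid‑incident affects exactly one subnet, and the question “what else uses this route table?” has a one‑word answer.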

Fail‑Safe Means “Small, Reversible Mistakes”

Under pressure, defenders will make imperfect calls.
Fail‑safe networks are designed around that reality.

Good failure looks like:

  • One subnet isolated, not a whole tier
  • One workload degraded, not all egress
  • Defender access preserved
  • Telemetry still flowing
  • Clear rollback paths

Bad failure looks like:

  • Widespread outages caused by containment
  • Automation immediately undoing emergency changes
  • Engineers afraid to touch the network mid‑incident

Assume breach is not about perfect response. It’s about survivable response.

Implementation Examples

This is about shape, not templates.

Subnet‑Scoped Isolation via NSGs

One practical pattern is designing isolation to be subnet‑local, without touching shared routing or central firewalls.

```bicep
resource isolationNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-isolation-app1'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        // Blanket deny: everything inbound is dropped by default.
        name: 'Deny-All-Inbound'
        properties: {
          priority: 100
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          destinationAddressPrefix: '*'
          sourcePortRange: '*'
          destinationPortRange: '*'
        }
      }
      {
        // Carve-out: defenders keep SSH access via Bastion.
        // Note: a higher-numbered priority is evaluated after the deny,
        // so this allow must use a lower number than the deny to win —
        // swap the priorities if you copy this shape.
        name: 'Allow-Defender-Access'
        properties: {
          priority: 110
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'AzureBastion'
          destinationAddressPrefix: '*'
          sourcePortRange: '*'
          destinationPortRange: '22'
        }
      }
    ]
  }
}
```

Boundary of this approach:
NSGs don’t override routing, and they don’t magically preserve return paths. They work because routing blast radius has already been constrained elsewhere. If your isolation depends on a shared UDR, this pattern collapses.

Architectural Shape: Containment Without Global Impact

```mermaid
flowchart LR
  subgraph VNet
    subgraph ZoneA["App Zone A"]
      A1[App Subnet]
    end
    subgraph ZoneB["App Zone B"]
      B1[App Subnet]
    end
    subgraph Control["Defender Control Plane"]
      Bastion
      Logs
    end
  end
  A1 -->|Explicit, Limited Paths| B1
  A1 -->|Telemetry Only| Logs
  Bastion --> A1
```

The important property here is not segmentation; it’s decision locality.

Containment decisions apply where the problem exists, not everywhere trust happens to be shared.

Gotchas & Edge Cases

  • Automation can be an attacker’s ally
    Policy and IaC pipelines that auto‑remediate “drift” will happily undo containment unless explicitly designed not to. This is not a tooling bug, it’s a design failure.

  • Service and private endpoints can surprise you
    Under containment, DNS resolution and endpoint reachability may not follow your mental model of “blocked traffic”. Treat them as reachability shortcuts that need explicit consideration.

  • Platform traffic still matters
    Over‑zealous denies can break health probes, backups, or agent connectivity. Fail‑safe design preserves defender visibility first.

Best Practices

  • Treat routing as a blast‑radius control, not a convenience layer.
  • Avoid shared UDRs where containment scope matters.
  • Design isolation actions that are subnet‑local and reversible.
  • Preserve defender access and telemetry under all containment states.
  • Assume automation will act and decide whether it should during incidents.
🍺
Brewed Insight: I’ve made the shared‑UDR decision myself more than once, because it looked clean and “enterprise‑ready”.
Working through assume breach has forced me to re‑learn that convenience today often becomes fragility tomorrow, and that’s now something I actively push back on in my own designs.

Learn More