Azure networks rarely fall over outright.
They drift.
Latency stretches just enough to trip timeouts. SNAT usage inches upward until one busy afternoon tips it over. Traffic takes a longer path than it did last week, and nobody notices until customers do.
If your networking strategy only detects outages, you’re designing for yesterday’s problems.
This post makes a deliberate argument: programmable Azure networks depend on observability first, and not the dashboard‑heavy kind. The kind that tells you when behaviour changes, not just when things break.
The Mental Model
The common assumption
“If the network is up and packets are flowing, it’s fine.”
Why it breaks
Azure networking is a distributed software system. Paths change, capacity shifts, policies evolve, and platform behaviour is opaque by default. Most failures are gradual and cumulative, not binary.
An observable network doesn’t ask: “Is it reachable?”
It asks: “Is it behaving the same way it did yesterday, and if not, do I know why?”
That shift matters because automation amplifies whatever assumptions you bake in. Automating a blind network just helps you fail faster.
How It Really Works
Azure already emits an enormous amount of network telemetry. The problem isn’t access — it’s signal selection.
At a practical level, network observability in Azure comes from four places:
- Metrics that show pressure (latency, SNAT ports, throughput)
- Logs that show decisions (NSG allows/denies, firewall policy matches)
- Topology state that shows intent vs reality (effective routes, effective rules)
- Platform events that show change (maintenance, scaling, backend movement)
Treating all of these as equal is a mistake. Observable design is about choosing behavioural signals that map to outcomes you actually care about.
If everything is a signal, nothing is.
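To make that concrete: the first two categories often arrive through a single diagnostic setting, which can ship a resource's pressure metrics and decision logs to the same workspace. A minimal sketch for Azure Firewall, with placeholder resource names and the generic 'allLogs' category group standing in for whichever log categories you actually care about:

```bicep
// Sketch: ship both "pressure" (metrics) and "decisions" (logs) from an
// Azure Firewall to Log Analytics. Resource names are placeholders.
resource firewall 'Microsoft.Network/azureFirewalls@2023-04-01' existing = {
  name: 'afw-hub-prod'
}

resource logAnalytics 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: 'law-network-observability'
}

resource firewallDiagnostics 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'send-to-law'
  scope: firewall
  properties: {
    workspaceId: logAnalytics.id
    logs: [
      {
        categoryGroup: 'allLogs' // swap for explicit categories once you know which decisions matter
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}
```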
The Control Loop (Not a Pipeline)
Signals feed interpretation, interpretation feeds a decision, and whatever action follows changes the signals you see next. This is a control loop, not a deployment pipeline.
The important design decision is the decision point. That’s where you choose whether a human, a runbook, or a policy engine gets involved — and under what conditions.
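In Azure Monitor terms, that decision point often lands in an action group: the same signal can page a human, call a webhook that kicks off a runbook, or simply attach context to an incident. A rough sketch, with hypothetical names and endpoints:

```bicep
// Sketch of a decision point: one action group that notifies a human and
// (optionally) calls an automation webhook. Names and URIs are placeholders.
resource networkOpsActions 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-network-ops'
  location: 'global'
  properties: {
    groupShortName: 'netops'
    enabled: true
    emailReceivers: [
      {
        name: 'on-call'
        emailAddress: 'network-oncall@example.com'
        useCommonAlertSchema: true
      }
    ]
    webhookReceivers: [
      {
        name: 'runbook-hook'
        serviceUri: 'https://example.com/hooks/network-review' // hypothetical runbook endpoint
        useCommonAlertSchema: true
      }
    ]
  }
}
```

The useful property of this pattern is that the alert logic and the response wiring stay separate: you can change who, or what, responds without touching the signal itself.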
Real‑World Impact
Designing for observability changes concrete engineering decisions:
Reliability
You detect partial failure before it becomes customer‑visible.
Security
You notice traffic patterns changing, not just packets being blocked.
Cost
You see pressure building (egress, SNAT, firewall throughput) before it becomes an incident or a bill shock.
Change safety
You can observe blast radius instead of relying on change success messages.
Most importantly, it lets you automate safely. You’re no longer reacting to symptoms — you’re responding to trends.
What Signals Are Actually Worth Keeping
If you had to delete half your network telemetry tomorrow, these are the signals worth defending in Azure designs:
SNAT port utilisation on NAT Gateway or Load Balancer
Because exhaustion is silent until it isn’t.
NSG or Firewall deny rate trends, not individual denies
Because behaviour shifts matter more than single events.
Path length or next‑hop changes via effective routes
Because unexpected hair‑pinning kills latency and trust.
Everything else is secondary until you’ve made these observable and explainable.
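SNAT pressure is the most straightforward of the three to wire up, because it is a plain platform metric. A sketch for a Standard Load Balancer follows; the metric name, threshold, and resource names are assumptions to validate against what your resource actually emits, and the action group is the hypothetical one from earlier.

```bicep
// Sketch: alert on sustained SNAT port usage for a Standard Load Balancer.
// Metric name, threshold, and resource names are assumptions - validate them
// against the metrics your resource actually emits.
resource lb 'Microsoft.Network/loadBalancers@2023-04-01' existing = {
  name: 'lb-prod'
}

resource networkOpsActions 'Microsoft.Insights/actionGroups@2023-01-01' existing = {
  name: 'ag-network-ops'
}

resource snatPressureAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-snat-pressure'
  location: 'global'
  properties: {
    description: 'Sustained SNAT port usage - investigate before exhaustion.'
    severity: 2
    enabled: true
    scopes: [ lb.id ]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M' // sustained pressure, not a single spike
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'UsedSnatPortsHigh'
          metricName: 'UsedSnatPorts'
          metricNamespace: 'Microsoft.Network/loadBalancers'
          operator: 'GreaterThan'
          threshold: 800 // example value: set this relative to your allocated ports
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: networkOpsActions.id
      }
    ]
  }
}
```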
Implementation Examples
Baseline Signal Emission (Portal)
For production Azure networks, this should be non‑negotiable:
- NSG Flow Logs v2 enabled
- Firewall logs sent to Log Analytics (if used)
- Load Balancer and NAT Gateway metrics enabled
- Diagnostics standardised across environments
This isn’t about dashboards. It’s about ensuring signals exist when you need to ask hard questions.
Declarative Observability with Bicep
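As a minimal sketch, here is the NSG Flow Logs v2 piece expressed declaratively. The Network Watcher, NSG, and storage account names are placeholders, and everything is assumed to sit in the same resource group as the regional Network Watcher:

```bicep
// Sketch: NSG Flow Logs v2 for one NSG, retained for 30 days.
// Network Watcher, NSG, and storage account names are placeholders.
resource networkWatcher 'Microsoft.Network/networkWatchers@2023-04-01' existing = {
  name: 'NetworkWatcher_uksouth'
}

resource nsg 'Microsoft.Network/networkSecurityGroups@2023-04-01' existing = {
  name: 'nsg-app-prod'
}

resource flowLogStorage 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: 'stnetworkflowlogs'
}

resource nsgFlowLog 'Microsoft.Network/networkWatchers/flowLogs@2023-04-01' = {
  parent: networkWatcher
  name: 'fl-nsg-app-prod'
  location: resourceGroup().location
  properties: {
    targetResourceId: nsg.id
    storageId: flowLogStorage.id
    enabled: true
    format: {
      type: 'JSON'
      version: 2 // v2 adds flow state and byte/packet counts
    }
    retentionPolicy: {
      days: 30
      enabled: true
    }
  }
}
```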
This snippet is intentionally boring.
Observable networks are built from consistency, not cleverness.
Programmability comes later, once the signals are trustworthy.
Closing the Loop Without Hurting Yourself
Not every signal should trigger automation. In fact, most shouldn’t.
Good automation candidates:
- Scaling NAT Gateway based on sustained SNAT pressure
- Flagging unexpected route changes for review
- Opening incidents with topology context attached
Bad automation candidates:
- Reacting automatically to transient route updates
- Blocking traffic based on short‑lived anomaly spikes
- Remediating without understanding blast radius
The rule of thumb is simple:
If you wouldn’t trust the signal to wake you at 3am, don’t let it reconfigure your network.
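For the “flag it for review” category, a log-based alert that only notifies people is about as far as automation should go at first. A sketch of a deny-rate trend rule follows; the Traffic Analytics table, the query, and the thresholds are assumptions and depend entirely on where your NSG or firewall logs actually land.

```bicep
// Sketch: a deny-rate trend alert that notifies people rather than
// reconfiguring anything. The table, query, and thresholds are assumptions -
// adjust them to wherever your NSG or firewall logs actually land.
resource logAnalytics 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: 'law-network-observability'
}

resource networkOpsActions 'Microsoft.Insights/actionGroups@2023-01-01' existing = {
  name: 'ag-network-ops'
}

resource denyTrendAlert 'Microsoft.Insights/scheduledQueryRules@2022-06-15' = {
  name: 'sqr-deny-rate-trend'
  location: resourceGroup().location
  properties: {
    displayName: 'Deny rate trending above baseline'
    severity: 3
    enabled: true
    scopes: [ logAnalytics.id ]
    evaluationFrequency: 'PT15M'
    windowSize: 'PT1H'
    criteria: {
      allOf: [
        {
          // Hypothetical query: count denied flows from Traffic Analytics data.
          query: 'AzureNetworkAnalytics_CL | where FlowStatus_s == "D" | summarize Denies = count()'
          timeAggregation: 'Total'
          metricMeasureColumn: 'Denies'
          operator: 'GreaterThan'
          threshold: 500 // example baseline: derive yours from observed history
          failingPeriods: {
            numberOfEvaluationPeriods: 4
            minFailingPeriodsToAlert: 3 // a sustained shift, not a single spike
          }
        }
      ]
    }
    actions: {
      actionGroups: [ networkOpsActions.id ]
    }
  }
}
```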
Gotchas and Edge Cases
- Control‑plane state often lags data‑plane reality
- Azure platform changes can shift “normal” behaviour overnight
- High‑cardinality logs can drown out the signals you care about
- Observability costs money — design deliberately
Day‑2 networking failures are usually design omissions, not platform bugs.
Best Practices
- Treat observability as a first‑class requirement, not an add‑on
- Design signals around outcomes, not resource types
- Separate telemetry, interpretation, and action
- Baseline behaviour before you automate anything
- Revisit observability whenever traffic patterns change
Observability isn’t about seeing more; it’s about knowing what matters.