Routing as Code with Azure Route Server and Policy‑Based Routing

Most routing incidents aren’t caused by bad math or broken protocols.
They’re caused by unclear intent — nobody can confidently answer why traffic takes a particular path right now.

In Azure, routing is no longer a background configuration you “set once and forget”. It’s an evolving control surface, shaped by software, scale events, and failure domains. If you’re still treating routing like static infrastructure, you’re designing blind.

The Mental Model

The common assumption

Routing is:

A deterministic outcome of prefixes
Mostly static
Something you “finish” early in a design

Once the tables look right, you move on.

Why this breaks

Azure networks are living systems:

Routes appear and disappear dynamically
Control planes make decisions faster than humans can react
Multiple teams influence traffic paths, often unintentionally

The result is fragile certainty — everything looks fine until a route is withdrawn, a UDR is reused, or a firewall scales differently than expected.

How It Really Works

Think of Azure routing as two distinct software responsibilities:

1. Reachability (dynamic)

This is decided by:

System routes
BGP‑learned routes from ExpressRoute, VPN, or Azure Route Server

Azure Route Server doesn’t “optimise” traffic. It injects and withdraws reachability information into the VNet routing plane, based on what its BGP peers advertise. When a peer stops advertising a prefix, Azure stops believing that path exists.

That behaviour matters during failure — not during steady state.
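As a rough sketch of what that peering looks like as code, the fragment below attaches one BGP peer to an existing Route Server. It assumes the Route Server is already deployed (in ARM it is a `virtualHubs` resource with the `Standard` SKU). The names, ASN, and peer IP are hypothetical placeholders, not values from a real environment.

```bicep
// Hypothetical values: adjust names, ASN, and IP to your environment.
param routeServerName string = 'rs-hub'
param nvaPeerAsn int = 65010
param nvaPeerIp string = '10.10.0.4'

// Reference an already deployed Route Server.
resource routeServer 'Microsoft.Network/virtualHubs@2022-09-01' existing = {
  name: routeServerName
}

// One BGP peering per NVA instance. Prefixes the NVA advertises become
// reachable in the VNet; prefixes it withdraws are removed again.
resource nvaPeering 'Microsoft.Network/virtualHubs/bgpConnections@2022-09-01' = {
  parent: routeServer
  name: 'peer-nva-01'
  properties: {
    peerAsn: nvaPeerAsn
    peerIp: nvaPeerIp
  }
}
```

Nothing in this fragment names a prefix. Reachability comes entirely from what the peer chooses to advertise, which is why withdrawal behaviour has to be tested rather than assumed.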

2. Intent (static, but explicit)

This is enforced using:

User Defined Routes (UDRs)
Subnet‑level scoping
Next‑hop selection

UDRs don’t care why a destination exists. They simply say: if traffic is headed here, it must go that way.

The key insight:
BGP answers “where can I go?”
UDRs answer “where am I allowed to go?”
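Subnet-level scoping is the part that turns a route table into intent. A minimal sketch, assuming the VNet and the route table already exist, might look like this. The VNet name, subnet name, and address prefix are illustrative assumptions.

```bicep
// Hypothetical names and address space, for illustration only.
param vnetName string = 'vnet-app'
param routeTableName string = 'rt-app-egress'

resource vnet 'Microsoft.Network/virtualNetworks@2022-09-01' existing = {
  name: vnetName
}

resource routeTable 'Microsoft.Network/routeTables@2022-09-01' existing = {
  name: routeTableName
}

// The association is what scopes the intent: only traffic leaving
// this subnet is subject to the routes in the table.
resource appSubnet 'Microsoft.Network/virtualNetworks/subnets@2022-09-01' = {
  parent: vnet
  name: 'snet-app'
  properties: {
    addressPrefix: '10.10.1.0/24'
    routeTable: {
      id: routeTable.id
    }
  }
}
```

If the VNet is also deployed elsewhere with inline subnets, the two definitions can fight each other, so keep a single owner for each subnet.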

Architecture Overview

flowchart LR
    AppSubnet[App Subnet]
    Firewall[NVA / Firewall]
    RouteServer[Azure Route Server]
    ER[ExpressRoute / VPN]
    Internet[Internet]
    AppSubnet -->|UDR: default route| Firewall
    Firewall --> RouteServer
    RouteServer -->|BGP learned prefixes| Firewall
    Firewall --> ER
    Firewall --> Internet

This isn’t about elegance. It’s about predictability under change.

Real‑World Impact

Designing routing this way changes how you operate:

Failure is signalled, not hidden
When an NVA or SD-WAN appliance stops advertising routes, Azure removes them as soon as the withdrawal arrives or the BGP session's hold timer expires. There's no stale static route pretending everything is fine.
Blast radius becomes intentional
Small, scoped route tables mean one bad decision doesn’t rewrite half the VNet.
Routing changes become reviewable
When UDRs and BGP configuration are defined as artefacts, you can reason about diffs, intent, and rollback — not just outcomes.
Incident response gets faster
Engineers debug effective routes, not portal click history.
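One way to keep blast radius intentional is to generate a small route table per workload instead of sharing one. The loop below is a sketch under that assumption. The `egressPolicies` parameter and its entries are hypothetical.

```bicep
param location string = resourceGroup().location

// Hypothetical workloads; each gets its own table and its own next hop.
param egressPolicies array = [
  {
    name: 'app'
    nextHopIp: '10.10.0.4'
  }
  {
    name: 'data'
    nextHopIp: '10.10.0.4'
  }
]

// One route table per workload, so a change to one entry shows up
// as a diff against exactly one table.
resource routeTables 'Microsoft.Network/routeTables@2022-09-01' = [for policy in egressPolicies: {
  name: 'rt-${policy.name}-egress'
  location: location
  properties: {
    disableBgpRoutePropagation: false
    routes: [
      {
        name: 'force-egress-firewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: policy.nextHopIp
        }
      }
    ]
  }
}]
```

The tables start out identical, which is fine. The point is that they can diverge later without dragging unrelated subnets along.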

Implementation Examples

Azure Portal – Where This Actually Bites

Operationally important checks (often missed):

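Is route propagation enabled where you expect dynamic routes? (In the portal this is the route table's "Propagate gateway routes" setting; in templates it is disableBgpRoutePropagation.)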
Are UDRs unintentionally overriding critical BGP paths?
Do effective routes differ between subnets that “should” be identical?

Most production routing issues show up here — not during deployment.

Bicep – Making Routing Intent Explicit

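resource routeTable 'Microsoft.Network/routeTables@2022-09-01' = {
  name: 'rt-app-egress'
  location: resourceGroup().location
  properties: {
    // false keeps BGP-learned routes (gateway, Route Server) flowing
    // into every subnet associated with this table.
    disableBgpRoutePropagation: false
    routes: [
      {
        // Intent: all traffic without a more specific route goes to the firewall.
        name: 'force-egress-firewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          // Private IP of the NVA / firewall. Azure does not health-check it.
          nextHopIpAddress: '10.10.0.4'
        }
      }
    ]
  }
}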

The value here isn’t the route itself — it’s that intent is now inspectable:

Why all egress?
Why this next hop?
What breaks if it’s removed?

That conversation is the win.

Gotchas & Edge Cases (Where Designs Usually Fail)

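UDRs override BGP for the same prefix, not always
Azure picks the longest matching prefix first and only then prefers a UDR over a BGP route, so a more specific BGP-learned prefix still beats a broader UDR such as 0.0.0.0/0. And if a UDR points to a dead next hop, Azure will happily black-hole traffic. Dynamic routing will not save you.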
Asymmetric routing isn’t theoretical
It happens the moment inbound and outbound paths are controlled by different teams or constructs.
Shared route tables amplify mistakes
Reuse feels efficient until one “small change” alters traffic for dozens of subnets.
Route Server adds complexity you must own
If you don’t actively test route withdrawal scenarios, you’re just hoping it works.
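A related nuance worth sketching: Azure selects the longest matching prefix before it compares route sources. With propagation enabled, a BGP-learned on-premises prefix is more specific than a 0.0.0.0/0 UDR and can bypass the firewall. The fragment below pins that prefix to the firewall as well. The on-premises range and resource names are hypothetical.

```bicep
param location string = resourceGroup().location
param firewallIp string = '10.10.0.4'

// Hypothetical on-premises range advertised over BGP.
param onPremPrefix string = '10.100.0.0/16'

resource routeTable 'Microsoft.Network/routeTables@2022-09-01' = {
  name: 'rt-app-inspected'
  location: location
  properties: {
    disableBgpRoutePropagation: false
    routes: [
      {
        name: 'default-via-firewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: firewallIp
        }
      }
      {
        // Same prefix length as the BGP advertisement, so the UDR wins the tie.
        name: 'onprem-via-firewall'
        properties: {
          addressPrefix: onPremPrefix
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: firewallIp
        }
      }
    ]
  }
}
```

This only holds while the advertised prefix matches the UDR. If on-premises starts advertising a /24 inside that range, the more specific BGP route wins again, which is exactly the kind of drift effective routes will show you.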

Best Practices

Use Azure Route Server only when dynamic reachability actually matters
Keep UDRs small, scoped, and boring
Test failures by withdrawing BGP routes — deliberately
Treat effective routes as a first‑class debugging tool
Document why a routing decision exists, not just the prefix
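Documenting the why can live next to the route itself. The sketch below uses parameter descriptions and resource tags for that. The tag keys, owner value, and wording are assumptions, not an established convention.

```bicep
@description('Private IP of the egress firewall. All internet-bound traffic from the app subnet is inspected here.')
param firewallIp string = '10.10.0.4'

@description('Short reason recorded on the resource so the intent is visible outside the repo.')
param routingReason string = 'Force all app egress through the firewall for inspection'

resource routeTable 'Microsoft.Network/routeTables@2022-09-01' = {
  name: 'rt-app-egress'
  location: resourceGroup().location
  // Hypothetical tag keys; pick whatever your tagging policy allows.
  tags: {
    'routing-intent': routingReason
    owner: 'network-platform'
  }
  properties: {
    disableBgpRoutePropagation: false
    routes: [
      {
        name: 'force-egress-firewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: firewallIp
        }
      }
    ]
  }
}
```

Tags travel with the resource, so whoever opens the route table during an incident sees the reason without hunting for the repository.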

🍺

Brewed Insight: If your routing design only works when nothing changes, it’s not a design — it’s a coincidence.