Azure Observability on Tap - End-to-End Azure Monitoring - Part 3

Configuring Azure Monitor Alerts and Action Groups. What good is observability if you don't get the heads-up? Let's wire up actionable alerts so your coffee stays hot.

Why Alerts Matter

Setting up dashboards and logging is great—but it won’t help much at 2am if no one’s watching.

Azure Monitor alerts are your line of defence between quiet systems and production chaos. Combined with Action Groups, they make sure the right people (or scripts) get notified and take action ASAP.

With this post, you’ll learn:

  • How to create metric and log-based alerts
  • How Action Groups work—and what goes into an alert
  • Bonus tips for noise reduction and targeting alerts by environment or team

What Are Azure Alerts & Action Groups?

Azure Monitor Alerts let you trigger notifications or automated actions when a signal crosses a threshold or matches a query.

There are three common types:

| Type | Triggered by | Example |
| --- | --- | --- |
| Metric alert | Real-time metrics | CPU > 80% for 5 mins |
| Log alert | KQL query results from Log Analytics | HTTP 500 errors > 10 in 5 mins |
| Activity log alert | Azure resource event | VM deleted or NSG updated |

Action Groups define what happens—like sending email/SMS, calling a webhook, or triggering automation.

Azure Portal Walkthrough

Step 1: Create an Action Group (Notification Recipient)

  1. Go to Azure Monitor → Alerts → Action Groups
  2. Click + Create
  3. Choose:
    • Name & short name (used in alert UI)
    • Resource Group & Region
  4. Under Notifications, pick:
    • Email, SMS, Push notification, or ITSM
  5. Under Actions, choose:
    • Webhook, Azure Function, Logic App, Automation Runbook
  6. Give it a tag like env = production or team = ops
  7. Click Review + Create

Step 2: Create a Metric Alert Rule

Let’s alert on something common, like high CPU on a VM:

  1. Azure Monitor → Alerts → + Create → Alert rule
  2. Under Scope, choose a VM or App Service
  3. Under Condition, choose a signal (e.g. Percentage CPU)
  4. Hit More options:
    • Set threshold (e.g. Above 80%)
    • Choose aggregation (e.g. average over last 5 mins)
  5. Under Actions, select the Action Group you created
  6. Name it clearly (e.g. vm-cpu-prod-high)
  7. Review + Create
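
If you'd rather capture the same rule as code, here's a minimal Bicep sketch of a static-threshold version (the full dynamic-threshold example appears in the Bicep section below). The VM name and Action Group ID parameters are placeholders for your own resources:

@description('Target VM name (placeholder)')
param vmName string = 'myVM'

@description('Resource ID of an existing Action Group (placeholder)')
param actionGroupId string

// Reference the existing Virtual Machine in the current resource group
resource vm 'Microsoft.Compute/virtualMachines@2024-11-01' existing = {
  name: vmName
}

// Static-threshold alert: fire when average CPU stays above 80% over a 5-minute window
resource cpuStaticAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'vm-cpu-prod-high'
  location: 'global'
  properties: {
    severity: 2
    enabled: true
    scopes: [
      vm.id
    ]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'HighCpu'
          metricNamespace: 'Microsoft.Compute/virtualMachines'
          metricName: 'Percentage CPU'
          operator: 'GreaterThan'
          threshold: 80
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}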

🔎 Step 3: Create a Log-Based Alert (e.g. HTTP 5xx from App Service)

  1. Same flow → but choose Log Analytics Workspace for Scope

  2. Condition → Custom log search

  3. Example KQL query:

    AppRequests
    | where ResultCode startswith '5'
    | summarize count() by bin(TimeGenerated, 5m)
    | where count_ > 10
    
  4. Set frequency (5m) and lookback (5m)

  5. Attach your Action Group

  6. Finish and test
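
If you want this log alert as code too, here's a hedged Bicep sketch using the scheduledQueryRules resource (the metric alert and Action Group equivalents follow in the Bicep section below). The workspace and Action Group IDs are placeholder parameters:

@description('Resource ID of the Log Analytics workspace (placeholder)')
param workspaceId string

@description('Resource ID of the Action Group created in Step 1 (placeholder)')
param actionGroupId string

param location string = resourceGroup().location

// Log alert: fire when more than 10 HTTP 5xx requests land in a 5-minute window
resource http5xxAlert 'Microsoft.Insights/scheduledQueryRules@2021-08-01' = {
  name: 'appsvc-5xx-prod'
  location: location
  kind: 'LogAlert'
  properties: {
    displayName: 'App Service HTTP 5xx spike'
    severity: 2
    enabled: true
    scopes: [
      workspaceId
    ]
    evaluationFrequency: 'PT5M' // how often the query runs
    windowSize: 'PT5M'          // lookback period
    criteria: {
      allOf: [
        {
          query: 'AppRequests | where ResultCode startswith "5"'
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 10
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    autoMitigate: true
    actions: {
      actionGroups: [
        actionGroupId
      ]
    }
  }
}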

Bicep: Create Alert Rule and Action Group via IaC

Here’s an example setup for a dynamic metric-based alert + action group:

@description('Action Group Name for alert notifications')
param actionGroupName string = 'MyMonitoringActionGroup'

@description('Email address to notify')
param emailToNotify string = 'someone@something.com'

@description('Target VM Name')
param vmname string = 'myVM'

@description('Resource Group Name for the target VM')
param targetVmResourceGroup string = 'MyVMResourceGroup'

@description('Alert rule name suffix, appended to the VM name')
param metricName string = '-HighCPU'

// Reference the existing Virtual Machine
resource targetVm 'Microsoft.Compute/virtualMachines@2024-11-01' existing = {
  name: vmname
  scope: resourceGroup(targetVmResourceGroup)
}

// Create Action Group
resource actionGroup 'Microsoft.Insights/actionGroups@2024-10-01-preview' = {
  name: actionGroupName
  location: 'global'
  properties: {
    enabled: true
    groupShortName: 'ProdAlerts'
    emailReceivers: [
      {
        name: 'PrimaryOpsContact'
        emailAddress: emailToNotify
      }
    ]
  }
}

// Create a Metric Alert Rule for High CPU using dynamic threshold with low sensitivity
resource highCpuMetricAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: '${vmname}${metricName}'
  location: 'global'
  properties: {
    severity: 1
    enabled: true
    scopes: [
      targetVm.id
    ]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      allOf: [
        {
          alertSensitivity: 'Low'
          failingPeriods: {
            numberOfEvaluationPeriods: 4
            minFailingPeriodsToAlert: 4
          }
          name: 'Metric1'
          metricNamespace: 'microsoft.compute/virtualmachines'
          metricName: 'Percentage CPU'
          operator: 'GreaterOrLessThan'
          timeAggregation: 'Average'
          criterionType: 'DynamicThresholdCriterion'
        }
      ]
      'odata.type': 'Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria'
    }
    autoMitigate: true
    targetResourceType: 'Microsoft.Compute/virtualMachines'
    targetResourceRegion: resourceGroup().location
    actions: [
      {
        actionGroupId: actionGroup.id
      }
    ]
  }
}
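
The table earlier also listed a third type, activity log alerts (e.g. a VM being deleted), which the template above doesn't cover. A minimal Bicep sketch might look like this; the rule name and Action Group parameter are placeholders (you could pass in actionGroup.id from the template above):

@description('Resource ID of an existing Action Group (placeholder)')
param activityAlertActionGroupId string

// Activity log alert: fire whenever a VM delete operation is logged in this subscription
resource vmDeleteAlert 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
  name: 'vm-deleted-prod'
  location: 'global'
  properties: {
    enabled: true
    scopes: [
      subscription().id
    ]
    condition: {
      allOf: [
        {
          field: 'category'
          equals: 'Administrative'
        }
        {
          field: 'operationName'
          equals: 'Microsoft.Compute/virtualMachines/delete'
        }
      ]
    }
    actions: {
      actionGroups: [
        {
          actionGroupId: activityAlertActionGroupId
        }
      ]
    }
  }
}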

Architecture Overview

graph TD
  AlertRule1[CPU Alert Rule] --> AG1[Action Group]
  AlertRule2[Log KQL Alert Rule] --> AG1
  AG1 --> Email[Email Notification]
  LogSearch --> AlertRule2
  MetricCPU --> AlertRule1
🍺
Brewed Insight:

You can also add a webhook action, which sends alert details (severity, resource ID, timestamp, description) as a structured JSON payload to a monitoring or incident-management system (e.g., PagerDuty, Splunk, ServiceNow, or Slack via middleware).

The receiving endpoint can then trigger custom logic, such as automatically creating tickets or incidents, kicking off a remediation script, or logging and auditing alert events.
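
To wire that up in Bicep, the Action Group just needs a webhook receiver. Here's a minimal sketch using the same API version as the template above; the group name and serviceUri are placeholders, and useCommonAlertSchema opts into the structured JSON payload:

// Action Group with an email receiver plus a webhook receiver (placeholder endpoint)
resource opsActionGroup 'Microsoft.Insights/actionGroups@2024-10-01-preview' = {
  name: 'OpsWebhookActionGroup'
  location: 'global'
  properties: {
    enabled: true
    groupShortName: 'OpsHook'
    emailReceivers: [
      {
        name: 'PrimaryOpsContact'
        emailAddress: 'someone@something.com'
        useCommonAlertSchema: true
      }
    ]
    webhookReceivers: [
      {
        name: 'IncidentWebhook'
        serviceUri: 'https://example.com/alerts/webhook' // placeholder for your incident-management endpoint
        useCommonAlertSchema: true // send the standard structured JSON payload
      }
    ]
  }
}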

Gotchas & Guidance

  • Alerts on ephemeral resources: Avoid alerting on short-lived services unless scoped by tag or app layer.
  • Avoid duplicates: Azure doesn’t deduplicate alert rules automatically. Be strict with naming conventions like app-env-metric-type.
  • Log alerts = delayed reaction (~5m+ latency). Use metric alerts for near real-time scenarios.
  • If you're alerting on App Gateway or Firewall logs, watch out for high volume and sample/summarise in KQL.

Best Practices

  • Always name alerts like you name functions: clear, consistent, lowercase (apigw-5xx-prod, vm-cpu-east1)
  • Assign owners via tags (e.g. team = sre)
  • Use severity levels with meaning (e.g. Sev 0 = page ops, Sev 3 = Slack-only)
  • Collect and track alerts in a central Log Analytics workspace if you're doing alert-fatigue analysis
  • Enable alert suppression and smart groups to reduce noise
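
Suppression itself can also be declared as code via an alert processing rule. Here's a hedged Bicep sketch that mutes all Action Groups for alerts fired in a resource group during a maintenance window; the rule name and dates are placeholders:

// Alert processing rule: strip all Action Groups from alerts fired during the window
resource maintenanceSuppression 'Microsoft.AlertsManagement/actionRules@2021-08-08' = {
  name: 'suppress-maintenance-window'
  location: 'global'
  properties: {
    enabled: true
    description: 'Mute notifications for alerts fired during planned maintenance'
    scopes: [
      resourceGroup().id // applies to every alert fired on resources in this resource group
    ]
    actions: [
      {
        actionType: 'RemoveAllActionGroups'
      }
    ]
    schedule: {
      effectiveFrom: '2025-06-01T22:00:00'  // placeholder window start
      effectiveUntil: '2025-06-02T02:00:00' // placeholder window end
      timeZone: 'UTC'
    }
  }
}
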
🍺
Brewed Insight: If you're looking to do Azure Monitor at a larger scale, check out Cloud Clarity's Azure Monitor Script in the Learn More section below. Built by the team at cubesys & Cloud Clarity, it's about deploying Azure Monitor metric alerts at scale and managing exceptions through tags! If you add some DevOps pipelines, you can deploy it on a schedule to make sure all new resources get the metric alerts you've defined in a controlled and consistent way.

Learn More