Routing Tree#
Alertmanager receives alerts from Prometheus and decides where to send them based on a routing tree. Every alert enters at the root route and travels down the tree until it matches a child route. If no child matches, the root route’s receiver handles it.
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxx"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: "default-slack"
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-oncall"
      group_wait: 10s
      repeat_interval: 1h
      routes:
        - match:
            team: database
          receiver: "pagerduty-dba"
    - match:
        severity: warning
      receiver: "team-slack"
      repeat_interval: 12h
    - match_re:
        namespace: "staging|dev"
      receiver: "dev-slack"
      repeat_interval: 24h
```

Timing parameters matter. group_wait is how long Alertmanager waits after receiving the first alert in a new group before sending the notification; this lets it batch related alerts together. group_interval is the minimum time before sending updates about a group that has already fired. repeat_interval controls how often an unchanged, still-active alert is re-sent.
Setting group_wait: 30s with group_by: ["alertname", "namespace"] means if 15 pods in the same namespace trigger the same alert within 30 seconds, they arrive as a single notification with 15 firing instances.
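The labels these routes match on are attached by the Prometheus alerting rules themselves (plus any external labels). As a sketch, a rule like the following would carry the severity and team labels needed to reach the pagerduty-dba route above; the alert name, metric, and threshold are illustrative assumptions:

```yaml
# prometheus-rules.yml (illustrative; metric and threshold are assumptions)
groups:
  - name: database.rules
    rules:
      - alert: PostgresReplicationLagHigh
        expr: pg_replication_lag_seconds > 300
        for: 5m
        labels:
          severity: critical   # matched by the severity: critical route
          team: database       # matched by the nested team: database route
        annotations:
          summary: "Replication lag above 5 minutes"
          description: "Replica {{ $labels.instance }} is {{ $value | humanizeDuration }} behind the primary."
```

The namespace label used for grouping typically comes from the underlying series rather than from the rule itself.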
Receivers#
Slack:
```yaml
receivers:
  - name: "team-slack"
    slack_configs:
      - channel: "#alerts"
        send_resolved: true
        title: '{{ .GroupLabels.alertname }} ({{ .Status | toUpper }})'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.alertname }}* - {{ .Annotations.summary }}
          {{ .Annotations.description }}
          {{ end }}
```

PagerDuty:
- name: "pagerduty-oncall"
pagerduty_configs:
- routing_key: "your-pagerduty-integration-key"
severity: '{{ if eq .GroupLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'Webhook (for custom integrations):
- name: "webhook-custom"
webhook_configs:
- url: "http://alert-handler.internal:8080/alerts"
send_resolved: true
max_alerts: 0 # 0 = no limitOpsGenie:
- name: "opsgenie-oncall"
opsgenie_configs:
- api_key: "your-opsgenie-api-key"
message: '{{ .GroupLabels.alertname }}'
priority: '{{ if eq .GroupLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
responders:
- type: team
name: "platform-team"Email:
- name: "email-oncall"
email_configs:
- to: "oncall@company.com"
from: "alertmanager@company.com"
smarthost: "smtp.company.com:587"
auth_username: "alertmanager"
auth_password: "password"
send_resolved: trueInhibition Rules#
Inhibition suppresses notifications for less severe alerts when a more severe alert is already firing for the same target. This prevents alert storms where a node going down triggers dozens of downstream alerts.
```yaml
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ["alertname", "namespace", "instance"]
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - severity =~ "warning|critical"
    equal: ["instance"]
  - source_matchers:
      - alertname = ClusterUnreachable
    target_matchers:
      - alertname =~ ".+"
    equal: ["cluster"]
```

The third rule suppresses all alerts for a cluster when that cluster is unreachable; there is no point paging someone about pod restarts in a cluster they cannot reach.
Silences#
Silences temporarily mute alerts matching specific label matchers. Create them via the Alertmanager UI or amtool:
```bash
# Silence all warnings for the staging namespace for 2 hours
amtool silence add --alertmanager.url=http://localhost:9093 \
  severity=warning namespace=staging \
  --duration=2h \
  --comment="Deploying to staging, expected alerts"

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence early
amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>
```

Alert Templates#
Alertmanager uses Go templating. Templates control notification content and can be defined inline or in external template files.
```yaml
# alertmanager.yml
templates:
  - "/etc/alertmanager/templates/*.tmpl"
```

```
{{/* /etc/alertmanager/templates/slack.tmpl */}}
{{ define "slack.custom.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{- end }}

{{ define "slack.custom.text" -}}
{{ range .Alerts }}
{{ if eq .Status "firing" }}:fire:{{ else }}:white_check_mark:{{ end }} *{{ .Labels.alertname }}*
> {{ .Annotations.summary }}
> *Namespace:* `{{ .Labels.namespace }}` | *Pod:* `{{ .Labels.pod }}`
> *Value:* {{ .Annotations.value }}
> *Since:* {{ .StartsAt.Format "2006-01-02 15:04:05 MST" }}
{{ end }}
{{- end }}
```

Reference templates in your receiver config:
```yaml
receivers:
  - name: "team-slack"
    slack_configs:
      - channel: "#alerts"
        title: '{{ template "slack.custom.title" . }}'
        text: '{{ template "slack.custom.text" . }}'
```

High Availability#
Alertmanager supports HA through a gossip protocol. Run multiple instances and they coordinate deduplication so each alert is sent only once. All instances must share the same configuration.
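For deduplication to work, each Prometheus server should send alerts to every Alertmanager replica directly rather than through a load balancer, since every instance needs to see every alert. A minimal sketch of the Prometheus side (hostnames are illustrative):

```yaml
# prometheus.yml -- hostnames are assumptions
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0:9093
            - alertmanager-1:9093
```

Starting two peered instances by hand looks like this: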
```bash
# Instance 1
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094

# Instance 2
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-0:9094
```

In Kubernetes with kube-prometheus-stack, HA is configured via the Alertmanager CRD:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
spec:
  replicas: 3
  configSecret: alertmanager-config
```

Debugging Alert Delivery#
When alerts are not arriving where you expect:
```bash
# Check what alerts Alertmanager currently has
amtool alert query --alertmanager.url=http://localhost:9093

# Test routing: which receiver would this alert hit?
amtool config routes test --alertmanager.url=http://localhost:9093 \
  severity=critical team=database namespace=production

# Show the full routing tree
amtool config routes show --alertmanager.url=http://localhost:9093

# Validate config syntax
amtool check-config alertmanager.yml
```

The most common routing mistake is forgetting that continue: false is the default. Once an alert matches a route, it stops traversing the tree. Add continue: true to a route if an alert should also be checked against the sibling routes below it. This is useful when you want both Slack and PagerDuty to receive the same critical alert through separate routes.
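As a sketch of that pattern, reusing the receiver names from the examples above, the first critical route sets continue: true so the sibling route after it is still evaluated:

```yaml
route:
  receiver: "default-slack"
  routes:
    - match:
        severity: critical
      receiver: "team-slack"
      continue: true   # keep evaluating the sibling routes below
    - match:
        severity: critical
      receiver: "pagerduty-oncall"
```

Both routes match the same alert, so it is delivered to Slack and to PagerDuty.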