Routing Tree#
Alertmanager receives alerts from Prometheus and decides where to send them based on a routing tree. Every alert enters at the root route and travels down the tree until it matches a child route. If no child matches, the root route’s receiver handles it.
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxx"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: "default-slack"
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-oncall"
      group_wait: 10s
      repeat_interval: 1h
      routes:
        - match:
            team: database
          receiver: "pagerduty-dba"
    - match:
        severity: warning
      receiver: "team-slack"
      repeat_interval: 12h
    - match_re:
        namespace: "staging|dev"
      receiver: "dev-slack"
      repeat_interval: 24h
```

Timing parameters matter. group_wait is how long Alertmanager waits after receiving the first alert in a new group before sending the notification; this lets it batch related alerts together. group_interval is the minimum time before sending updates about a group that has already fired. repeat_interval controls how often an unchanged, still-active alert is re-sent.
Setting group_wait: 30s with group_by: ["alertname", "namespace"] means if 15 pods in the same namespace trigger the same alert within 30 seconds, they arrive as a single notification with 15 firing instances.
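The labels these routes match on are attached by the Prometheus alerting rules themselves (plus any external labels). As a sketch, a rule like the following would carry the severity and team labels needed to reach the pagerduty-dba route above; the alert name, metric, and threshold are illustrative assumptions:

```yaml
# prometheus-rules.yml (illustrative; metric and threshold are assumptions)
groups:
  - name: database.rules
    rules:
      - alert: PostgresReplicationLagHigh
        expr: pg_replication_lag_seconds > 300
        for: 5m
        labels:
          severity: critical   # matched by the severity: critical route
          team: database       # matched by the nested team: database route
        annotations:
          summary: "Replication lag above 5 minutes"
          description: "Replica {{ $labels.instance }} is {{ $value | humanizeDuration }} behind the primary."
```

The namespace label used for grouping typically comes from the underlying series rather than from the rule itself.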
Receivers#
Slack:
```yaml
receivers:
  - name: "team-slack"
    slack_configs:
      - channel: "#alerts"
        send_resolved: true
        title: '{{ .GroupLabels.alertname }} ({{ .Status | toUpper }})'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.alertname }}* - {{ .Annotations.summary }}
          {{ .Annotations.description }}
          {{ end }}
```

PagerDuty:
- name: "pagerduty-oncall"
pagerduty_configs:
- routing_key: "your-pagerduty-integration-key"
severity: '{{ if eq .GroupLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'Webhook (for custom integrations):
- name: "webhook-custom"
webhook_configs:
- url: "http://alert-handler.internal:8080/alerts"
send_resolved: true
max_alerts: 0 # 0 = no limitOpsGenie:
- name: "opsgenie-oncall"
opsgenie_configs:
- api_key: "your-opsgenie-api-key"
message: '{{ .GroupLabels.alertname }}'
priority: '{{ if eq .GroupLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
responders:
- type: team
name: "platform-team"Email:
- name: "email-oncall"
email_configs:
- to: "oncall@company.com"
from: "alertmanager@company.com"
smarthost: "smtp.company.com:587"
auth_username: "alertmanager"
auth_password: "password"
send_resolved: trueInhibition Rules#
Inhibition suppresses notifications for less severe alerts when a more severe alert is already firing for the same target. This prevents alert storms where a node going down triggers dozens of downstream alerts.
```yaml
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ["alertname", "namespace", "instance"]
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - severity =~ "warning|critical"
    equal: ["instance"]
  - source_matchers:
      - alertname = ClusterUnreachable
    target_matchers:
      - alertname =~ ".+"
    equal: ["cluster"]
```

The third rule suppresses all alerts for a cluster when that cluster is unreachable; there is no point paging someone about pod restarts in a cluster they cannot reach.
Silences#
Silences temporarily mute alerts matching specific label matchers. Create them via the Alertmanager UI or amtool:
```bash
# Silence all warnings for the staging namespace for 2 hours
amtool silence add --alertmanager.url=http://localhost:9093 \
  severity=warning namespace=staging \
  --duration=2h \
  --comment="Deploying to staging, expected alerts"

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence early
amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>
```

Alert Templates#
Alertmanager uses Go templating. Templates control notification content and can be defined inline or in external template files.
```yaml
# alertmanager.yml
templates:
  - "/etc/alertmanager/templates/*.tmpl"
```

```
{{/* /etc/alertmanager/templates/slack.tmpl */}}
{{ define "slack.custom.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{- end }}

{{ define "slack.custom.text" -}}
{{ range .Alerts }}
{{ if eq .Status "firing" }}:fire:{{ else }}:white_check_mark:{{ end }} *{{ .Labels.alertname }}*
> {{ .Annotations.summary }}
> *Namespace:* `{{ .Labels.namespace }}` | *Pod:* `{{ .Labels.pod }}`
> *Value:* {{ .Annotations.value }}
> *Since:* {{ .StartsAt.Format "2006-01-02 15:04:05 MST" }}
{{ end }}
{{- end }}
```

Reference templates in your receiver config:
```yaml
receivers:
  - name: "team-slack"
    slack_configs:
      - channel: "#alerts"
        title: '{{ template "slack.custom.title" . }}'
        text: '{{ template "slack.custom.text" . }}'
```

High Availability#
Alertmanager supports HA through a gossip protocol. Run multiple instances and they coordinate deduplication so each alert is sent only once. All instances must share the same configuration.
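For deduplication to work, each Prometheus server should send alerts to every Alertmanager replica directly rather than through a load balancer, since every instance needs to see every alert. A minimal sketch of the Prometheus side (hostnames are illustrative):

```yaml
# prometheus.yml -- hostnames are assumptions
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0:9093
            - alertmanager-1:9093
```

Starting two peered instances by hand looks like this: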
```bash
# Instance 1
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094

# Instance 2
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-0:9094
```

In Kubernetes with kube-prometheus-stack, HA is configured via the Alertmanager CRD:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
spec:
  replicas: 3
  configSecret: alertmanager-config
```

Debugging Alert Delivery#
When alerts are not arriving where you expect:
```bash
# Check what alerts Alertmanager currently has
amtool alert query --alertmanager.url=http://localhost:9093

# Test routing: which receiver would this alert hit?
amtool config routes test --alertmanager.url=http://localhost:9093 \
  severity=critical team=database namespace=production

# Show the full routing tree
amtool config routes show --alertmanager.url=http://localhost:9093

# Validate config syntax
amtool check-config alertmanager.yml
```

The most common routing mistake is forgetting that continue: false is the default. Once an alert matches a route, it stops traversing the tree. Add continue: true to a route if an alert should also be checked against the sibling routes below it. This is useful when you want both Slack and PagerDuty to receive the same critical alert through separate routes.
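As a sketch of that pattern, reusing the receiver names from the examples above, the first critical route sets continue: true so the sibling route after it is still evaluated:

```yaml
route:
  receiver: "default-slack"
  routes:
    - match:
        severity: critical
      receiver: "team-slack"
      continue: true   # keep evaluating the sibling routes below
    - match:
        severity: critical
      receiver: "pagerduty-oncall"
```

Both routes match the same alert, so it is delivered to Slack and to PagerDuty.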