Rule Syntax#

Alerting rules live in rule files loaded by Prometheus. Each rule has a PromQL expression, an optional for duration, and optional labels and annotations.

groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          description: "Current error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

The for duration is critical. Without it, the alert fires on the first evaluation where the expression is true, so a single bad scrape can page someone. With for: 5m, the condition must stay true across every evaluation for 5 minutes before the alert fires; during that window the alert is in the pending state.
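
You can watch this transition directly: Prometheus exposes a synthetic ALERTS series for every active alert, with an alertstate label set to pending or firing. A quick query to see alerts still waiting out their for window:

ALERTS{alertname="HighErrorRate", alertstate="pending"}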

Labels drive routing in Alertmanager. Annotations carry human-readable context and are not used for routing. Always include a runbook_url annotation for critical alerts.
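
As an illustration, here is a minimal Alertmanager routing sketch keyed on the severity label; the receiver names, keys, and URLs are placeholders, not values from this setup:

route:
  receiver: slack-warnings          # default for anything not matched below
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
    - matchers:
        - severity="warning"
      receiver: slack-warnings

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts"
        api_url: <slack-webhook-url>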

Infrastructure Alert Patterns#

groups:
  - name: node-alerts
    rules:
      - alert: HighCPUUsage
        expr: |
          (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 85% on {{ $labels.instance }}"
          value: "{{ $value | printf \"%.1f\" }}%"

      - alert: DiskSpaceCritical
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay", mountpoint="/"}
           / node_filesystem_size_bytes{fstype!~"tmpfs|overlay", mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}:{{ $labels.mountpoint }}"

      - alert: DiskSpacePrediction
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 24h"

      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"

      - alert: CertificateExpiringSoon
        expr: |
          (x509_cert_not_after - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.cn }} expires in {{ $value | printf \"%.0f\" }} days"

Kubernetes Workload Alert Patterns#

groups:
  - name: kubernetes-workloads
    rules:
      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value | printf \"%.0f\" }} times in 1h"

      - alert: DeploymentReplicaMismatch
        expr: |
          kube_deployment_spec_replicas != kube_deployment_status_replicas_ready
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has {{ $labels.status_ready_replicas }}/{{ $labels.spec_replicas }} ready"

      - alert: PVCUsageHigh
        expr: |
          kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is {{ $value | printf \"%.0f\" }}% full"

      - alert: ContainerOOMKilled
        expr: |
          kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed"

      - alert: HighPodMemoryUsage
        expr: |
          sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
          / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="memory"})
          * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} using {{ $value | printf \"%.0f\" }}% of memory limit"

Application Alert Patterns#

groups:
  - name: application-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% for {{ $labels.job }}"

      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
          > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2s for {{ $labels.job }}"

      - alert: EndpointDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} ({{ $labels.job }}) is down"

Avoiding Alert Fatigue#

Alert fatigue is the number one reason monitoring fails. Engineers start ignoring alerts, and real incidents get lost in the noise.

Severity levels. Use two or three levels: critical pages immediately (PagerDuty), warning goes to Slack for business-hours review, info is dashboard-only. If most alerts are critical, none of them are.

Threshold discipline. CPU at 81% for 5 minutes is not an incident. CPU at 95% for 30 minutes probably is. Set for to at least 5 minutes for warnings. Use predict_linear() for trend-based alerts.

Group related alerts. Use Alertmanager’s group_by to batch by namespace or alertname. Five pods restarting should be one notification, not five pages.
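
A sketch of the relevant route settings; the timings here are illustrative starting points to tune, not recommendations for every environment:

route:
  group_by: ["alertname", "namespace"]
  group_wait: 30s        # collect alerts that fire together before the first notification
  group_interval: 5m     # pace follow-up notifications for an existing group
  repeat_interval: 4h    # re-notify if the group is still firing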

Dead-man switch. Create a “Watchdog” alert that always fires. If Alertmanager stops receiving this alert, your monitoring pipeline is broken:

- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: "Watchdog alert - monitoring pipeline is healthy"

Route this to a dead-man-switch service like PagerDuty’s heartbeat monitoring or Healthchecks.io.
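
One way to wire that up is a dedicated route and webhook receiver; the ping URL below is a placeholder for whatever your heartbeat service issues:

route:
  routes:
    - matchers:
        - alertname="Watchdog"
      receiver: deadman-switch
      repeat_interval: 1m    # keep pinging so silence means breakage

receivers:
  - name: deadman-switch
    webhook_configs:
      - url: https://hc-ping.com/<your-check-uuid>   # placeholder heartbeat endpoint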

Testing Rules with promtool#

Validate syntax before deploying:

promtool check rules alert-rules.yml

Write unit tests for your rules:

# alert-rules-test.yml
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", status_code="500"}'
        values: "0+10x20"     # 0, 10, 20, ... (+10 per 1m interval)
      - series: 'http_requests_total{job="api", status_code="200"}'
        values: "0+100x20"    # +100 per interval; error ratio ≈ 9%, above the 5% threshold
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api

Run the tests:

promtool test rules alert-rules-test.yml

PrometheusRule CRD#

With kube-prometheus-stack, deploy alerting rules as Kubernetes resources instead of editing Prometheus config files directly:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    release: monitoring  # must match Prometheus Operator's ruleSelector
spec:
  groups:
    - name: app-alerts
      rules:
        - alert: HighErrorRate
          expr: job:http_errors:ratio5m > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate above 5% for {{ $labels.job }}"

The release label must match prometheus.prometheusSpec.ruleSelector in your Helm values. The operator watches for PrometheusRule resources and reloads Prometheus automatically.
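
A sketch of the corresponding Helm values, assuming the chart's standard keys and a release named monitoring:

prometheus:
  prometheusSpec:
    # With the chart default ruleSelectorNilUsesHelmValues: true and no explicit
    # selector, rules labeled release: <helm-release-name> are picked up.
    # An explicit selector makes that contract visible:
    ruleSelector:
      matchLabels:
        release: monitoring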