## Rule Syntax
Alerting rules live in rule files loaded by Prometheus. Each rule has an expression, an optional `for` duration, labels, and annotations.
```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          description: "Current error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
```

The `for` duration is critical. Without it, a single bad scrape can trigger an alert. With `for: 5m`, the condition must be continuously true across all evaluations for 5 minutes before the alert fires. During this window the alert is in the pending state.
Labels drive routing in Alertmanager. Annotations carry human-readable context and are not used for routing. Always include a `runbook_url` annotation for critical alerts.
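To make that split concrete, here is a minimal Alertmanager routing sketch: `severity: critical` goes to a PagerDuty receiver, the `team: backend` label selects a team channel, and everything else falls back to a default Slack receiver. The receiver names, channels, and keys below are placeholders for illustration, not part of the rule file above.

```yaml
# alertmanager.yml (sketch) - receiver names, channels, and keys are placeholders
route:
  receiver: slack-default            # fallback receiver
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall     # pages immediately
    - matchers:
        - team = "backend"
      receiver: slack-backend        # routed purely on the team label

receivers:
  - name: slack-default
    slack_configs:                   # assumes a global slack_api_url is configured
      - channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: slack-backend
    slack_configs:
      - channel: "#backend-alerts"
```

Annotations such as `summary` and `runbook_url` never appear in this file; they only surface in the notification templates.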
## Infrastructure Alert Patterns
```yaml
groups:
  - name: node-alerts
    rules:
      - alert: HighCPUUsage
        expr: |
          (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 85% on {{ $labels.instance }}"
          value: "{{ $value | printf \"%.1f\" }}%"
      - alert: DiskSpaceCritical
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay", mountpoint="/"}
            / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}:{{ $labels.mountpoint }}"
      - alert: DiskSpacePrediction
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 24h"
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
      - alert: CertificateExpiringSoon
        expr: |
          (x509_cert_not_after - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.cn }} expires in {{ $value | printf \"%.0f\" }} days"
```

## Kubernetes Workload Alert Patterns
```yaml
groups:
  - name: kubernetes-workloads
    rules:
      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value | printf \"%.0f\" }} times in 1h"
      - alert: DeploymentReplicaMismatch
        expr: |
          kube_deployment_spec_replicas != kube_deployment_status_ready_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} ready replicas do not match the {{ $value | printf \"%.0f\" }} desired"
      - alert: PVCUsageHigh
        expr: |
          kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is {{ $value | printf \"%.0f\" }}% full"
      - alert: ContainerOOMKilled
        expr: |
          kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed"
      - alert: HighPodMemoryUsage
        expr: |
          sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
            / sum by (namespace, pod) (kube_pod_container_resource_limits{resource="memory"})
            * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} using {{ $value | printf \"%.0f\" }}% of memory limit"
```

## Application Alert Patterns
```yaml
groups:
  - name: application-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% for {{ $labels.job }}"
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
          > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2s for {{ $labels.job }}"
      - alert: EndpointDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} ({{ $labels.job }}) is down"
```

## Avoiding Alert Fatigue
Alert fatigue is the number one reason monitoring fails. Engineers start ignoring alerts, and real incidents get lost in the noise.
**Severity levels.** Use two or three levels: critical pages immediately (PagerDuty), warning goes to Slack for business-hours review, info is dashboard-only. If most alerts are critical, none of them are.
**Threshold discipline.** CPU at 81% for 5 minutes is not an incident. CPU at 95% for 30 minutes probably is. Set `for` to at least 5 minutes for warnings. Use `predict_linear()` for trend-based alerts.
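As a sketch of that discipline, the earlier `HighCPUUsage` warning rule can be paired with a stricter critical tier; the 95%/30m numbers below are illustrative, not a recommendation for every fleet:

```yaml
- alert: HighCPUUsageCritical
  expr: |
    (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 95
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "CPU above 95% for 30m on {{ $labels.instance }}"
```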
**Group related alerts.** Use Alertmanager’s `group_by` to batch by namespace or alertname. Five pods restarting should be one notification, not five pages.
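A minimal sketch of that grouping in the Alertmanager route (the timing values are common defaults, not recommendations):

```yaml
route:
  group_by: ["alertname", "namespace"]
  group_wait: 30s        # wait briefly so related alerts land in the same notification
  group_interval: 5m     # batch new alerts into an existing group
  repeat_interval: 4h    # re-notify for a group that is still firing
```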
**Dead-man switch.** Create a “Watchdog” alert that always fires. If Alertmanager stops receiving this alert, your monitoring pipeline is broken:
```yaml
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: "Watchdog alert - monitoring pipeline is healthy"
```

Route this to a dead-man-switch service like PagerDuty’s heartbeat monitoring or Healthchecks.io.
## Testing Rules with promtool
Validate syntax before deploying:
```bash
promtool check rules alert-rules.yml
```

Write unit tests for your rules:
```yaml
# alert-rules-test.yml
rule_files:
  - alert-rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", status_code="500"}'
        values: "0+10x20"
      - series: 'http_requests_total{job="api", status_code="200"}'
        values: "0+100x20"
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
```

Run the tests:
```bash
promtool test rules alert-rules-test.yml
```

## PrometheusRule CRD
With kube-prometheus-stack, deploy alerting rules as Kubernetes resources instead of editing Prometheus config files directly:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    release: monitoring   # must match Prometheus Operator's ruleSelector
spec:
  groups:
    - name: app-alerts
      rules:
        - alert: HighErrorRate
          expr: job:http_errors:ratio5m > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate above 5% for {{ $labels.job }}"
```

The `release` label must match `prometheus.prometheusSpec.ruleSelector` in your Helm values. The operator watches for PrometheusRule resources and reloads Prometheus automatically.