When an Alert Should Fire but Does Not#
Silent alerts are the most dangerous failure mode in monitoring. The system appears healthy because no one is being paged, but the condition you intended to catch is actively occurring. Work through this checklist in order.
Step 1: Verify the Expression Returns Results#
Open the Prometheus UI at /graph and run the alert expression directly. If the expression returns empty, the alert cannot fire regardless of anything else.
# Run the exact expression from your alerting rule
job:http_errors:ratio5m > 0.05
Common reasons the expression returns nothing:
- The recording rule that produces `job:http_errors:ratio5m` is not evaluating. Check `/rules` in Prometheus for errors.
- Label matchers in the expression do not match any current time series. Metric labels may have changed after a deployment or scrape config update.
- The metric was renamed. Check the `/metrics` endpoint of the target directly with `curl` to verify the metric name. (Two command-line checks follow this list.)
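Two quick command-line checks that complement the first and last items; the rule file path and target address below are placeholders for your own setup:
# Validate the rule file that defines the recording rule (path is illustrative)
promtool check rules /etc/prometheus/rules/recording-rules.yml
# Confirm the target still exposes the metric under the expected name (address is illustrative)
curl -s http://app-server:8080/metrics | grep -i http_error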
Step 2: Check the for Duration#
The `for` field requires the condition to be continuously true across consecutive evaluations for the specified duration. This is the single most common reason an alert does not fire.
- alert: HighLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
  for: 15m
If latency spikes above 2 seconds for 12 minutes and then recovers, this alert never leaves the pending state. The `for` timer resets every time the condition becomes false. Check the Prometheus /alerts page – if the alert appears in pending but never reaches firing, the condition is intermittent relative to your `for` window.
Query the ALERTS and ALERTS_FOR_STATE metrics to see current alert states from PromQL:
# Show all currently firing alerts
ALERTS{alertstate="firing"}
# Show when alerts entered pending state
ALERTS_FOR_STATE
Step 3: Check Evaluation Interval#
The rule group’s evaluation interval determines how often Prometheus checks the expression. If your rule group evaluates every 60 seconds but the condition lasts only 45 seconds, it may be missed entirely.
groups:
  - name: latency-alerts
    interval: 15s  # evaluate every 15s instead of the default 60s
    rules:
      - alert: HighLatency
        expr: ...
        for: 1m
Check prometheus_rule_evaluation_failures_total and prometheus_rule_group_last_duration_seconds to detect evaluation problems:
# Rule groups that are failing to evaluate
increase(prometheus_rule_evaluation_failures_total[1h]) > 0
# Rule groups that take longer to evaluate than their interval
prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
If evaluation duration exceeds the interval, Prometheus skips evaluations and your alert may never trigger.
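One way to catch this class of problem automatically is a meta-alert on the rule engine itself. A sketch, with the alert name, threshold, and severity chosen for illustration:
- alert: RuleGroupMissedEvaluations
  # prometheus_rule_group_iterations_missed_total increments whenever an evaluation is skipped
  expr: increase(prometheus_rule_group_iterations_missed_total[1h]) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Rule group {{ $labels.rule_group }} is skipping evaluations"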
Step 4: Check Alertmanager Routing#
The alert fires in Prometheus but never reaches the intended receiver. This is a routing tree problem.
# Verify the alert reached Alertmanager
amtool alert query --alertmanager.url=http://localhost:9093
# Test which receiver a set of labels would match
amtool config routes test --alertmanager.url=http://localhost:9093 \
severity=critical namespace=production alertname=HighLatency
# Display the full routing tree
amtool config routes show --alertmanager.url=http://localhost:9093
Common routing failures:
- The alert matches an earlier route that catches it before reaching the intended one. Routes are evaluated top-down and stop at the first match unless `continue: true` is set (see the sketch after this list).
- The alert is silenced. Check `amtool silence query`.
- The alert is inhibited by another active alert. Review inhibition rules in `alertmanager.yml`.
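A minimal sketch of the first failure mode, with the receiver names and match conditions invented for illustration: a broad severity match placed above a more specific route swallows the alert unless `continue: true` lets evaluation fall through.
route:
  receiver: default
  routes:
    # Broad route: matches every critical alert first
    - match:
        severity: critical
      receiver: slack-critical
      continue: true   # without this, evaluation stops here and the route below is never reached
    # More specific route: only reached because the route above continues
    - match:
        alertname: HighLatency
      receiver: pagerduty-oncall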
Step 5: Check Receiver Connectivity#
The alert reaches Alertmanager and routes correctly, but the notification never arrives at Slack, PagerDuty, or the webhook endpoint.
Check Alertmanager logs for delivery errors:
# In Kubernetes
kubectl logs -n monitoring alertmanager-main-0 | grep -i "error\|fail\|retry"
Common receiver failures:
- Slack API token expired or channel was renamed/archived.
- PagerDuty integration key rotated but not updated in Alertmanager config.
- Webhook endpoint is returning 5xx errors or is unreachable due to network policy.
- TLS certificate verification failing for HTTPS receivers.
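Once the configuration side looks right, the delivery path can be exercised end to end by injecting a synthetic alert with amtool; the label values below are placeholders chosen to match whatever your routing tree expects:
# Inject a test alert directly into Alertmanager and watch for the notification
amtool alert add --alertmanager.url=http://localhost:9093 \
  alertname=DeliveryTest severity=critical namespace=production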
Debugging False Positives and Alert Fatigue#
False positives train your team to ignore alerts. Systematically reduce noise with these patterns.
Thresholds That Are Too Aggressive#
If an alert fires frequently but rarely indicates a real problem, the threshold does not reflect normal system behavior. Use the metric’s historical distribution to find the right value:
# Find the p95 of CPU usage over the past 2 weeks
quantile_over_time(0.95,
  (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))[14d:5m]
)
If p95 over 2 weeks is 72%, setting an alert threshold at 75% guarantees constant firing during normal operation. Set the warning at 85% and critical at 95%.
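Those two thresholds might translate into a warning/critical pair like the following sketch; the `for` durations are illustrative, and the 85% and 95% values come from the baseline above:
# Same alert name with two severity tiers
- alert: HighCPU
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
  for: 15m
  labels:
    severity: warning
- alert: HighCPU
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.95
  for: 5m
  labels:
    severity: critical
Using the same alert name for both tiers lets the critical-suppresses-warning inhibition rule shown later collapse them into a single page.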
Missing for Duration#
An alert with no `for` field fires on a single evaluation that crosses the threshold. Network blips, garbage collection pauses, or a single slow query can trigger it.
# Bad: fires on a single sample
- alert: HighMemory
  expr: container_memory_working_set_bytes > 1e9

# Good: must be sustained for 10 minutes
- alert: HighMemory
  expr: container_memory_working_set_bytes > 1e9
  for: 10m
Wrong Aggregation Level#
Alerting on per-pod metrics generates noise during rolling deployments when individual pods start and stop.
# Noisy: fires per pod during deployments
- alert: HighCPU
  expr: rate(container_cpu_usage_seconds_total{container!=""}[5m]) > 0.9

# Better: aggregate to deployment level
- alert: HighCPU
  expr: |
    sum by (namespace, deployment) (
      rate(container_cpu_usage_seconds_total{container!=""}[5m])
    ) / sum by (namespace, deployment) (
      kube_pod_container_resource_limits{resource="cpu"}
    ) > 0.85
  for: 15m
Deployment Noise#
Exclude pods that are shutting down to avoid alerts during normal rollouts:
# Filter out terminating pods
sum by (namespace, deployment) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
  * on (namespace, pod) group_left()
  (kube_pod_status_phase{phase="Running"} == 1)
)
Threshold Selection Strategies#
Statistical Approach#
Calculate percentiles over a representative time window (at least 2 weeks, including weekends and month-end peaks):
# Normal variance baseline
quantile_over_time(0.99,
  (
    sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum by (job) (rate(http_requests_total[5m]))
  )[14d:5m]
)
Set the warning at 2x the p99 and critical at 5x. This ensures the alert only fires for conditions truly outside the normal operating range.
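That ratio has the same shape as the `job:http_errors:ratio5m` recording rule queried in Step 1. A sketch of how such a rule might be defined, with the group name chosen for illustration:
groups:
  - name: error-ratio-recording
    rules:
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
Alerting on the recorded series keeps the alert expression short and lets thresholds be tuned against the same series you baseline.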
Symptom-Based Alerting#
Alert on what users experience, not on internal causes. CPU at 90% is not inherently a problem – p99 latency exceeding your SLO is.
# Cause-based (noisy, not actionable)
- alert: HighCPU
  expr: node_cpu_usage > 0.9

# Symptom-based (actionable, user-facing)
- alert: SLOLatencyBreach
  expr: |
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
    > 0.5
  for: 10m
  annotations:
    summary: "p99 latency exceeding 500ms SLO"
Environment-Specific Thresholds#
Use label matchers or separate rule files to avoid dev/staging alerts paging production on-call:
- alert: HighErrorRate
  expr: |
    sum by (job, environment) (rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum by (job, environment) (rate(http_requests_total[5m]))
    > 0.05
  for: 5m
  labels:
    severity: |-
      {{ if eq $labels.environment "production" }}critical{{ else }}warning{{ end }}
Alert Dependency Chains with Inhibition#
When a root cause triggers multiple alerts, use inhibition rules so the root cause alert suppresses the cascading symptoms. Without this, a single disk-full event can generate five or more simultaneous pages.
inhibit_rules:
  # Disk full suppresses all alerts from the same instance
  - source_matchers:
      - alertname = DiskSpaceCritical
    target_matchers:
      - severity =~ "warning|critical"
    equal: ["instance"]
  # Node down suppresses all pod-level alerts on that node
  - source_matchers:
      - alertname = NodeNotReady
    target_matchers:
      - alertname =~ "PodCrashLooping|HighMemory|HighCPU|ContainerOOMKilled"
    equal: ["node"]
  # Critical always suppresses warning for the same alert
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ["alertname", "namespace"]
Conditional Alerting by Time of Day#
Route the same alert to different receivers depending on the time of day. This lets you decide explicitly which conditions are worth waking someone at 3 AM for and which can wait until morning.
# Alertmanager time-based routing
route:
  routes:
    - match:
        severity: warning
      receiver: slack-channel
      active_time_intervals:
        - business-hours
      # Without continue, this route would match warning alerts outside business
      # hours as well (and mute them), so the PagerDuty route below would never be reached.
      continue: true
    - match:
        severity: warning
      receiver: pagerduty-oncall
      active_time_intervals:
        - outside-business-hours
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "09:00"
            end_time: "17:00"
  - name: outside-business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "00:00"
            end_time: "09:00"
          - start_time: "17:00"
            end_time: "24:00"
      - weekdays: ["saturday", "sunday"]
Alert Lifecycle Management#
Alerts degrade over time as systems evolve. Establish a quarterly review process.
Quarterly review checklist:
- Pull the list of all alerts that fired in the last 90 days from Alertmanager or your incident management system.
- For each alert, categorize: led to action, acknowledged but no action needed, ignored entirely.
- Alerts that were ignored more than 80% of the time are candidates for deletion or threshold adjustment.
- Alerts that never fired should be validated – run the expression manually and confirm it would fire under the expected failure condition (the rules API query after this checklist enumerates every configured alert).
- Check for coverage gaps: review post-incident reports from the quarter and verify an alert existed (or now exists) for each incident.
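A quick way to list every alerting rule currently loaded, so never-fired alerts can be cross-checked against the firing history below; the Prometheus URL is a placeholder and jq is assumed to be installed:
# List all configured alert names from the rules API
curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[] | select(.type=="alerting") | .name' | sort -u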
Track alert signal-to-noise ratio over time:
# Alerts that fired in the past 30 days, counted by alertname
sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d]))
Practical Investigation Checklist#
When you receive a report that “the alert should have fired but didn’t,” work through this sequence:
- Reproduce the expression. Run the alerting rule's `expr` in the Prometheus `/graph` UI for the time window in question. Does it return results?
- Check for stale series. If the metric stopped being emitted more than 5 minutes ago, Prometheus marks it stale. Use `timestamp()` to verify: `timestamp(my_metric) > (time() - 300)`.
- Check `for` vs actual duration. Look at the metric graph. Was the condition sustained for the full `for` duration without interruption?
- Check rule evaluation. Query `prometheus_rule_evaluation_failures_total` for the rule group. Any failures mean missed evaluations.
- Check Alertmanager receipt. Query `amtool alert query` (it shows currently active alerts; for a past window, check the ALERTS metric with the query after this list). If the alert is not present, the problem is on the Prometheus side.
- Check routing. Run `amtool config routes test` with the alert's labels.
- Check silences and inhibitions. Run `amtool silence query` and review inhibition rules.
- Check receiver logs. Examine Alertmanager logs for delivery errors to the configured receiver.
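For the Alertmanager-receipt step, a quick check of whether Prometheus ever promoted the alert to firing during a past window; the alert name and 6-hour window are placeholders:
# Returns a result only if the alert was firing at some point in the window
max_over_time(ALERTS{alertname="HighLatency", alertstate="firing"}[6h])
An empty result here, while the alert expression itself returns data, points back at the `for` duration or rule evaluation rather than at Alertmanager.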