When an Alert Should Fire but Does Not#
Silent alerts are the most dangerous failure mode in monitoring. The system appears healthy because no one is being paged, but the condition you intended to catch is actively occurring. Work through this checklist in order.
Step 1: Verify the Expression Returns Results#
Open the Prometheus UI at /graph and run the alert expression directly. If the expression returns empty, the alert cannot fire regardless of anything else.
# Run the exact expression from your alerting rule
job:http_errors:ratio5m > 0.05
Common reasons the expression returns nothing:
- The recording rule that produces `job:http_errors:ratio5m` is not evaluating. Check `/rules` in Prometheus for errors.
- Label matchers in the expression do not match any current time series. Metric labels may have changed after a deployment or scrape config update.
- The metric was renamed. Check the `/metrics` endpoint of the target directly with `curl` to verify the metric name. (Two command-line checks follow this list.)
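Two quick command-line checks that complement the first and last items; the rule file path and target address below are placeholders for your own setup:
# Validate the rule file that defines the recording rule (path is illustrative)
promtool check rules /etc/prometheus/rules/recording-rules.yml
# Confirm the target still exposes the metric under the expected name (address is illustrative)
curl -s http://app-server:8080/metrics | grep -i http_error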
Step 2: Check the for Duration#
The `for` field requires the condition to be continuously true across consecutive evaluations for the specified duration. This is the single most common reason an alert does not fire.
- alert: HighLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
  for: 15m
If latency spikes above 2 seconds for 12 minutes and then recovers, this alert never leaves the pending state. The `for` timer resets every time the condition becomes false. Check the Prometheus /alerts page – if the alert appears in pending but never reaches firing, the condition is intermittent relative to your `for` window.
Query the ALERTS and ALERTS_FOR_STATE metrics to see current alert states from PromQL:
# Show all currently firing alerts
ALERTS{alertstate="firing"}
# Show when alerts entered pending state
ALERTS_FOR_STATE
Step 3: Check Evaluation Interval#
The rule group’s evaluation interval determines how often Prometheus checks the expression. If your rule group evaluates every 60 seconds but the condition lasts only 45 seconds, it may be missed entirely.
groups:
  - name: latency-alerts
    interval: 15s  # evaluate every 15s instead of the default 60s
    rules:
      - alert: HighLatency
        expr: ...
        for: 1m
Check prometheus_rule_evaluation_failures_total and prometheus_rule_group_last_duration_seconds to detect evaluation problems:
# Rule groups that are failing to evaluate
increase(prometheus_rule_evaluation_failures_total[1h]) > 0
# Rule groups that take longer to evaluate than their interval
prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
If evaluation duration exceeds the interval, Prometheus skips evaluations and your alert may never trigger.
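One way to catch this class of problem automatically is a meta-alert on the rule engine itself. A sketch, with the alert name, threshold, and severity chosen for illustration:
- alert: RuleGroupMissedEvaluations
  # prometheus_rule_group_iterations_missed_total increments whenever an evaluation is skipped
  expr: increase(prometheus_rule_group_iterations_missed_total[1h]) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Rule group {{ $labels.rule_group }} is skipping evaluations"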
Step 4: Check Alertmanager Routing#
The alert fires in Prometheus but never reaches the intended receiver. This is a routing tree problem.
# Verify the alert reached Alertmanager
amtool alert query --alertmanager.url=http://localhost:9093
# Test which receiver a set of labels would match
amtool config routes test --alertmanager.url=http://localhost:9093 \
severity=critical namespace=production alertname=HighLatency
# Display the full routing tree
amtool config routes show --alertmanager.url=http://localhost:9093
Common routing failures:
- The alert matches an earlier route that catches it before reaching the intended one. Routes are evaluated top-down and stop at the first match unless `continue: true` is set (see the sketch after this list).
- The alert is silenced. Check `amtool silence query`.
- The alert is inhibited by another active alert. Review inhibition rules in `alertmanager.yml`.
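A minimal sketch of the first failure mode, with the receiver names and match conditions invented for illustration: a broad severity match placed above a more specific route swallows the alert unless `continue: true` lets evaluation fall through.
route:
  receiver: default
  routes:
    # Broad route: matches every critical alert first
    - match:
        severity: critical
      receiver: slack-critical
      continue: true   # without this, evaluation stops here and the route below is never reached
    # More specific route: only reached because the route above continues
    - match:
        alertname: HighLatency
      receiver: pagerduty-oncall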
Step 5: Check Receiver Connectivity#
The alert reaches Alertmanager and routes correctly, but the notification never arrives at Slack, PagerDuty, or the webhook endpoint.
Check Alertmanager logs for delivery errors:
# In Kubernetes
kubectl logs -n monitoring alertmanager-main-0 | grep -i "error\|fail\|retry"
Common receiver failures:
- Slack API token expired or channel was renamed/archived.
- PagerDuty integration key rotated but not updated in Alertmanager config.
- Webhook endpoint is returning 5xx errors or is unreachable due to network policy.
- TLS certificate verification failing for HTTPS receivers.
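Once the configuration side looks right, the delivery path can be exercised end to end by injecting a synthetic alert with amtool; the label values below are placeholders chosen to match whatever your routing tree expects:
# Inject a test alert directly into Alertmanager and watch for the notification
amtool alert add --alertmanager.url=http://localhost:9093 \
  alertname=DeliveryTest severity=critical namespace=production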
Debugging False Positives and Alert Fatigue#
False positives train your team to ignore alerts. Systematically reduce noise with these patterns.
Thresholds That Are Too Aggressive#
If an alert fires frequently but rarely indicates a real problem, the threshold does not reflect normal system behavior. Use the metric’s historical distribution to find the right value:
# Find the p95 of CPU usage over the past 2 weeks
quantile_over_time(0.95,
  (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))[14d:5m]
)
If p95 over 2 weeks is 72%, setting an alert threshold at 75% guarantees constant firing during normal operation. Set the warning at 85% and critical at 95%.
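Those two thresholds might translate into a warning/critical pair like the following sketch; the `for` durations are illustrative, and the 85% and 95% values come from the baseline above:
# Same alert name with two severity tiers
- alert: HighCPU
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
  for: 15m
  labels:
    severity: warning
- alert: HighCPU
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.95
  for: 5m
  labels:
    severity: critical
Using the same alert name for both tiers lets the critical-suppresses-warning inhibition rule shown later collapse them into a single page.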
Missing for Duration#
An alert with no `for` field fires on a single evaluation that crosses the threshold. Network blips, garbage collection pauses, or a single slow query can trigger it.
# Bad: fires on a single sample
- alert: HighMemory
  expr: container_memory_working_set_bytes > 1e9

# Good: must be sustained for 10 minutes
- alert: HighMemory
  expr: container_memory_working_set_bytes > 1e9
  for: 10m
Wrong Aggregation Level#
Alerting on per-pod metrics generates noise during rolling deployments when individual pods start and stop.
# Noisy: fires per pod during deployments
- alert: HighCPU
  expr: rate(container_cpu_usage_seconds_total{container!=""}[5m]) > 0.9

# Better: aggregate to deployment level
- alert: HighCPU
  expr: |
    sum by (namespace, deployment) (
      rate(container_cpu_usage_seconds_total{container!=""}[5m])
    ) / sum by (namespace, deployment) (
      kube_pod_container_resource_limits{resource="cpu"}
    ) > 0.85
  for: 15m
Deployment Noise#
Exclude pods that are shutting down to avoid alerts during normal rollouts:
# Filter out terminating pods
sum by (namespace, deployment) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
  * on (namespace, pod) group_left()
  (kube_pod_status_phase{phase="Running"} == 1)
)
Threshold Selection Strategies#
Statistical Approach#
Calculate percentiles over a representative time window (at least 2 weeks, including weekends and month-end peaks):
# Normal variance baseline
quantile_over_time(0.99,
  (
    sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum by (job) (rate(http_requests_total[5m]))
  )[14d:5m]
)
Set the warning at 2x the p99 and critical at 5x. This ensures the alert only fires for conditions truly outside the normal operating range.
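That ratio has the same shape as the `job:http_errors:ratio5m` recording rule queried in Step 1. A sketch of how such a rule might be defined, with the group name chosen for illustration:
groups:
  - name: error-ratio-recording
    rules:
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
Alerting on the recorded series keeps the alert expression short and lets thresholds be tuned against the same series you baseline.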
Symptom-Based Alerting#
Alert on what users experience, not on internal causes. CPU at 90% is not inherently a problem – p99 latency exceeding your SLO is.
# Cause-based (noisy, not actionable)
- alert: HighCPU
  expr: node_cpu_usage > 0.9

# Symptom-based (actionable, user-facing)
- alert: SLOLatencyBreach
  expr: |
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
    > 0.5
  for: 10m
  annotations:
    summary: "p99 latency exceeding 500ms SLO"
Environment-Specific Thresholds#
Use label matchers or separate rule files to avoid dev/staging alerts paging production on-call:
- alert: HighErrorRate
  expr: |
    sum by (job, environment) (rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum by (job, environment) (rate(http_requests_total[5m]))
    > 0.05
  for: 5m
  labels:
    severity: |-
      {{ if eq $labels.environment "production" }}critical{{ else }}warning{{ end }}
Alert Dependency Chains with Inhibition#
When a root cause triggers multiple alerts, use inhibition rules so the root cause alert suppresses the cascading symptoms. Without this, a single disk-full event can generate five or more simultaneous pages.
inhibit_rules:
  # Disk full suppresses all alerts from the same instance
  - source_matchers:
      - alertname = DiskSpaceCritical
    target_matchers:
      - severity =~ "warning|critical"
    equal: ["instance"]
  # Node down suppresses all pod-level alerts on that node
  - source_matchers:
      - alertname = NodeNotReady
    target_matchers:
      - alertname =~ "PodCrashLooping|HighMemory|HighCPU|ContainerOOMKilled"
    equal: ["node"]
  # Critical always suppresses warning for the same alert
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ["alertname", "namespace"]
Conditional Alerting by Time of Day#
Route the same alert to different receivers depending on the time of day. This lets you decide explicitly which conditions are worth waking someone at 3 AM for and which can wait until morning.
# Alertmanager time-based routing
route:
  routes:
    - match:
        severity: warning
      receiver: slack-channel
      active_time_intervals:
        - business-hours
      # Without continue, this route would match warning alerts outside business
      # hours as well (and mute them), so the PagerDuty route below would never be reached.
      continue: true
    - match:
        severity: warning
      receiver: pagerduty-oncall
      active_time_intervals:
        - outside-business-hours
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "09:00"
            end_time: "17:00"
  - name: outside-business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "00:00"
            end_time: "09:00"
          - start_time: "17:00"
            end_time: "24:00"
      - weekdays: ["saturday", "sunday"]
Alert Lifecycle Management#
Alerts degrade over time as systems evolve. Establish a quarterly review process.
Quarterly review checklist:
- Pull the list of all alerts that fired in the last 90 days from Alertmanager or your incident management system.
- For each alert, categorize: led to action, acknowledged but no action needed, ignored entirely.
- Alerts that were ignored more than 80% of the time are candidates for deletion or threshold adjustment.
- Alerts that never fired should be validated – run the expression manually and confirm it would fire under the expected failure condition (the rules API query after this checklist enumerates every configured alert).
- Check for coverage gaps: review post-incident reports from the quarter and verify an alert existed (or now exists) for each incident.
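A quick way to list every alerting rule currently loaded, so never-fired alerts can be cross-checked against the firing history below; the Prometheus URL is a placeholder and jq is assumed to be installed:
# List all configured alert names from the rules API
curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[] | select(.type=="alerting") | .name' | sort -u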
Track alert signal-to-noise ratio over time:
# Alerts that fired in the past 30 days, counted by alertname
sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d]))
Practical Investigation Checklist#
When you receive a report that “the alert should have fired but didn’t,” work through this sequence:
- Reproduce the expression. Run the alerting rule's `expr` in the Prometheus `/graph` UI for the time window in question. Does it return results?
- Check for stale series. If the metric stopped being emitted more than 5 minutes ago, Prometheus marks it stale. Use `timestamp()` to verify: `timestamp(my_metric) > (time() - 300)`.
- Check `for` vs actual duration. Look at the metric graph. Was the condition sustained for the full `for` duration without interruption?
- Check rule evaluation. Query `prometheus_rule_evaluation_failures_total` for the rule group. Any failures mean missed evaluations.
- Check Alertmanager receipt. Query `amtool alert query` (it shows currently active alerts; for a past window, check the ALERTS metric with the query after this list). If the alert is not present, the problem is on the Prometheus side.
- Check routing. Run `amtool config routes test` with the alert's labels.
- Check silences and inhibitions. Run `amtool silence query` and review inhibition rules.
- Check receiver logs. Examine Alertmanager logs for delivery errors to the configured receiver.
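For the Alertmanager-receipt step, a quick check of whether Prometheus ever promoted the alert to firing during a past window; the alert name and 6-hour window are placeholders:
# Returns a result only if the alert was firing at some point in the window
max_over_time(ALERTS{alertname="HighLatency", alertstate="firing"}[6h])
An empty result here, while the alert expression itself returns data, points back at the `for` duration or rule evaluation rather than at Alertmanager.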