Why Runbooks Exist#
An on-call engineer paged at 3 AM has limited cognitive capacity. They may not be familiar with the specific service that is failing. They may have joined the team two weeks ago. A runbook bridges the gap between the alert firing and the correct human response. Without runbooks, incident response depends on tribal knowledge – the engineer who built the service and knows its failure modes. That engineer is on vacation when the incident hits.
Good runbooks reduce mean time to resolution (MTTR) by providing a structured path from symptom to diagnosis to remediation. They are not documentation for documentation’s sake. They are operational tools that get used under pressure.
Runbook Format#
Every runbook should follow a consistent structure. When an engineer opens a runbook at 3 AM, they should know exactly where to find each piece of information because every runbook is laid out the same way.
Header Section#
# Runbook: [Alert Name or Service Name]
**Service:** api-gateway
**Team:** platform-infrastructure
**Last Updated:** 2026-02-15
**Last Tested:** 2026-01-20
**Owner:** @jane.doe
## Quick Summary
One-sentence description of what this runbook covers and when to use it.
## Impact
What breaks when this service is unhealthy. User-facing impact in plain language.
- Users cannot log in
- API responses return 503
- Payment processing is delayed

The header establishes context immediately. The engineer knows what service they are dealing with, who owns it, how recently the runbook was validated, and what the user impact is. The impact section helps them calibrate urgency – “users cannot log in” triggers a different response level than “internal dashboard loads slowly.”
Diagnostic Steps#
Structure diagnostics as numbered steps with expected outputs. Each step should tell the engineer what to run, what to look for, and what each result means.
## Diagnosis
### Step 1: Check Service Health
\```bash
kubectl get pods -n api-gateway -l app=api-gateway
\```
**Expected:** All pods in Running state, READY shows matching container counts (e.g., 2/2).
**If pods are CrashLoopBackOff:** Jump to [CrashLoop Remediation](#crashloop).
**If pods are Running but not Ready:** Jump to [Readiness Probe Failures](#readiness).
**If pods are Pending:** Jump to [Scheduling Failures](#scheduling).
### Step 2: Check Error Rate
\```bash
# Open Grafana dashboard: https://grafana.internal/d/api-gateway-overview
# Or query Prometheus directly:
curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{job="api-gateway",code=~"5.."}[5m]))/sum(rate(http_requests_total{job="api-gateway"}[5m]))'
\```
**Expected:** Error rate below 0.1%.
**If error rate is above 1%:** Check upstream dependencies (Step 3).
**If error rate is above 10%:** This is a major incident. Escalate immediately per [Escalation Procedures](#escalation).

Each step branches based on what the engineer observes. This is the key difference between a runbook and documentation. Documentation describes how a system works. A runbook tells you what to do next based on what you see right now.
Remediation Actions#
Remediation steps must be concrete, copy-pasteable commands with clear descriptions of what they do and what side effects they have.
## Remediation
### <a name="crashloop"></a>CrashLoop Remediation
1. Check the most recent pod logs for the crash cause:
\```bash
kubectl logs -n api-gateway -l app=api-gateway --previous --tail=100
\```
2. If logs show OOMKilled:
\```bash
# Increase memory limit temporarily (will be reset on next deploy)
kubectl set resources deployment/api-gateway -n api-gateway \
--limits=memory=1Gi
\```
**Side effect:** Pod restarts with new memory limit. Existing connections are dropped.
**Follow-up required:** Update Helm values to persist the change.
3. If logs show database connection errors:
\```bash
# Verify database connectivity from within the cluster
kubectl exec -n api-gateway deploy/api-gateway -- \
pg_isready -h db-primary.database.svc -p 5432
\```
**If database is unreachable:** Escalate to database team. See [Database Runbook](link).

Always state side effects. “This command restarts the pod” is critical information during an active incident. The engineer needs to know whether running a command will cause additional user impact.
Escalation Procedures#
Escalation procedures define when and how to involve additional people. A clear escalation policy prevents two failure modes: under-escalation (the engineer struggles alone for an hour before asking for help) and over-escalation (the entire leadership chain is woken up for a minor issue).
Escalation Tiers#
## Escalation
### Tier 1: On-Call Engineer (0-15 minutes)
- Follow this runbook's diagnostic steps
- If the issue matches a known scenario, apply the documented remediation
- If the issue is resolved, document what happened in the incident channel
### Tier 2: Service Owner (15-30 minutes)
- **When to escalate:** Diagnosis does not match any documented scenario, or
remediation steps are not resolving the issue
- **Contact:** @api-gateway-oncall (PagerDuty rotation)
- **What to provide:** Current state, steps already taken, error messages observed
### Tier 3: Engineering Manager + Incident Commander (30+ minutes)
- **When to escalate:** User-facing impact has persisted for 30+ minutes, or
impact is escalating (error rate increasing, more services affected)
- **Contact:** Page the incident commander rotation
- **What to provide:** Timeline, current impact, resources needed
### Tier 4: Executive Notification (60+ minutes)
- **When to escalate:** Major outage affecting >50% of users, or
data loss is confirmed or suspected
- **Contact:** VP Engineering via incident bridge

Time-Based Escalation Rules#
Embed escalation triggers in the alerting system itself. If an alert has been firing for 30 minutes without acknowledgment, automatically escalate to the next tier. PagerDuty and Opsgenie both support escalation policies that handle this automatically.
```yaml
# PagerDuty escalation policy (conceptual)
escalation_policy:
  name: api-gateway
  rules:
    - targets:
        - type: schedule
          id: api-gateway-primary-oncall
      escalation_delay_in_minutes: 15
    - targets:
        - type: schedule
          id: api-gateway-secondary-oncall
      escalation_delay_in_minutes: 15
    - targets:
        - type: user
          id: engineering-manager
      escalation_delay_in_minutes: 30
```

Diagnostic Decision Trees#
Decision trees encode the diagnostic reasoning of your most experienced engineers into a followable path. They are the most valuable part of a runbook because they transfer expertise.
Building Effective Decision Trees#
Start with the alert or symptom, then branch on observable conditions. Each branch should lead to either another diagnostic step or a remediation action.
```
ALERT: APIHighErrorRate (>1% 5xx responses)

1. Are all pods Running and Ready?
   |
   +-- NO: Are pods CrashLoopBackOff?
   |   +-- YES --> Check logs for OOM or panic. Increase resources or rollback.
   |   +-- NO (Pending) --> Check node capacity. Are nodes full?
   |       +-- YES --> Scale node pool or evict low-priority pods.
   |       +-- NO --> Check pod events for scheduling constraints.
   |
   +-- YES: Is the error rate uniform across all pods?
       |
       +-- YES (all pods affected): Is the database healthy?
       |   +-- NO --> Database issue. Failover or escalate to DB team.
       |   +-- YES --> Was there a recent deployment?
       |       +-- YES --> Rollback: kubectl rollout undo deployment/api-gateway
       |       +-- NO --> Check upstream dependencies (auth service, cache)
       |
       +-- NO (specific pods): Is one pod on a degraded node?
           +-- YES --> Cordon node, delete affected pod (will reschedule).
           +-- NO --> Check pod-specific logs for clues.
```
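The degraded-node branch at the bottom of the tree maps to a couple of concrete commands. A minimal sketch (node and pod names are placeholders):

```bash
# Find which node each pod is running on
kubectl get pods -n api-gateway -l app=api-gateway -o wide

# Stop new pods from landing on the suspect node, then let the
# affected pod reschedule elsewhere
kubectl cordon <node-name>
kubectl delete pod <pod-name> -n api-gateway
```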
Common Diagnostic Scenarios#

These scenarios cover the most frequent incident patterns. Each one is a self-contained mini-runbook that can be referenced from the decision tree.
Scenario: Deployment-Caused Regression
### Deployment Regression
**Symptoms:** Error rate spike immediately following a deployment.
**Diagnosis time:** 2-5 minutes.
1. Confirm a deployment happened recently:
\```bash
kubectl rollout history deployment/api-gateway -n api-gateway
\```
2. Compare the current error rate to the rate before the deployment:
\```promql
rate(http_requests_total{job="api-gateway",code=~"5.."}[5m])
\```
Check the Grafana dashboard with the time range covering the deployment window.
3. Rollback:
\```bash
kubectl rollout undo deployment/api-gateway -n api-gateway
\```
4. Monitor error rate for 5 minutes after rollback (one way to watch it is sketched after this list).
5. Notify the deploying engineer and create a ticket for investigation.
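One way to watch the error rate called for in step 4 is to poll the same Prometheus query used in the diagnosis section. A sketch, assuming the Prometheus endpoint from earlier and that jq is available where the check runs:

```bash
# Poll the 5xx ratio every 30 seconds for 5 minutes after the rollback
for i in $(seq 10); do
  curl -s 'http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{job="api-gateway",code=~"5.."}[5m]))/sum(rate(http_requests_total{job="api-gateway"}[5m]))' \
    | jq -r '.data.result[0].value[1]'
  sleep 30
done
```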
Scenario: Upstream Dependency Failure

### Upstream Dependency Failure
**Symptoms:** Elevated error rate, logs show connection timeouts or 502/503 from
upstream services.
1. Identify which upstream is failing:
\```bash
kubectl logs -n api-gateway -l app=api-gateway --tail=200 | grep -E "timeout|502|503|connection refused"
\```
2. Check the upstream service's health:
\```bash
kubectl get pods -n <upstream-namespace> -l app=<upstream-service>
\```
3. If the upstream is down, check if they have an active incident.
4. If no active incident, page the upstream team.
5. If the upstream cannot be restored quickly, enable the circuit breaker or
degrade gracefully (return cached responses, disable the feature).
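Where a service mesh is in place, the circuit breaker in step 5 can be a configuration change rather than a code change. A sketch assuming Istio, with an illustrative upstream host and thresholds:

```bash
# Automatically eject unhealthy endpoints of a failing upstream (assumes Istio)
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: auth-service-circuit-breaker
  namespace: api-gateway
spec:
  host: auth-service.auth.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5   # eject an endpoint after 5 consecutive 5xx responses
      interval: 30s             # how often endpoints are evaluated
      baseEjectionTime: 60s     # minimum time an ejected endpoint stays out
      maxEjectionPercent: 50    # never eject more than half the endpoints
EOF
```

Pair this with the application-side fallback from step 5 (cached responses, disabled feature) so users see degraded behavior rather than errors.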
Scenario: Resource Exhaustion

### Resource Exhaustion (CPU or Memory)
**Symptoms:** Increased latency, pods being OOMKilled, high CPU throttling.
1. Check current resource usage:
\```bash
kubectl top pods -n api-gateway --sort-by=memory
\```
2. Check for OOMKill events:
\```bash
kubectl get events -n api-gateway --field-selector reason=OOMKilling --sort-by=.lastTimestamp
\```
3. Check CPU throttling:
\```promql
rate(container_cpu_cfs_throttled_seconds_total{namespace="api-gateway"}[5m])
\```
4. Temporary remediation -- increase limits:
\```bash
kubectl set resources deployment/api-gateway -n api-gateway \
--requests=cpu=500m,memory=512Mi \
--limits=cpu=1000m,memory=1Gi
\```
5. Create a ticket to investigate the root cause (memory leak, traffic spike,
missing pagination on a query).
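Before filing that ticket, it can help to distinguish a steady leak from a traffic-driven spike by looking at working-set memory over several hours. A sketch against the same Prometheus endpoint as earlier, assuming cAdvisor container metrics are scraped:

```bash
# Per-pod working-set memory over the last 6 hours at 5-minute resolution
# (uses GNU date and jq; steady growth under flat traffic suggests a leak)
curl -s 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=max by (pod) (container_memory_working_set_bytes{namespace="api-gateway",container="api-gateway"})' \
  --data-urlencode "start=$(date -d '6 hours ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=300' | jq '.data.result'
```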
Runbook Testing and Maintenance#

A runbook that was written 18 months ago and never updated is dangerous. Commands may reference services that have been renamed. Links may point to dashboards that no longer exist. Escalation contacts may have left the company.
Testing Schedule#
Runbooks should be tested quarterly at minimum. Testing means an engineer who did not write the runbook follows it step by step and verifies that every command works, every link resolves, and every expected output matches reality.
## Runbook Testing Checklist
- [ ] All kubectl commands execute successfully against current cluster
- [ ] All Grafana dashboard links resolve to existing dashboards
- [ ] All PromQL queries return data (metrics still exist with expected labels)
- [ ] Escalation contacts are current (no departed employees)
- [ ] PagerDuty/Opsgenie rotation IDs are valid
- [ ] Remediation commands have been verified in a staging environment
- [ ] Side effects of remediation commands are accurately described
- [ ] Decision tree branches cover scenarios seen in the last quarter's incidents
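Parts of this checklist lend themselves to automation. A rough sketch that flags broken links in a runbook file, assuming the runbook is a local markdown file and its links are reachable from where the script runs (the file path is illustrative):

```bash
#!/usr/bin/env bash
# Extract every http(s) URL from a runbook and flag any that no longer resolve.
runbook="${1:-runbooks/api-gateway-high-error-rate.md}"

grep -oE 'https?://[^ )"]+' "$runbook" | sort -u | while read -r url; do
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  if [ "$status" -lt 200 ] || [ "$status" -ge 400 ]; then
    echo "BROKEN ($status): $url"
  fi
done
```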
Maintenance Triggers#

Update runbooks when any of these events occur:
- A post-mortem identifies a gap in the runbook (a scenario that was not covered).
- A service is renamed, moved to a different namespace, or re-architected.
- The monitoring stack changes (new dashboards, new alert names, new metrics).
- An escalation contact changes roles or leaves.
- A remediation step no longer works or has new side effects.
Assign runbook ownership to the team that owns the service. Include a “Last Tested” date in the header. If a runbook has not been tested in 6 months, treat it as untrusted and schedule a review.
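A periodic sweep can surface runbooks that have crossed that six-month line, assuming each file carries the Last Tested header shown earlier (the directory layout is illustrative):

```bash
# Flag runbooks whose "**Last Tested:**" date is more than ~6 months old
# (bash and GNU date assumed)
cutoff=$(date -d '6 months ago' +%Y-%m-%d)
for f in runbooks/*.md; do
  tested=$(grep -oE 'Last Tested:\*\* *[0-9]{4}-[0-9]{2}-[0-9]{2}' "$f" \
    | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}')
  if [ -z "$tested" ] || [[ "$tested" < "$cutoff" ]]; then
    echo "NEEDS REVIEW: $f (last tested: ${tested:-unknown})"
  fi
done
```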
Integration with Alerting Systems#
The highest-value integration is linking alerts directly to their corresponding runbooks. When an engineer receives a page, the runbook link should be one click away.
Alertmanager Annotations#
```yaml
groups:
  - name: api-gateway-alerts
    rules:
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api-gateway",code=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api-gateway"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API gateway error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/api-gateway-high-error-rate"
          dashboard_url: "https://grafana.internal/d/api-gw-overview?orgId=1"
```
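Before shipping a change to a rule file like this one, it can be validated with promtool, which ships with Prometheus (the file name is illustrative):

```bash
# Catches YAML and PromQL syntax errors before the rule ever reaches Prometheus
promtool check rules api-gateway-alerts.yml
```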
PagerDuty and Opsgenie Integration#

Both PagerDuty and Opsgenie can display custom fields from alert annotations. Configure Alertmanager to forward the runbook_url annotation so it appears directly in the incident notification.
```yaml
# Alertmanager config excerpt
receivers:
  - name: pagerduty-platform
    pagerduty_configs:
      - routing_key: "<key>"
        description: '{{ .CommonAnnotations.summary }}'
        details:
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          dashboard: '{{ .CommonAnnotations.dashboard_url }}'
          firing_since: '{{ .CommonLabels.alertname }} firing since {{ .StartsAt }}'
```

When the engineer receives the PagerDuty notification on their phone, the runbook URL is right there. No searching through wikis or Slack history. This single integration consistently reduces MTTR by several minutes per incident, which compounds across hundreds of incidents per year.
Runbook Template#
Use this template as the starting point for every new runbook.
# Runbook: [Service/Alert Name]
**Service:** [service name]
**Team:** [owning team]
**Last Updated:** [date]
**Last Tested:** [date]
**Owner:** [person]
## Quick Summary
[One sentence: when does this runbook apply?]
## Impact
[What breaks? Who is affected?]
## Diagnosis
### Step 1: [First check]
[Command to run]
[Expected output]
[Branching: if X, go to step Y. If Z, go to remediation A.]
### Step 2: [Second check]
[Same structure]
## Remediation
### Scenario A: [Name]
[Steps with commands, expected outcomes, and side effects]
### Scenario B: [Name]
[Steps with commands, expected outcomes, and side effects]
## Escalation
[Tier 1-4 with contacts, timing, and what information to provide]
## References
- Grafana dashboard: [link]
- Architecture diagram: [link]
- Related runbooks: [links]
- Service repository: [link]

Consistency in format is more important than any specific section. When every runbook looks the same, engineers build muscle memory for navigating them under pressure. That muscle memory is what saves minutes during incidents.