On-Call Is a System, Not a Schedule#

On-call done wrong burns out engineers and degrades reliability simultaneously. Exhausted responders make worse decisions, and teams that dread on-call avoid owning production systems. Done right, on-call is sustainable, well-compensated, and generates signal that drives real reliability improvements.

Rotation Schedule Types#

Weekly Rotation#

Each engineer is primary on-call for one full week, Monday to Monday. This is the simplest model and works for teams of 5 or more in a single timezone.

Week 1: Alice (primary), Bob (secondary)
Week 2: Bob (primary), Carol (secondary)
Week 3: Carol (primary), Dave (secondary)
Week 4: Dave (primary), Alice (secondary)

Advantages: simple to understand, clear ownership. Disadvantages: 7 consecutive days of interrupted sleep if nighttime pages are common. Only viable if your alert volume is low enough that on-call engineers can still do project work during the day.
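
A minimal sketch of this weekly rotation in the same PagerDuty-style notation as the follow-the-sun example below, assuming the four-person rotation above; the schedule name and the Monday 09:00 UTC handoff time are placeholders.

# PagerDuty-style weekly rotation (sketch; names and handoff time are placeholders)
schedule:
  name: "payment-api-weekly"
  layers:
    - name: "Primary"
      rotation_virtual_start: "2026-01-05T09:00:00Z"  # Monday 09:00 UTC handoff
      rotation_turn_length_seconds: 604800  # 1 week
      users: [alice, bob, carol, dave]
    - name: "Secondary"
      rotation_virtual_start: "2026-01-05T09:00:00Z"
      rotation_turn_length_seconds: 604800
      users: [bob, carol, dave, alice]  # shifted by one so the secondary always differs from the primary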

Follow-the-Sun#

Split on-call across timezones so no engineer is paged during sleeping hours. Each regional team covers roughly its local business day: in the config below, the US team covers 09:00-20:00 ET, the EU team covers 08:00-15:00 CET, and an APAC team picks up the remaining overnight window, with handoff times chosen so the three windows tile a full 24 hours in UTC.

# PagerDuty follow-the-sun schedule
schedule:
  name: "payment-api-follow-the-sun"
  layers:
    - name: "US Hours"
      rotation_virtual_start: "2026-01-01T14:00:00Z"  # 09:00 ET
      rotation_turn_length_seconds: 604800  # 1 week
      start: "14:00"  # UTC
      end: "01:00"    # UTC (next day, = 20:00 ET)
      users: [alice, bob, carol]
    - name: "EU Hours"
      rotation_virtual_start: "2026-01-01T07:00:00Z"  # 08:00 CET
      rotation_turn_length_seconds: 604800
      start: "07:00"
      end: "14:00"
      users: [dieter, eva, franz]
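    - name: "APAC Hours"
      # Fills the remaining 01:00-07:00 UTC window; user names are placeholders
      rotation_virtual_start: "2026-01-01T01:00:00Z"  # 09:00 SGT
      rotation_turn_length_seconds: 604800
      start: "01:00"
      end: "07:00"
      users: [gita, hiro, ines]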

Requires at least 2-3 engineers per timezone. If you have a small team in one timezone, follow-the-sun is not an option – do not fake it by making one person cover overnight alone.

Hybrid Model#

Weekly primary with follow-the-sun override for overnight hours. The weekly primary handles business hours and is the escalation point. A shared overnight rotation covers sleeping hours with a higher severity threshold – only SEV-1 pages go out overnight.
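
A hedged sketch of the hybrid model in the same notation as the schedules above: a weekly layer for the business day plus a shared overnight layer. The names, times, and the severity filter shown as pseudo-config are illustrative assumptions; the actual mechanism for holding back non-SEV-1 pages overnight (event rules, urgency settings, or a separate low-urgency service) depends on your tooling.

# Hybrid model sketch (illustrative names, times, and pseudo-config filter)
schedule:
  name: "payment-api-hybrid"
  layers:
    - name: "Weekly Primary (business hours)"
      rotation_turn_length_seconds: 604800  # 1 week
      start: "13:00"  # UTC, ~08:00 ET
      end: "01:00"    # UTC, ~20:00 ET
      users: [alice, bob, carol, dave]
    - name: "Shared Overnight"
      rotation_turn_length_seconds: 604800
      start: "01:00"  # UTC
      end: "13:00"    # UTC
      users: [alice, bob, carol, dave, eva, franz]  # wider pool spreads overnight load

# Pseudo-config: only SEV-1 may page the overnight layer
overnight_filter:
  window: "01:00-13:00 UTC"
  page_if: "severity == 'SEV-1'"
  otherwise: "hold until business hours"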

Escalation Policies#

Every page must have an escalation path. If the primary does not acknowledge within a defined window, escalate automatically.

# PagerDuty escalation policy
escalation_policy:
  name: "payment-api"
  rules:
    - escalation_delay_in_minutes: 5
      targets:
        - type: "schedule_reference"
          id: "primary-oncall-schedule"
    - escalation_delay_in_minutes: 10
      targets:
        - type: "schedule_reference"
          id: "secondary-oncall-schedule"
    - escalation_delay_in_minutes: 15
      targets:
        - type: "user_reference"
          id: "engineering-manager"
  repeat_enabled: true
  num_loops: 2

Key design decisions:

  • Acknowledgment window: 5 minutes for SEV-1, 15 minutes for SEV-2, 30 minutes for SEV-3 (one way to encode these tiers is sketched after this list).
  • Secondary on-call: Always have one. The secondary is not for overflow – they are the backup when the primary is unreachable.
  • Manager escalation: After primary and secondary are exhausted, escalate to the engineering manager. They may not fix it, but they will find someone who can.
  • Never escalate to the entire team. Paging everyone creates a bystander effect: nobody responds because everyone assumes someone else will.
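
One way to encode the tiered acknowledgment windows, sketched in the same style as the policy above, is a separate escalation policy per severity with the first escalation delay set to that tier's ack window; the policy names and IDs are assumptions about how you might wire this up, not the only mechanism.

# Per-severity escalation policies (sketch; names and IDs are placeholders)
escalation_policies:
  - name: "payment-api-sev1"
    rules:
      - escalation_delay_in_minutes: 5    # SEV-1 ack window
        targets: [{type: "schedule_reference", id: "primary-oncall-schedule"}]
      - escalation_delay_in_minutes: 10
        targets: [{type: "schedule_reference", id: "secondary-oncall-schedule"}]
  - name: "payment-api-sev2"
    rules:
      - escalation_delay_in_minutes: 15   # SEV-2 ack window
        targets: [{type: "schedule_reference", id: "primary-oncall-schedule"}]
  - name: "payment-api-sev3"
    rules:
      - escalation_delay_in_minutes: 30   # SEV-3 ack window
        targets: [{type: "schedule_reference", id: "primary-oncall-schedule"}]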

Handoff Procedures#

A sloppy handoff loses context and causes repeated pages for known issues. Structured handoffs take 15 minutes and prevent hours of wasted investigation.

## On-Call Handoff Template

**Date**: 2026-02-17 → 2026-02-24
**Outgoing**: Alice  |  **Incoming**: Bob

### Active Issues
- payment-api: Elevated p99 latency (tracking in JIRA-4521).
  Root cause identified as connection pool saturation. Fix deployed
  but monitoring for recurrence. If it recurs, restart pod and escalate.
- search-service: Intermittent 503s from upstream provider.
  Not actionable on our side. Suppress alert if < 1% error rate.

### Recent Changes
- 2026-02-19: Deployed auth-service v2.4.1 (new rate limiter)
- 2026-02-20: Database failover test completed successfully

### Alert Tuning Notes
- "disk-usage-critical" alert on logging-node-3 is a known false
  positive. Ticket JIRA-4530 open to fix threshold.

### Environment Notes
- Staging environment is down for maintenance until Wednesday.

Conduct handoffs synchronously – a 15-minute video call or in-person meeting. Async handoff documents are necessary but not sufficient. The outgoing on-call should walk through active issues and any unusual system behavior.

On-Call Compensation#

Uncompensated on-call is a retention disaster. Common models:

| Model            | Description                         | Typical Range              |
|------------------|-------------------------------------|----------------------------|
| Flat stipend     | Fixed pay per on-call shift         | $500-1500/week             |
| Per-page bonus   | Additional pay per page received    | $50-200/page               |
| Time-off-in-lieu | Comp day after on-call week         | 1 day per week on-call     |
| Hybrid           | Base stipend + per-page + comp time | Most common at mature orgs |

Non-negotiable: if someone is woken up at 3 AM, they should not be expected in standup at 9 AM. Allow late starts or half-days after nighttime pages. This is not generous – it is necessary for sustained performance.

Alert Fatigue Mitigation#

Alert fatigue is the top on-call quality killer. When engineers receive too many alerts, they stop responding to any of them effectively.

Concrete anti-fatigue measures:

  • Target: Fewer than 2 pages per on-call shift per day. More than that indicates alert tuning problems.
  • Review every alert weekly: If an alert fired and required no human action, it should be automated or deleted.
  • Severity-based routing: Only SEV-1 and SEV-2 page humans. SEV-3 goes to a Slack channel. SEV-4 goes to a ticket queue (a routing sketch follows the suppression example below).
  • Aggregate related alerts: If 10 pods restart within 5 minutes, send one alert about the deployment, not 10 individual pod alerts.
  • Snooze and suppress: Allow on-call to snooze known-issue alerts for a bounded time window (max 24 hours) with a required ticket link.

# OpsGenie alert policy to suppress noise
alert_policy:
  name: "suppress-known-pod-restarts"
  conditions:
    - field: "alias"
      operation: "contains"
      value: "pod-restart"
    - field: "tags"
      operation: "contains"
      value: "known-issue-JIRA-4530"
  actions:
    - type: "suppress"
      duration: "24h"
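
The severity-based routing rule from the list above can be sketched in the same style; the field names and actions below are illustrative assumptions, not a verbatim OpsGenie or PagerDuty API.

# Severity-based routing sketch (field names and actions are illustrative)
routing_policy:
  name: "route-by-severity"
  rules:
    - match: {field: "priority", operation: "matches", value: "P1|P2"}  # SEV-1 / SEV-2
      action: "page-oncall"      # phone/push to the on-call engineer
    - match: {field: "priority", operation: "equals", value: "P3"}
      action: "notify-slack"     # alerts channel, no page
    - match: {field: "priority", operation: "equals", value: "P4"}
      action: "create-ticket"    # lands in the ticket queue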

On-Call Quality Metrics#

Track these monthly to measure on-call health:

| Metric                        | Target           | Red Flag         |
|-------------------------------|------------------|------------------|
| Pages per shift               | < 2/day          | > 5/day          |
| Mean time to acknowledge      | < 5 min          | > 15 min         |
| Mean time to resolve          | < 30 min (SEV-2) | > 2 hours        |
| Nighttime pages (11PM-7AM)    | < 1/week         | > 3/week         |
| False positive rate           | < 10%            | > 30%            |
| Escalation rate               | < 5%             | > 20%            |
| On-call satisfaction (survey) | > 7/10           | < 5/10           |

Build a Grafana dashboard pulling from PagerDuty or OpsGenie APIs. Review it in your monthly reliability review. If nighttime pages are consistently high, that is a reliability problem to fix, not an on-call problem to staff around.

The healthiest on-call rotations are boring. Engineers carry the pager, handle the rare real issue promptly, and spend most of their shift doing normal engineering work. If on-call is exciting, something is broken.