On-Call Is a System, Not a Schedule#

On-call done wrong burns out engineers and degrades reliability simultaneously. Exhausted responders make worse decisions, and teams that dread on-call avoid owning production systems. Done right, on-call is sustainable, well-compensated, and generates signal that drives real reliability improvements.

Rotation Schedule Types#

Weekly Rotation#

Each engineer is primary on-call for one full week, Monday to Monday. This is the simplest model and works for teams of 5 or more in a single timezone.

Week 1: Alice (primary), Bob (secondary)
Week 2: Bob (primary), Carol (secondary)
Week 3: Carol (primary), Dave (secondary)
Week 4: Dave (primary), Alice (secondary)

Advantages: simple to understand, clear ownership. Disadvantages: 7 consecutive days of interrupted sleep if nighttime pages are common. Only viable if your alert volume is low enough that on-call engineers can still do project work during the day.
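
A minimal sketch of this weekly rotation in the same PagerDuty-style notation as the follow-the-sun example below, assuming the four-person rotation above; the schedule name and the Monday 09:00 UTC handoff time are placeholders.

# PagerDuty-style weekly rotation (sketch; names and handoff time are placeholders)
schedule:
  name: "payment-api-weekly"
  layers:
    - name: "Primary"
      rotation_virtual_start: "2026-01-05T09:00:00Z"  # Monday 09:00 UTC handoff
      rotation_turn_length_seconds: 604800  # 1 week
      users: [alice, bob, carol, dave]
    - name: "Secondary"
      rotation_virtual_start: "2026-01-05T09:00:00Z"
      rotation_turn_length_seconds: 604800
      users: [bob, carol, dave, alice]  # shifted by one so the secondary always differs from the primary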

Follow-the-Sun#

Split on-call across timezones so no engineer is paged during sleeping hours. Each regional team covers roughly its local business day: in the config below, the US team covers 09:00-20:00 ET, the EU team covers 08:00-15:00 CET, and an APAC team picks up the remaining overnight window, with handoff times chosen so the three windows tile a full 24 hours in UTC.

# PagerDuty follow-the-sun schedule
schedule:
  name: "payment-api-follow-the-sun"
  layers:
    - name: "US Hours"
      rotation_virtual_start: "2026-01-01T14:00:00Z"  # 09:00 ET
      rotation_turn_length_seconds: 604800  # 1 week
      start: "14:00"  # UTC
      end: "01:00"    # UTC (next day, = 20:00 ET)
      users: [alice, bob, carol]
    - name: "EU Hours"
      rotation_virtual_start: "2026-01-01T07:00:00Z"  # 08:00 CET
      rotation_turn_length_seconds: 604800
      start: "07:00"
      end: "14:00"
      users: [dieter, eva, franz]
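    - name: "APAC Hours"
      # Fills the remaining 01:00-07:00 UTC window; user names are placeholders
      rotation_virtual_start: "2026-01-01T01:00:00Z"  # 09:00 SGT
      rotation_turn_length_seconds: 604800
      start: "01:00"
      end: "07:00"
      users: [gita, hiro, ines]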

Requires at least 2-3 engineers per timezone. If you have a small team in one timezone, follow-the-sun is not an option – do not fake it by making one person cover overnight alone.

Hybrid Model#

Weekly primary with follow-the-sun override for overnight hours. The weekly primary handles business hours and is the escalation point. A shared overnight rotation covers sleeping hours with a higher severity threshold – only SEV-1 pages go out overnight.
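
A hedged sketch of the hybrid model in the same notation as the schedules above: a weekly layer for the business day plus a shared overnight layer. The names, times, and the severity filter shown as pseudo-config are illustrative assumptions; the actual mechanism for holding back non-SEV-1 pages overnight (event rules, urgency settings, or a separate low-urgency service) depends on your tooling.

# Hybrid model sketch (illustrative names, times, and pseudo-config filter)
schedule:
  name: "payment-api-hybrid"
  layers:
    - name: "Weekly Primary (business hours)"
      rotation_turn_length_seconds: 604800  # 1 week
      start: "13:00"  # UTC, ~08:00 ET
      end: "01:00"    # UTC, ~20:00 ET
      users: [alice, bob, carol, dave]
    - name: "Shared Overnight"
      rotation_turn_length_seconds: 604800
      start: "01:00"  # UTC
      end: "13:00"    # UTC
      users: [alice, bob, carol, dave, eva, franz]  # wider pool spreads overnight load

# Pseudo-config: only SEV-1 may page the overnight layer
overnight_filter:
  window: "01:00-13:00 UTC"
  page_if: "severity == 'SEV-1'"
  otherwise: "hold until business hours"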

Escalation Policies#

Every page must have an escalation path. If the primary does not acknowledge within a defined window, escalate automatically.

# PagerDuty escalation policy
escalation_policy:
  name: "payment-api"
  rules:
    - escalation_delay_in_minutes: 5
      targets:
        - type: "schedule_reference"
          id: "primary-oncall-schedule"
    - escalation_delay_in_minutes: 10
      targets:
        - type: "schedule_reference"
          id: "secondary-oncall-schedule"
    - escalation_delay_in_minutes: 15
      targets:
        - type: "user_reference"
          id: "engineering-manager"
  repeat_enabled: true
  num_loops: 2

Key design decisions:

  • Acknowledgment window: 5 minutes for SEV-1, 15 minutes for SEV-2, 30 minutes for SEV-3 (one way to encode these tiers is sketched after this list).
  • Secondary on-call: Always have one. The secondary is not for overflow – they are the backup when the primary is unreachable.
  • Manager escalation: After primary and secondary are exhausted, escalate to the engineering manager. They may not fix it, but they will find someone who can.
  • Never escalate to the entire team. Paging everyone creates a bystander effect: nobody responds because everyone assumes someone else will.
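
One way to encode the tiered acknowledgment windows, sketched in the same style as the policy above, is a separate escalation policy per severity with the first escalation delay set to that tier's ack window; the policy names and IDs are assumptions about how you might wire this up, not the only mechanism.

# Per-severity escalation policies (sketch; names and IDs are placeholders)
escalation_policies:
  - name: "payment-api-sev1"
    rules:
      - escalation_delay_in_minutes: 5    # SEV-1 ack window
        targets: [{type: "schedule_reference", id: "primary-oncall-schedule"}]
      - escalation_delay_in_minutes: 10
        targets: [{type: "schedule_reference", id: "secondary-oncall-schedule"}]
  - name: "payment-api-sev2"
    rules:
      - escalation_delay_in_minutes: 15   # SEV-2 ack window
        targets: [{type: "schedule_reference", id: "primary-oncall-schedule"}]
  - name: "payment-api-sev3"
    rules:
      - escalation_delay_in_minutes: 30   # SEV-3 ack window
        targets: [{type: "schedule_reference", id: "primary-oncall-schedule"}]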

Handoff Procedures#

A sloppy handoff loses context and causes repeated pages for known issues. Structured handoffs take 15 minutes and prevent hours of wasted investigation.

## On-Call Handoff Template

**Date**: 2026-02-17 → 2026-02-24
**Outgoing**: Alice  |  **Incoming**: Bob

### Active Issues
- payment-api: Elevated p99 latency (tracking in JIRA-4521).
  Root cause identified as connection pool saturation. Fix deployed
  but monitoring for recurrence. If it recurs, restart pod and escalate.
- search-service: Intermittent 503s from upstream provider.
  Not actionable on our side. Suppress alert if < 1% error rate.

### Recent Changes
- 2026-02-19: Deployed auth-service v2.4.1 (new rate limiter)
- 2026-02-20: Database failover test completed successfully

### Alert Tuning Notes
- "disk-usage-critical" alert on logging-node-3 is a known false
  positive. Ticket JIRA-4530 open to fix threshold.

### Environment Notes
- Staging environment is down for maintenance until Wednesday.

Conduct handoffs synchronously – a 15-minute video call or in-person meeting. Async handoff documents are necessary but not sufficient. The outgoing on-call should walk through active issues and any unusual system behavior.

On-Call Compensation#

Uncompensated on-call is a retention disaster. Common models:

| Model            | Description                         | Typical Range              |
|------------------|-------------------------------------|----------------------------|
| Flat stipend     | Fixed pay per on-call shift         | $500-1500/week             |
| Per-page bonus   | Additional pay per page received    | $50-200/page               |
| Time-off-in-lieu | Comp day after on-call week         | 1 day per week on-call     |
| Hybrid           | Base stipend + per-page + comp time | Most common at mature orgs |

Non-negotiable: if someone is woken up at 3 AM, they should not be expected in standup at 9 AM. Allow late starts or half-days after nighttime pages. This is not generous – it is necessary for sustained performance.

Alert Fatigue Mitigation#

Alert fatigue is the top on-call quality killer. When engineers receive too many alerts, they stop responding to any of them effectively.

Concrete anti-fatigue measures:

  • Target: Fewer than 2 pages per on-call shift per day. More than that indicates alert tuning problems.
  • Review every alert weekly: If an alert fired and required no human action, it should be automated or deleted.
  • Severity-based routing: Only SEV-1 and SEV-2 page humans. SEV-3 goes to a Slack channel. SEV-4 goes to a ticket queue (a routing sketch follows the suppression example below).
  • Aggregate related alerts: If 10 pods restart within 5 minutes, send one alert about the deployment, not 10 individual pod alerts.
  • Snooze and suppress: Allow on-call to snooze known-issue alerts for a bounded time window (max 24 hours) with a required ticket link.

# OpsGenie alert policy to suppress noise
alert_policy:
  name: "suppress-known-pod-restarts"
  conditions:
    - field: "alias"
      operation: "contains"
      value: "pod-restart"
    - field: "tags"
      operation: "contains"
      value: "known-issue-JIRA-4530"
  actions:
    - type: "suppress"
      duration: "24h"
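
The severity-based routing rule from the list above can be sketched in the same style; the field names and actions below are illustrative assumptions, not a verbatim OpsGenie or PagerDuty API.

# Severity-based routing sketch (field names and actions are illustrative)
routing_policy:
  name: "route-by-severity"
  rules:
    - match: {field: "priority", operation: "matches", value: "P1|P2"}  # SEV-1 / SEV-2
      action: "page-oncall"      # phone/push to the on-call engineer
    - match: {field: "priority", operation: "equals", value: "P3"}
      action: "notify-slack"     # alerts channel, no page
    - match: {field: "priority", operation: "equals", value: "P4"}
      action: "create-ticket"    # lands in the ticket queue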

On-Call Quality Metrics#

Track these monthly to measure on-call health:

| Metric                        | Target           | Red Flag         |
|-------------------------------|------------------|------------------|
| Pages per shift               | < 2/day          | > 5/day          |
| Mean time to acknowledge      | < 5 min          | > 15 min         |
| Mean time to resolve          | < 30 min (SEV-2) | > 2 hours        |
| Nighttime pages (11PM-7AM)    | < 1/week         | > 3/week         |
| False positive rate           | < 10%            | > 30%            |
| Escalation rate               | < 5%             | > 20%            |
| On-call satisfaction (survey) | > 7/10           | < 5/10           |

Build a Grafana dashboard pulling from PagerDuty or OpsGenie APIs. Review it in your monthly reliability review. If nighttime pages are consistently high, that is a reliability problem to fix, not an on-call problem to staff around.

The healthiest on-call rotations are boring. Engineers carry the pager, handle the rare real issue promptly, and spend most of their shift doing normal engineering work. If on-call is exciting, something is broken.