On-Call Is a System, Not a Schedule#
On-call done wrong burns out engineers and degrades reliability simultaneously. Exhausted responders make worse decisions, and teams that dread on-call avoid owning production systems. Done right, on-call is sustainable, well-compensated, and generates signal that drives real reliability improvements.
Rotation Schedule Types#
Weekly Rotation#
Each engineer is primary on-call for one full week, Monday to Monday. This is the simplest model and works for teams of 5 or more in a single timezone.
```
Week 1: Alice (primary), Bob (secondary)
Week 2: Bob (primary), Carol (secondary)
Week 3: Carol (primary), Dave (secondary)
Week 4: Dave (primary), Alice (secondary)
```

Advantages: simple to understand, clear ownership. Disadvantages: 7 consecutive days of interrupted sleep if nighttime pages are common. Only viable if your alert volume is low enough that on-call engineers can still do project work during the day.
Follow-the-Sun#
Split on-call across timezones so no engineer is paged during sleeping hours. A US team covers 08:00-20:00 US time, an EU team covers 08:00-20:00 EU time, and an APAC team covers the remaining window.
```yaml
# PagerDuty follow-the-sun schedule
schedule:
  name: "payment-api-follow-the-sun"
  layers:
    - name: "US Hours"
      rotation_virtual_start: "2026-01-01T14:00:00Z"  # 09:00 ET
      rotation_turn_length_seconds: 604800            # 1 week
      start: "14:00"  # UTC
      end: "01:00"    # UTC (next day, = 20:00 ET)
      users: [alice, bob, carol]
    - name: "EU Hours"
      rotation_virtual_start: "2026-01-01T07:00:00Z"  # 08:00 CET
      rotation_turn_length_seconds: 604800
      start: "07:00"
      end: "14:00"
      users: [dieter, eva, franz]
```

Requires at least 2-3 engineers per timezone. If you have a small team in one timezone, follow-the-sun is not an option – do not fake it by making one person cover overnight alone.
Hybrid Model#
Weekly primary with follow-the-sun override for overnight hours. The weekly primary handles business hours and is the escalation point. A shared overnight rotation covers sleeping hours with a higher severity threshold – only SEV-1 pages go out overnight.
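A minimal sketch of this layout, in the same schematic YAML as the follow-the-sun schedule above (layer names, hours, and users are illustrative, and the SEV-1-only restriction would normally live in your routing rules rather than the schedule itself):
```yaml
# Hybrid schedule sketch (schematic, mirroring the layers above; values illustrative)
schedule:
  name: "payment-api-hybrid"
  layers:
    - name: "Weekly Primary"                # business hours + escalation point
      rotation_turn_length_seconds: 604800  # 1 week
      start: "13:00"  # UTC, 08:00 ET
      end: "01:00"    # UTC (next day, = 20:00 ET)
      users: [alice, bob, carol, dave]
    - name: "Shared Overnight"              # SEV-1 only, enforced by routing rules
      rotation_turn_length_seconds: 604800
      start: "01:00"  # UTC
      end: "13:00"    # UTC
      users: [alice, bob, carol, dave, dieter, eva]
```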
Escalation Policies#
Every page must have an escalation path. If the primary does not acknowledge within a defined window, escalate automatically.
```yaml
# PagerDuty escalation policy
escalation_policy:
  name: "payment-api"
  rules:
    - escalation_delay_in_minutes: 5
      targets:
        - type: "schedule_reference"
          id: "primary-oncall-schedule"
    - escalation_delay_in_minutes: 10
      targets:
        - type: "schedule_reference"
          id: "secondary-oncall-schedule"
    - escalation_delay_in_minutes: 15
      targets:
        - type: "user_reference"
          id: "engineering-manager"
  repeat_enabled: true
  num_loops: 2
```

Key design decisions:
- Acknowledgment window: 5 minutes for SEV-1, 15 minutes for SEV-2, 30 minutes for SEV-3 (one way to encode this is sketched after this list).
- Secondary on-call: Always have one. The secondary is not for overflow – they are the backup when the primary is unreachable.
- Manager escalation: After primary and secondary are exhausted, escalate to the engineering manager. They may not fix it, but they will find someone who can.
- Never escalate to the entire team. Paging everyone creates a bystander effect where nobody responds because they assume someone else will.
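One way to encode the per-severity acknowledgment windows is a separate escalation policy per severity, sketched below in the same schematic YAML as the policy above (the names and the top-level list structure are illustrative assumptions, not a literal API shape); the first rule's delay is the acknowledgment window before the secondary is paged:
```yaml
# Per-severity acknowledgment windows (schematic sketch; names illustrative)
escalation_policies:
  - name: "payment-api-sev1"
    rules:
      - escalation_delay_in_minutes: 5    # SEV-1 ack window
        targets:
          - type: "schedule_reference"
            id: "primary-oncall-schedule"
      - escalation_delay_in_minutes: 10
        targets:
          - type: "schedule_reference"
            id: "secondary-oncall-schedule"
  - name: "payment-api-sev2"
    rules:
      - escalation_delay_in_minutes: 15   # SEV-2 ack window
        targets:
          - type: "schedule_reference"
            id: "primary-oncall-schedule"
      - escalation_delay_in_minutes: 15
        targets:
          - type: "schedule_reference"
            id: "secondary-oncall-schedule"
  # SEV-3 follows the same pattern with a 30-minute window, or skips paging
  # entirely and lands in a Slack channel (see Alert Fatigue Mitigation below).
```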
Handoff Procedures#
A sloppy handoff loses context and causes repeated pages for known issues. Structured handoffs take 15 minutes and prevent hours of wasted investigation.
```markdown
## On-Call Handoff Template

**Date**: 2026-02-17 → 2026-02-24
**Outgoing**: Alice | **Incoming**: Bob

### Active Issues
- payment-api: Elevated p99 latency (tracking in JIRA-4521).
  Root cause identified as connection pool saturation. Fix deployed
  but monitoring for recurrence. If it recurs, restart pod and escalate.
- search-service: Intermittent 503s from upstream provider.
  Not actionable on our side. Suppress alert if < 1% error rate.

### Recent Changes
- 2026-02-19: Deployed auth-service v2.4.1 (new rate limiter)
- 2026-02-20: Database failover test completed successfully

### Alert Tuning Notes
- "disk-usage-critical" alert on logging-node-3 is a known false
  positive. Ticket JIRA-4530 open to fix threshold.

### Environment Notes
- Staging environment is down for maintenance until Wednesday.
```

Conduct handoffs synchronously – a 15-minute video call or in-person meeting. Async handoff documents are necessary but not sufficient. The outgoing on-call should walk through active issues and any unusual system behavior.
On-Call Compensation#
Uncompensated on-call is a retention disaster. Common models:
| Model | Description | Typical Range |
|---|---|---|
| Flat stipend | Fixed pay per on-call shift | $500-1500/week |
| Per-page bonus | Additional pay per page received | $50-200/page |
| Time-off-in-lieu | Comp day after on-call week | 1 day per week on-call |
| Hybrid | Base stipend + per-page + comp time | Most common at mature orgs |
Non-negotiable: if someone is woken up at 3 AM, they should not be expected in standup at 9 AM. Allow late starts or half-days after nighttime pages. This is not generous – it is necessary for sustained performance.
Alert Fatigue Mitigation#
Alert fatigue is the top on-call quality killer. When engineers receive too many alerts, they stop responding to any of them effectively.
Concrete anti-fatigue measures:
- Target: Fewer than 2 pages per day while on call. More than that indicates alert tuning problems.
- Review every alert weekly: If an alert fired and required no human action, it should be automated or deleted.
- Severity-based routing: Only SEV-1 and SEV-2 page humans. SEV-3 goes to a Slack channel. SEV-4 goes to a ticket queue.
- Aggregate related alerts: If 10 pods restart within 5 minutes, send one alert about the deployment, not 10 individual pod alerts (see the grouping sketch after this list).
- Snooze and suppress: Allow on-call to snooze known-issue alerts for a bounded time window (max 24 hours) with a required ticket link.
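Aggregation usually does not need custom code. If your alerts flow through Prometheus Alertmanager, its routing tree can batch related alerts into a single notification; a minimal sketch, assuming a deployment label on the alerts and a PagerDuty receiver:
```yaml
# Alertmanager grouping sketch: batch pod-restart alerts that share a
# deployment label and send one notification after a 5-minute window.
# (Label names, receiver name, and the integration key are assumptions.)
route:
  receiver: "pagerduty-oncall"
  group_by: ["alertname", "deployment"]
  group_wait: 5m        # wait up to 5 minutes to batch related alerts
  group_interval: 10m   # batch further alerts for an existing group
  repeat_interval: 4h   # re-notify if still firing
receivers:
  - name: "pagerduty-oncall"
    pagerduty_configs:
      - routing_key: "<events-api-v2-integration-key>"
```
For the snooze-and-suppress measure, the OpsGenie policy below keeps a known-issue alert quiet for a bounded 24-hour window: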
```yaml
# OpsGenie alert policy to suppress noise
alert_policy:
  name: "suppress-known-pod-restarts"
  conditions:
    - field: "alias"
      operation: "contains"
      value: "pod-restart"
    - field: "tags"
      operation: "contains"
      value: "known-issue-JIRA-4530"
  actions:
    - type: "suppress"
      duration: "24h"
```

On-Call Quality Metrics#
Track these monthly to measure on-call health:
| Metric | Target | Red Flag |
|-------------------------------|------------------|------------------|
| Pages per shift | < 2/day | > 5/day |
| Mean time to acknowledge | < 5 min | > 15 min |
| Mean time to resolve | < 30 min (SEV-2) | > 2 hours |
| Nighttime pages (11PM-7AM) | < 1/week | > 3/week |
| False positive rate | < 10% | > 30% |
| Escalation rate | < 5% | > 20% |
| On-call satisfaction (survey) | > 7/10           | < 5/10           |

Build a Grafana dashboard pulling from PagerDuty or OpsGenie APIs. Review it in your monthly reliability review. If nighttime pages are consistently high, that is a reliability problem to fix, not an on-call problem to staff around.
The healthiest on-call rotations are boring. Engineers carry the pager, handle the rare real issue promptly, and spend most of their shift doing normal engineering work. If on-call is exciting, something is broken.