Why Run Exercises#

Runbooks that have never been tested are fiction. Failover procedures that have never been executed are hopes. Game days and tabletop exercises convert assumptions about system resilience into verified facts – or reveal that those assumptions were wrong before a real incident does.

The value is not just finding technical gaps. Exercises expose process gaps: unclear escalation paths, missing permissions, outdated contact lists, communication breakdowns between teams. These are invisible until a simulated failure forces people to actually follow the documented procedure.

Types of Exercises#

Tabletop Exercises#

A discussion-based exercise where a facilitator presents a failure scenario and participants walk through their response verbally. No systems are touched. Duration: 60-90 minutes.

Best for: testing incident response processes, evaluating communication flows, training new on-call engineers, exploring complex multi-failure scenarios that would be risky to inject live.

Tabletop Scenario Example:
"It is Tuesday at 2:15 PM. Your monitoring shows that the primary
database in us-east-1 has stopped accepting writes. Read replicas
are still serving reads. The auto-failover did not trigger.
Customer support reports that users cannot save data.

Questions for the team:
1. Who gets paged? What is the severity?
2. What is your first action?
3. How do you communicate to customers?
4. What is the manual failover procedure?
5. How do you verify the failover succeeded?
6. What is the rollback plan if failover makes things worse?"

Fault Injection Exercises#

Controlled injection of specific failures into a non-production or production environment using chaos engineering tools. Duration: 2-4 hours including prep and review.

Best for: validating that automated recovery works, testing circuit breakers and timeouts, measuring actual recovery times, verifying monitoring and alerting catches the failure.

# Chaos Mesh fault injection for a game day: kill one payment-api pod
# every 5 minutes for as long as this Schedule exists. pod-kill is an
# instantaneous action, so the recurrence lives in the Chaos Mesh 2.x
# Schedule resource rather than in a duration field.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: gameday-kill-payment-pod
  namespace: production
spec:
  schedule: "@every 5m"        # one injection cycle every 5 minutes
  concurrencyPolicy: Forbid    # never start a new injection while one is pending
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one                  # pick a single random pod matching the selector
    selector:
      namespaces: [production]
      labelSelectors:
        app: payment-api

Full Game Days#

A multi-hour exercise combining fault injection with live incident response. The team responds as if it were a real incident: pages fire, the incident commander role is activated, communication channels are used, status pages are updated. Duration: half-day to full day.

Best for: end-to-end validation of incident response, testing cross-team coordination, measuring actual MTTR under realistic conditions, executive visibility into reliability posture.
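
Because a full game day runs through the real paging path, it helps to open with a clearly labeled drill page rather than a Slack ping. The retrospective later on this page assumes PagerDuty, so the sketch below uses the PagerDuty Events API v2; the routing key, dedup key, and summary text are placeholders, and if the injected fault already trips your normal alerts this manual trigger is unnecessary.

# Fire a synthetic "[GAME DAY DRILL]" page through the PagerDuty Events API v2.
# ROUTING_KEY is a placeholder for the integration key of the service under test.
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "REPLACE_WITH_INTEGRATION_KEY"

def trigger_drill_page(summary: str, dedup_key: str) -> str:
    """Send a trigger event; PagerDuty echoes back the dedup_key on success."""
    response = requests.post(
        EVENTS_URL,
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "dedup_key": dedup_key,  # reuse the same key with "resolve" to close the drill
            "payload": {
                "summary": summary,
                "source": "gameday-facilitator",
                "severity": "critical",
            },
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["dedup_key"]

if __name__ == "__main__":
    trigger_drill_page(
        "[GAME DAY DRILL] payment-api: primary database not accepting writes",
        "gameday-db-failover-drill",
    )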

Scenario Design#

Good scenarios are specific, plausible, and have measurable outcomes. Bad scenarios are vague (“what if the network goes down?”) or implausible (“what if all three regions fail simultaneously?”).

Scenario Template#

## Game Day Scenario: Database Failover

### Background
The payment-api service relies on a PostgreSQL primary with two
read replicas in us-east-1. Automatic failover is configured
via Patroni with a 30-second detection window.

### Injection
Kill the PostgreSQL primary process at a random time during the
exercise window (10:00 AM - 12:00 PM).

### Expected Behavior
1. Patroni detects primary failure within 30 seconds
2. Replica promoted to primary within 60 seconds
3. Application reconnects within 90 seconds
4. Write operations resume within 2 minutes total
5. Alert fires within 1 minute of primary failure
6. On-call acknowledges alert within 5 minutes

### Success Criteria
- [ ] Total write downtime < 2 minutes
- [ ] No data loss (verified by consistency check post-exercise)
- [ ] Alert fired and was acknowledged
- [ ] Status page updated within 10 minutes
- [ ] Customer notification sent (if applicable)

### Abort Criteria
- Write downtime exceeds 10 minutes with no recovery in sight
- Data corruption detected
- Unrelated production issues arise requiring attention

Design scenarios that test one thing well rather than everything at once. A database failover exercise should not simultaneously inject network partitions and disk failures.
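
The measurable outcomes in the success criteria are easiest to capture with a small write probe that runs for the entire exercise window, rather than reconstructing downtime from dashboards afterwards. A minimal sketch, assuming a dedicated gameday_probe table and connection string that are illustrative only, not part of any real runbook:

# Write probe for the failover exercise: attempt one INSERT per second and
# report how long writes were unavailable. The DSN and gameday_probe table
# are assumptions for illustration only.
import time
import psycopg2

DSN = "host=payments-db.internal dbname=payments user=gameday"  # hypothetical
INTERVAL_SECONDS = 1.0

def try_write(seq: int) -> bool:
    """Attempt a single probe INSERT; return True if it committed."""
    try:
        conn = psycopg2.connect(DSN, connect_timeout=2)
        try:
            with conn, conn.cursor() as cur:  # commits on clean exit
                cur.execute(
                    "INSERT INTO gameday_probe (seq, ts) VALUES (%s, now())",
                    (seq,),
                )
        finally:
            conn.close()
        return True
    except psycopg2.Error:
        return False

def main() -> None:
    outage_started = None
    seq = 0
    while True:
        ok = try_write(seq)
        now = time.monotonic()
        if not ok and outage_started is None:
            outage_started = now
            print(f"[probe] write outage began at seq={seq}")
        elif ok and outage_started is not None:
            print(f"[probe] writes recovered after {now - outage_started:.1f}s")
            outage_started = None
        seq += 1
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()

Extended to record which seq values committed, the same probe also turns the no-data-loss criterion into a concrete post-exercise query: every seq the probe saw commit must still be present after failover.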

Roles#

Every exercise needs clearly assigned roles before it begins.

Facilitator: Runs the exercise. Introduces the scenario, manages the timeline, injects failures (or directs the chaos tooling), takes notes on what happens, and calls abort if needed. The facilitator does not participate in the response – they observe and document.

Incident Commander: Practices the IC role as they would in a real incident. Coordinates responders, makes decisions, manages communication. In a tabletop, this is verbal. In a full game day, this is live.

Participants: Engineers who respond to the scenario. They follow runbooks, execute procedures, debug issues, and communicate status. They should treat the exercise as real – no looking at the scenario document for answers.

Observers: Stakeholders, managers, or engineers from other teams who watch silently and take notes. Observers provide feedback in the retrospective but do not intervene during the exercise.

## Game Day Roles Assignment

| Role               | Person    | Responsibilities                           |
|--------------------|-----------|--------------------------------------------|
| Facilitator        | Alice     | Run scenario, inject faults, take notes    |
| Incident Commander | Bob       | Coordinate response, make decisions        |
| Technical Lead     | Carol     | Debug and execute remediation              |
| Comms Lead         | Dave      | Update status page, notify stakeholders    |
| Observer           | Eva (EM)  | Watch and provide feedback post-exercise   |
| Observer           | Frank (PM)| Watch and provide feedback post-exercise   |

Runbook Validation#

Game days are the best opportunity to validate runbooks. Before the exercise, print or pull up the relevant runbooks. During the exercise, participants must follow them step by step.

Track every deviation:

## Runbook Validation Log

Runbook: "PostgreSQL Primary Failover" (wiki/runbooks/pg-failover)

| Step | Description              | Result  | Notes                          |
|------|--------------------------|---------|--------------------------------|
| 1    | Verify primary is down   | PASS    | Used pg_isready command        |
| 2    | Check Patroni status     | PASS    | patronictl list showed correct |
| 3    | Initiate manual failover | FAIL    | Command in runbook was wrong   |
|      |                          |         | Documented: patronictl failover|
|      |                          |         | Actual: patronictl switchover  |
| 4    | Verify new primary       | PASS    | Confirmed writes succeeding    |
| 5    | Update DNS               | SKIP    | Not needed with Patroni setup  |
| 6    | Notify stakeholders      | PASS    | Slack channel updated          |

Every FAIL and SKIP is a runbook improvement action item. Update the runbook within 48 hours of the exercise.

Post-Exercise Retrospective#

Run a 30-60 minute retrospective immediately after the exercise while details are fresh.

## Game Day Retrospective

**Date**: 2026-02-20
**Scenario**: Database failover
**Duration**: 2.5 hours (10:00 - 12:30)

### Timeline
- 10:00: Exercise started, scenario briefed
- 10:23: Primary database killed (injection)
- 10:24: Patroni detected failure (61 seconds)
- 10:25: Replica promoted (38 seconds after detection)
- 10:26: Application reconnected, writes resumed
- 10:27: Alert fired in PagerDuty
- 10:29: IC acknowledged and opened incident channel
- 10:42: Status page updated

### What Went Well
- Automated failover completed end to end with no manual intervention
- Team followed incident communication protocol correctly
- Application reconnection was seamless

### What Needs Improvement
- Detection took 61 seconds and total write downtime was ~3 minutes, both over target
- Alert fired 1 minute after writes resumed (alerting lag)
- Runbook step 3 had incorrect command
- Status page update took 13 minutes (target: 10 minutes)
- No one verified data consistency post-failover

### Action Items
| Action                                  | Owner  | Due        |
|-----------------------------------------|--------|------------|
| Fix alerting delay on DB primary loss   | Carol  | 2026-03-01 |
| Update failover runbook step 3          | Bob    | 2026-02-22 |
| Add data consistency check to runbook   | Carol  | 2026-02-28 |
| Practice status page updates in staging | Dave   | 2026-03-05 |

Scheduling Cadence#

| Exercise Type    | Frequency       | Duration    | Audience          |
|------------------|-----------------|-------------|-------------------|
| Tabletop         | Monthly         | 60-90 min   | On-call team      |
| Fault injection  | Quarterly       | 2-4 hours   | Service team      |
| Full game day    | Semi-annually   | Half day    | Cross-team + mgmt |
| DR failover test | Annually        | Full day    | All engineering   |

Rotate scenarios each time. Do not repeat the same exercise unless the previous one failed and you need to verify the fix. Keep a scenario backlog: every postmortem should generate at least one exercise scenario that tests whether the fix actually works. The best game day scenarios come directly from real incidents that you never want to repeat.