Incident Lifecycle Overview#

An incident is an unplanned disruption to a service requiring coordinated response. The lifecycle has six phases: detection, triage, communication, mitigation, resolution, and review. Each has defined actions, owners, and exit criteria.

Phase 1: Detection#

Incidents are detected through three channels. Automated monitoring is the best case – alerts fire on SLO violations or error thresholds before users notice. Internal reports come from other teams noticing issues with dependencies. Customer reports are the worst case – if users detect your incidents before you do, your observability has gaps.

When an alert fires, confirm it is not a false positive, determine the scope (one service or multiple, one region or global), check for changes deployed in the last 2 hours, and check for known ongoing issues.
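
These checks lend themselves to automation. A minimal sketch of an alert-context gatherer, assuming hypothetical `monitoring` and `deploys` clients (substitute whatever your observability and deployment systems actually expose):

```python
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(hours=2)  # window for "recent changes"

def gather_alert_context(alert, monitoring, deploys):
    """Collect the first-pass detection checks for a firing alert.

    `monitoring` and `deploys` are hypothetical clients; map them to
    your own observability and deployment tooling.
    """
    now = datetime.now(timezone.utc)
    return {
        # Scope: one service or multiple? One region or global?
        "affected_services": monitoring.services_breaching_slo(alert),
        "affected_regions": monitoring.regions_breaching_slo(alert),
        # Recent changes: deploys and config changes in the lookback window.
        "recent_changes": deploys.list_changes(since=now - LOOKBACK),
        # Known issues: avoid declaring a duplicate incident.
        "ongoing_incidents": monitoring.open_incidents(),
    }
```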

Phase 2: Triage#

Triage answers: how bad is this, and who handles it?

| Severity | User Impact | Response Time |
|---|---|---|
| SEV-1 | Complete outage or data loss for all users | Immediate, all-hands |
| SEV-2 | Major feature unavailable for many users | Within 15 minutes |
| SEV-3 | Minor feature degraded or subset affected | Within 1 hour |
| SEV-4 | Minimal impact, internal tooling | Next business day |

Triage decision tree:

  1. Is the service completely unavailable? SEV-1.
  2. Are core user-facing features degraded for over 50% of users? SEV-2.
  3. Degraded for a subset of users? SEV-3.
  4. Only internal operations affected? SEV-4.
  5. No real impact? Not an incident – file a bug.
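
The decision tree maps directly to code. A sketch of the classification logic, assuming the caller has already measured availability and the fraction of external users affected:

```python
def classify_severity(completely_unavailable: bool,
                      core_features_degraded: bool,
                      affected_user_fraction: float,
                      internal_only: bool) -> str | None:
    """Apply the triage decision tree.

    `affected_user_fraction` counts external users. Returns None for
    "not an incident" – file a bug instead.
    """
    if completely_unavailable:
        return "SEV-1"
    if core_features_degraded and affected_user_fraction > 0.5:
        return "SEV-2"
    if affected_user_fraction > 0:
        return "SEV-3"  # degraded for a subset of users
    if internal_only:
        return "SEV-4"
    return None
```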

After classifying severity: create an incident record, open a dedicated communication channel, and page appropriate responders.

Phase 3: Communication#

Roles#

Incident Commander (IC): Owns the incident. Makes decisions on severity, resource allocation, escalation. Coordinates – does not debug.

Technical Lead: Actively debugging and fixing. Reports findings to the IC. May be multiple leads for complex incidents.

Communications Lead: Updates the status page, posts to stakeholder channels, drafts customer notifications.

Scribe: Documents the timeline in real-time with timestamps. This becomes the post-incident review foundation.

Communication Cadence#

| Severity | Status Page | Stakeholder Updates |
|---|---|---|
| SEV-1 | Within 5 minutes | Every 15 minutes |
| SEV-2 | Within 15 minutes | Every 30 minutes |
| SEV-3/4 | If customer-facing | At declaration and resolution |
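
Encoded as data, the cadence can drive automated reminders so the communications lead never misses an update. A sketch, assuming a hypothetical `schedule_reminder` callback:

```python
from datetime import timedelta

# Communication cadence per severity; None means "not required by default".
CADENCE = {
    "SEV-1": {"status_page": timedelta(minutes=5),  "stakeholders": timedelta(minutes=15)},
    "SEV-2": {"status_page": timedelta(minutes=15), "stakeholders": timedelta(minutes=30)},
    "SEV-3": {"status_page": None, "stakeholders": None},  # status page only if customer-facing
    "SEV-4": {"status_page": None, "stakeholders": None},
}

def start_comms_timers(severity: str, schedule_reminder) -> None:
    """Nag the communications lead on the required cadence.

    `schedule_reminder` is a hypothetical hook taking (message, interval).
    """
    cadence = CADENCE[severity]
    if cadence["status_page"] is not None:
        schedule_reminder("Update the status page", cadence["status_page"])
    if cadence["stakeholders"] is not None:
        schedule_reminder("Post a stakeholder update", cadence["stakeholders"])
```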

Templates#

Investigating:

We are aware of [symptoms] affecting [component]. Our team is
actively investigating. We will provide updates every [cadence].

Identified:

We have identified the cause as [brief explanation]. We are
implementing [mitigation]. Expected resolution: [timeframe].

Resolved:

The issue has been resolved as of [timestamp]. [Brief summary
of cause and fix]. Duration: [time]. We will conduct a
post-incident review.
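
Because the templates are plain strings with named placeholders, they can be filled mechanically. A sketch using Python's `str.format`; the placeholder names are assumptions, not a fixed schema:

```python
TEMPLATES = {
    "investigating": (
        "We are aware of {symptoms} affecting {component}. Our team is "
        "actively investigating. We will provide updates every {cadence}."
    ),
    "identified": (
        "We have identified the cause as {explanation}. We are "
        "implementing {mitigation}. Expected resolution: {timeframe}."
    ),
    "resolved": (
        "The issue has been resolved as of {timestamp}. {summary}. "
        "Duration: {duration}. We will conduct a post-incident review."
    ),
}

# Example: an "investigating" update for an elevated-error-rate incident.
update = TEMPLATES["investigating"].format(
    symptoms="elevated error rates",
    component="the checkout API",
    cadence="15 minutes",
)
```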

Phase 4: Mitigation#

Mitigation restores service. It does not fix the root cause – it stops the bleeding. Use the fastest strategy available:

  1. Rollback the last change. If a deployment happened recently, revert first, investigate later.
  2. Scale up. If the system is overwhelmed, add capacity.
  3. Shed load. Enable rate limiting, disable non-critical features.
  4. Failover. Switch to a redundant system or secondary region.
  5. Restart. Buys time even if it does not solve the root cause.
  6. Hotfix. Targeted code or config change. Slowest option – use only when rollback is impossible.

During mitigation: announce every action before executing it, make one change at a time, wait 2-5 minutes after each action to observe metrics, and revert any mitigation that does not work before trying the next.
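
That discipline (announce, apply one change, observe, revert on failure) is itself a loop over the strategy list. A sketch with hypothetical `announce`, `apply`, and `revert` hooks:

```python
import time

OBSERVATION_WINDOW_SECONDS = 5 * 60  # wait 2-5 minutes; this uses the upper bound

def run_mitigations(strategies, announce, metrics_recovered):
    """Try mitigation strategies in order, fastest first.

    Each strategy is a hypothetical object with .name, .apply(), .revert().
    `announce` posts to the incident channel; `metrics_recovered` reports
    whether service metrics have returned toward baseline.
    """
    for strategy in strategies:
        announce(f"Applying mitigation: {strategy.name}")  # announce before acting
        strategy.apply()                                   # one change at a time
        time.sleep(OBSERVATION_WINDOW_SECONDS)             # observe before judging
        if metrics_recovered():
            announce(f"Mitigation succeeded: {strategy.name}")
            return strategy
        announce(f"Mitigation did not work, reverting: {strategy.name}")
        strategy.revert()                                  # revert before trying the next
    return None  # nothing worked; escalate
```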

Phase 5: Resolution#

Resolution is confirmed when service metrics return to baseline for at least 15 minutes, synthetic checks pass, and the fix is verified across all affected environments.
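
A sketch of the bake-period check, assuming hypothetical `metrics_at_baseline` and `synthetic_checks_pass` probes:

```python
import time

BAKE_PERIOD_SECONDS = 15 * 60  # metrics must hold at baseline this long
PROBE_INTERVAL_SECONDS = 60

def resolution_confirmed(metrics_at_baseline, synthetic_checks_pass) -> bool:
    """Return True once metrics stay at baseline for the full bake period."""
    elapsed = 0
    while elapsed < BAKE_PERIOD_SECONDS:
        if not (metrics_at_baseline() and synthetic_checks_pass()):
            return False  # regression during the bake period; keep mitigating
        time.sleep(PROBE_INTERVAL_SECONDS)
        elapsed += PROBE_INTERVAL_SECONDS
    return True
```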

The IC declares resolved with a timestamp, the communications lead posts the resolution update, and a post-incident review is scheduled (within 48 hours for SEV-1/2, one week for SEV-3).

Phase 6: Post-Incident Review#

The review exists to learn and improve, not to blame. Use blameless language and focus on systems and processes.

Review Structure#

Timeline: Minute-by-minute reconstruction from scribe notes, chat logs, and monitoring data.

Impact: Duration, affected users, failed requests, SLO budget consumed.

Contributing factors: Systemic issues that led to the incident – missing monitoring, untested failure mode, unclear runbook.

Action items are categorized as:

  • Prevent: Stop this class of incident from happening.
  • Detect: Catch it faster next time.
  • Mitigate: Reduce impact or speed recovery.
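
Where action items are tracked programmatically, the three categories make a natural enum. A minimal sketch of a record matching the review template's table:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class ActionType(Enum):
    PREVENT = "Prevent"    # stop this class of incident from happening
    DETECT = "Detect"      # catch it faster next time
    MITIGATE = "Mitigate"  # reduce impact or speed recovery

@dataclass
class ActionItem:
    action: str
    type: ActionType
    owner: str       # a team, per the review template
    due_date: date
```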

Review Template#

## Post-Incident Review: [Title]
**Severity:** SEV-X | **Duration:** X hours | **IC:** [Name]

### Summary
[2-3 sentences: what happened, what was the impact]

### Timeline
| Time (UTC) | Event |
|---|---|
| HH:MM | Alert fired |
| HH:MM | IC declared incident |
| HH:MM | Root cause identified |
| HH:MM | Resolution confirmed |

### Contributing Factors
1. [Factor]

### What Went Well
- [Item]

### What Could Be Improved
- [Item]

### Action Items
| Action | Type | Owner | Due Date |
|---|---|---|---|
| [Action] | Prevent | [Team] | [Date] |
| [Action] | Detect | [Team] | [Date] |
| [Action] | Mitigate | [Team] | [Date] |

Step-by-Step Incident Playbook#

  1. Alert received. Acknowledge. Open the relevant dashboard.
  2. Assess scope. One service or multiple? One region or global? Check recent changes.
  3. Classify severity. Use the triage decision tree.
  4. Declare the incident. Create an incident record. Open a channel. Page responders (see the sketch after this list).
  5. Assign roles. IC, Technical Lead, Communications Lead, Scribe.
  6. Update status page. Post investigating status within the SLA for the severity.
  7. Investigate and mitigate. Fastest option first. One change at a time. Announce before acting.
  8. Communicate on cadence. Post updates at defined intervals. If no new info, say so.
  9. Confirm resolution. Metrics at baseline, checks passing. Wait the bake period.
  10. Declare resolved. Post resolution to status page and stakeholders. Record final metrics.
  11. Schedule review. Set the post-incident review within the SLA for the severity.
  12. Complete review. Write the document. Assign action items with owners and due dates.
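
Steps 4 through 6 are mechanical and are common targets for automation. A sketch of a declaration helper, assuming hypothetical `tracker`, `chat`, `pager`, and `statuspage` clients:

```python
from datetime import datetime, timezone

def declare_incident(title, severity, tracker, chat, pager, statuspage):
    """Steps 4-6 of the playbook: declare, page, post investigating status.

    All four clients are hypothetical; map them to your own tooling.
    """
    declared_at = datetime.now(timezone.utc)
    record = tracker.create_incident(title=title, severity=severity,
                                     declared_at=declared_at)
    channel = chat.create_channel(f"inc-{record.id}")        # dedicated channel
    pager.page_responders(severity=severity, channel=channel)
    statuspage.post(status="investigating",
                    message=f"We are aware of an issue affecting {title}. "
                            "Our team is actively investigating.")
    return record, channel
```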

Agent Operational Notes#

  • Defer to the human IC. An agent gathers data, runs diagnostics, and suggests mitigations. A human makes the final call.
  • Surface context, not conclusions. Present metrics, recent changes, and related past incidents. Let humans draw conclusions.
  • Timestamp everything. Every agent action gets a UTC timestamp in the incident channel.
  • Do not execute mitigation without approval unless operating within a pre-approved automated runbook (see the sketch after this list).
  • Maintain the timeline. If acting as scribe, capture every event, decision, and action. Err on the side of too much detail.
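
The approval rule can be enforced in the agent itself. A sketch of the gate, assuming a hypothetical registry of pre-approved runbooks and a human-approval hook:

```python
from datetime import datetime, timezone

# Hypothetical runbook names for illustration only.
PRE_APPROVED_RUNBOOKS = {"restart-stateless-frontend", "enable-rate-limiting"}

def agent_execute(action, runbook_id, request_human_approval, post_to_channel):
    """Execute a mitigation only if pre-approved or explicitly approved.

    `request_human_approval` and `post_to_channel` are hypothetical hooks
    into the incident channel; every action is logged with a UTC timestamp.
    """
    timestamp = datetime.now(timezone.utc).isoformat()
    if runbook_id in PRE_APPROVED_RUNBOOKS:
        post_to_channel(f"[{timestamp}] Agent executing pre-approved runbook "
                        f"{runbook_id}: {action.description}")
        return action.execute()
    if request_human_approval(action):  # defer to the human IC
        post_to_channel(f"[{timestamp}] Agent executing approved action: "
                        f"{action.description}")
        return action.execute()
    post_to_channel(f"[{timestamp}] Agent action declined by IC: {action.description}")
    return None
```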