What a Post-Mortem Is and Is Not#

A post-mortem is a structured analysis of an incident conducted after the incident is resolved. Its purpose is to understand what happened, why it happened, and what changes will prevent it from happening again. It is not a blame assignment exercise. It is not a performance review. It is not a formality to check a compliance box.

The output of a good post-mortem is a set of concrete action items that improve the system. Not the humans – the system. If your post-mortem concludes with “engineer X should have been more careful,” you have failed at the process. Humans make mistakes. Systems should be designed so that human mistakes do not cause outages, and when they do, the blast radius is contained.

When to Write a Post-Mortem#

Not every incident needs a post-mortem. Define criteria so the team knows when one is required:

  • Always: Any incident that caused user-facing impact lasting more than 15 minutes.
  • Always: Any incident that required escalation beyond the primary on-call engineer.
  • Always: Any incident involving data loss or data corruption, regardless of duration.
  • Always: Any incident that was detected by users rather than monitoring.
  • Optional: Near-misses where a problem was caught before user impact but revealed a systemic risk.
  • Optional: Incidents resolved quickly but caused by a novel failure mode not seen before.

Near-miss post-mortems are some of the most valuable. They provide learning opportunities without the pressure and emotion of an active outage.

Constructing the Incident Timeline#

The timeline is the factual backbone of the post-mortem. It records what happened and when, without interpretation or judgment. Build it from objective sources: monitoring data, deployment logs, chat transcripts, and PagerDuty/Opsgenie event histories.

Data Sources for Timeline Construction#

| Source | What it tells you |
|---|---|
| Monitoring dashboards | When metrics deviated from normal |
| Alert history | When alerts fired and were acknowledged |
| Deployment pipeline | When deployments occurred |
| Chat/Slack transcripts | When humans noticed, communicated, and acted |
| Version control | What code or config changes were deployed |
| PagerDuty/Opsgenie logs | When pages were sent, acknowledged, escalated |
| Load balancer access logs | When error rates or latency changed |
| Database slow query logs | When queries started degrading |
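
Pulling these sources together by hand is tedious. Below is a minimal sketch of merging exported events into a single UTC-ordered timeline, assuming each source can be dumped as (timestamp, source, description) rows; the event data, source names, and output format are illustrative assumptions, not a prescribed tool.

from datetime import datetime, timezone

# Hypothetical exports from different sources. In practice these would come
# from your monitoring, deployment, and paging systems' APIs or CSV exports.
alert_events = [
    ("2026-02-10T14:07:12Z", "prometheus", "APIHighErrorRate fired"),
]
deploy_events = [
    ("2026-02-10T14:02:03Z", "ci", "Deploy v2.34.1 started for api-gateway"),
]
chat_events = [
    ("2026-02-10T14:12:40Z", "slack", "Alice: error rate at 8% on the gateway dashboard"),
]

def parse(ts: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp such as 2026-02-10T14:07:12Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)

# Merge every source and sort by time so the timeline reads chronologically.
timeline = sorted(alert_events + deploy_events + chat_events,
                  key=lambda event: parse(event[0]))

for ts, source, description in timeline:
    print(f"| {parse(ts):%H:%M} | [{source}] {description} |")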

Timeline Format#

Use UTC timestamps consistently. Include both automated events (alerts, deployments) and human actions (decisions, commands run).

## Timeline (all times UTC)

| Time | Event |
|---|---|
| 14:02 | Deploy v2.34.1 begins rolling out to api-gateway (3 pods) |
| 14:04 | First pod updated, passes readiness probe |
| 14:06 | Second pod updated, readiness probe starts failing |
| 14:07 | Prometheus alert APIHighErrorRate fires (5xx rate > 1%) |
| 14:08 | PagerDuty pages on-call engineer (Alice) |
| 14:09 | Alice acknowledges the page |
| 14:12 | Alice opens Grafana dashboard, sees error rate at 8% |
| 14:14 | Alice checks pod logs, sees "connection refused" to auth-service |
| 14:16 | Alice suspects deployment issue, runs kubectl rollout undo |
| 14:18 | Rollback completes, error rate begins dropping |
| 14:22 | Error rate returns to baseline (< 0.1%) |
| 14:25 | Alice confirms resolution, closes the incident |

Timeline Best Practices#

Be precise about times. “Around 2 PM” is not useful. “14:07 UTC” is. Monitoring systems have exact timestamps – use them.

Separate observation from action. “Alice noticed the error rate was high” (observation) is different from “Alice rolled back the deployment” (action). Record both.

Include what did not happen. If the alert should have fired 3 minutes earlier but did not because the alert rule's for duration was set too long, that gap belongs in the timeline. If the runbook did not cover this scenario and the engineer had to improvise, note where the runbook failed.

Do not editorialize in the timeline. The timeline records facts. Analysis comes later. “Alice spent too long looking at the wrong dashboard” is judgment, not fact. “14:12-14:14: Alice reviewed the main dashboard, then switched to the pod-level view” is fact.

Root Cause Analysis Techniques#

The 5 Whys#

The 5 Whys technique drills past surface-level explanations to find systemic causes. Start with the problem statement and ask “why” repeatedly, following the causal chain.

Example: API outage caused by a bad deployment

Problem: API returned 5xx errors for 16 minutes.

Why? The api-gateway pod could not connect to the auth-service.

Why? The auth-service was deployed with an incompatible schema migration
     that broke the /validate endpoint.

Why? The schema migration was not tested against the api-gateway's
     request format.

Why? There is no integration test that validates the auth-service
     contract from the api-gateway's perspective.

Why? The services are developed by different teams, and there is no
     shared contract testing process.

Root cause: No cross-service contract testing in the CI pipeline.
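
That final answer converts directly into an action item. A minimal sketch of the kind of contract test that was missing, written pytest-style and assuming a hypothetical auth-service instance started for the CI run; the URL, payload shape, and response fields are assumptions, not the real contract:

import requests

# Hypothetical: base URL of the auth-service instance brought up for CI.
AUTH_SERVICE_URL = "http://localhost:8080"

def test_validate_accepts_gateway_request_format():
    """Send /validate a request shaped exactly the way the api-gateway does.

    If a schema migration or refactor in auth-service breaks this contract,
    the failure shows up in CI instead of during a production rollout.
    """
    # Payload mirrors what the gateway sends in production (assumed shape).
    payload = {"token": "test-token", "scopes": ["read:products"]}

    response = requests.post(f"{AUTH_SERVICE_URL}/validate", json=payload, timeout=5)

    # The contract: a well-formed request never yields a 5xx, and the body
    # always contains the field the gateway relies on.
    assert response.status_code in (200, 401)
    assert "valid" in response.json()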

5 Whys pitfalls to avoid:

  • Stopping too early. “The engineer deployed bad code” is not a root cause. Why was the code bad? Why was it not caught in review? Why did tests not catch it?
  • Blaming humans. If a “why” answer is “because engineer X made a mistake,” reframe it as a system question: “Why did the system allow that mistake to reach production?”
  • Single chain. Complex incidents often have multiple contributing causes. If the 5 Whys analysis feels like it is missing important factors, switch to a fishbone diagram.
  • Going too deep. If you reach “because the universe is entropic,” you have gone too far. Stop when you reach a cause you can actionably address.

Fishbone Diagram (Ishikawa)#

A fishbone diagram maps multiple contributing causes organized by category. It is more suitable than 5 Whys for complex incidents with several interacting failure modes.

The standard categories for infrastructure incidents:

                                [INCIDENT]
                                     |
    +-- Process
    |     +-- No deploy freeze on Fridays
    |     +-- No canary deployment process
    +-- People
    |     +-- New on-call engineer unfamiliar with the service
    |     +-- Runbook missing an auth-service section
    +-- Technology
    |     +-- No circuit breaker on the auth-service dependency
    |     +-- Retry storm amplified the failure
    +-- Monitoring
    |     +-- Alert too slow to detect the partial failure
    |     +-- No dashboard for auth-service errors
    +-- Environment
    |     +-- Traffic spike from a marketing campaign
    |     +-- No auto-scaling configured
    +-- External
          +-- Upstream provider API changed its response format

Each branch identifies a contributing factor. Not every factor is a root cause, but each one represents an opportunity to improve the system. The fishbone makes it visible that incidents are rarely caused by a single failure – they result from multiple small weaknesses aligning.

Contributing vs. Root Causes#

Distinguish between the root cause (the primary failure that triggered the incident) and contributing causes (factors that made the incident worse or harder to resolve).

## Root Cause
The auth-service schema migration was backward-incompatible with the
api-gateway's request format, causing /validate to return 500 errors.

## Contributing Factors
1. No contract tests between api-gateway and auth-service.
2. The deployment occurred on Friday afternoon with reduced staffing.
3. The auth-service error dashboard did not exist, delaying diagnosis by 4 min.
4. The on-call engineer was unfamiliar with auth-service architecture.
5. No circuit breaker on the api-gateway's auth-service client, so failures
   cascaded to all requests rather than degrading gracefully.

Each contributing factor becomes a candidate for an action item. The root cause gets the highest-priority action item.
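
Contributing factor 5 is worth pausing on, because it is the difference between a degraded feature and a full outage. Below is a minimal sketch of a circuit breaker around the auth-service call, assuming a hypothetical call_auth_service() client function; a production service would more likely use an existing resilience library, but the mechanics are the same:

import time

class CircuitBreaker:
    """Fail fast after repeated failures; allow a trial call after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        # While open, reject immediately instead of piling requests onto a
        # dependency that is already failing (this is what stops the cascade).
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: auth-service unavailable")
            self.opened_at = None  # half-open: let one trial call through

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failure_count = 0
        return result

# Usage, with call_auth_service standing in for the real client:
#   breaker = CircuitBreaker()
#   result = breaker.call(call_auth_service, token)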

Action Item Tracking#

Action items are the entire point of a post-mortem. Without tracked, assigned, deadlined action items, post-mortems are just stories.

Action Item Format#

Every action item needs an identifier and five properties (action, priority, owner, deadline, status):

## Action Items

| ID | Action | Priority | Owner | Deadline | Status |
|---|---|---|---|---|---|
| PM-042-1 | Add contract tests for auth-service /validate endpoint | P1 | @bob | 2026-03-08 | Open |
| PM-042-2 | Implement circuit breaker on api-gateway auth client | P1 | @carol | 2026-03-15 | Open |
| PM-042-3 | Create auth-service error rate dashboard in Grafana | P2 | @dave | 2026-03-01 | Open |
| PM-042-4 | Add auth-service section to api-gateway runbook | P2 | @alice | 2026-02-28 | Open |
| PM-042-5 | Establish Friday deployment freeze policy (after 2 PM) | P3 | @eng-manager | 2026-03-15 | Open |

Priority Definitions#

  • P1: Directly prevents recurrence of this specific incident. Must be completed within 2 weeks.
  • P2: Reduces impact or detection time for similar incidents. Complete within 4 weeks.
  • P3: Improves resilience generally. Complete within a quarter.

Tracking Completion#

Action items that live only in the post-mortem document get forgotten. Create them as tickets in your issue tracker (Jira, Linear, GitHub Issues) and link them back to the post-mortem. Review outstanding post-mortem action items in your regular team planning meetings.

Track completion rates as a team metric. If post-mortems consistently produce action items that are never completed, the post-mortem process is wasting everyone’s time. A healthy completion rate is above 80% within the stated deadlines. Below 50% signals that either the action items are unrealistic or the team does not prioritize them.
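
A sketch of how that metric can be computed, assuming action items are exported from the tracker as simple records; the field names and example data are hypothetical:

from datetime import date

# Hypothetical export of post-mortem action items from the issue tracker.
action_items = [
    {"id": "PM-042-1", "deadline": date(2026, 3, 8),  "closed_on": date(2026, 3, 5)},
    {"id": "PM-042-2", "deadline": date(2026, 3, 15), "closed_on": None},
    {"id": "PM-042-3", "deadline": date(2026, 3, 1),  "closed_on": date(2026, 3, 10)},
]

def on_time_completion_rate(items, as_of: date) -> float:
    """Fraction of items due by `as_of` that were closed on or before their deadline."""
    due = [item for item in items if item["deadline"] <= as_of]
    if not due:
        return 1.0
    on_time = [item for item in due
               if item["closed_on"] is not None and item["closed_on"] <= item["deadline"]]
    return len(on_time) / len(due)

rate = on_time_completion_rate(action_items, as_of=date(2026, 4, 1))
print(f"On-time completion: {rate:.0%}")  # 33% here, well below the 80% target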

Post-Mortem Template#

# Post-Mortem: [Incident Title]

**Date of Incident:** [date]
**Duration:** [total time of user-facing impact]
**Severity:** [SEV-1 / SEV-2 / SEV-3]
**Author:** [post-mortem author]
**Post-Mortem Date:** [date post-mortem was written]
**Review Meeting:** [date of review meeting]

## Summary
[2-3 sentences describing what happened, who was affected, and how it was
resolved. Write this for someone who has 30 seconds to understand the incident.]

## Impact
- **Users affected:** [number or percentage]
- **Duration of impact:** [minutes/hours]
- **Revenue impact:** [if applicable]
- **Data loss:** [yes/no, details]
- **SLO impact:** [error budget consumed]

## Timeline
[Chronological table of events, as described above]

## Root Cause
[Clear description of the primary technical failure]

## Contributing Factors
[Numbered list of factors that amplified the impact or delayed resolution]

## Root Cause Analysis
[5 Whys analysis or fishbone diagram]

## What Went Well
- [Things that worked as expected during the incident]
- [Good decisions made under pressure]
- [Monitoring that caught the issue quickly]

## What Went Poorly
- [Things that made the incident worse or harder to resolve]
- [Missing monitoring, documentation, or automation]
- [Communication gaps]

## Where We Got Lucky
- [Things that could have been worse but were not, due to chance]
- [Near-misses within the incident that did not escalate]

## Action Items
[Table with ID, action, priority, owner, deadline, status]

## Lessons Learned
[Key takeaways that apply beyond this specific incident]
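
The SLO impact line in the Impact block is the one most often left blank, yet the arithmetic is short. A sketch using the 16-minute outage from the 5 Whys example and a hypothetical 99.9% monthly availability SLO:

# Hypothetical SLO: 99.9% availability over a 30-day window.
slo_target = 0.999
window_minutes = 30 * 24 * 60                             # 43,200 minutes
error_budget_minutes = window_minutes * (1 - slo_target)  # 43.2 minutes

incident_minutes = 16  # duration of user-facing impact
budget_consumed = incident_minutes / error_budget_minutes

print(f"Error budget consumed: {budget_consumed:.0%}")    # roughly 37% of the month's budget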

The “Where We Got Lucky” Section#

This section is often overlooked but is one of the most valuable. It captures risks that did not materialize during this incident but could in the future.

Example: “The incident occurred during low-traffic hours (2 PM on a weekday). If this had happened during our peak traffic window (7-9 PM), the impact would have been 10x worse because the circuit breaker was not in place.” This observation strengthens the case for the circuit breaker action item and conveys urgency that the root cause section alone does not.

Running the Post-Mortem Review Meeting#

Before the Meeting#

  • The post-mortem author writes the document before the meeting. The meeting is for review and discussion, not for drafting.
  • Distribute the document at least 24 hours before the meeting so attendees arrive prepared.
  • Invite all engineers who were involved in the incident, the service owner, and the engineering manager.

During the Meeting (45-60 minutes)#

Agenda:
 0:00-0:05  Facilitator sets blameless tone, reviews ground rules
 0:05-0:15  Walk through the timeline (author presents, attendees correct)
 0:15-0:30  Discuss root cause and contributing factors
 0:30-0:40  Review action items (are they correct? complete? prioritized?)
 0:40-0:50  Open discussion: What did we miss? Are there broader implications?
 0:50-0:55  Assign action item owners and deadlines
 0:55-1:00  Wrap up: What is the single most important thing we learned?

Blameless Ground Rules#

State these explicitly at the start of every meeting:

  1. We assume everyone involved acted with the best intentions and the information available to them at the time.
  2. We focus on what the system allowed to happen, not who did what.
  3. “Human error” is never a root cause. It is a starting point for asking why the system did not prevent or catch the error.
  4. Every contributing factor is an opportunity to improve. We are not looking for a single person to blame.

Facilitation Tips#

The facilitator (ideally not the person who caused the incident, and ideally not their manager) keeps the conversation productive:

  • If discussion drifts toward blame (“why did you deploy without checking”), redirect: “Let’s talk about what checks the deployment pipeline should have performed automatically.”
  • If someone is defensive, acknowledge their perspective and move to systemic questions: “That makes sense given what you knew at the time. What could the system have shown you to change that decision?”
  • If action items are vague (“improve monitoring”), push for specificity: “Which metric? What threshold? What alert name? Who will implement it by when?”

Building a Learning Culture#

Post-mortems are only valuable if the organization learns from them. Several practices reinforce this.

Publish post-mortems widely. Store them in a searchable location accessible to all engineers, not buried in a team-specific wiki. When someone encounters a similar failure, they should find the previous post-mortem through search.

Weekly or monthly incident review. A regular meeting where teams present their post-mortems to a broader audience. This shares lessons across team boundaries and reveals patterns (if three teams independently discover they need circuit breakers, that is a platform-level investment).

Track recurrence. If the same root cause appears in multiple post-mortems, the action items from the first post-mortem were either incomplete or not implemented. This is a signal to escalate the systemic issue.

Celebrate good post-mortems. Recognize engineers who write thorough post-mortems and identify non-obvious contributing factors. The behavior you celebrate is the behavior you get. If post-mortems are treated as punishment, people will minimize incidents to avoid writing them.

No silent amnesty for action items. If action items are consistently not completed, acknowledge the capacity constraint honestly rather than letting items rot. It is better to explicitly de-prioritize an action item (with documented risk acceptance) than to leave it open indefinitely and erode trust in the process.

Practical Example: Database Connection Pool Exhaustion#

# Post-Mortem: API Outage Due to Database Connection Pool Exhaustion

**Date:** 2026-02-10
**Duration:** 14 minutes of user-facing impact
**Severity:** SEV-2
**Author:** Alice Chen

## Summary
The api-gateway exhausted its database connection pool after a slow query
caused connections to pile up. Users received 503 errors for 14 minutes
until the on-call engineer killed the blocking query and restarted the
api-gateway pods to clear stale connections.

## Timeline
| Time (UTC) | Event |
|---|---|
| 09:41 | Marketing campaign drives 3x normal traffic to /products endpoint |
| 09:44 | Slow query on products table begins (missing index on new column) |
| 09:47 | Connection pool reaches 90% capacity |
| 09:49 | Connection pool exhausted, new requests receive 503 |
| 09:50 | Alert fires: APIHighErrorRate |
| 09:51 | Alice acknowledges page, opens Grafana |
| 09:54 | Alice identifies connection pool exhaustion from metrics |
| 09:57 | Alice kills the blocking query via psql |
| 09:59 | Alice restarts api-gateway pods to clear stale connections |
| 10:03 | Connection pool recovers, error rate drops to baseline |
| 10:12 | Alice confirms full resolution |

## Root Cause
A new column added to the products table lacked an index. Under elevated
traffic, a query on this column became slow enough to hold connections open
longer than the pool could sustain.

## Contributing Factors
1. No query performance testing in CI for schema migrations.
2. Connection pool size was set to the default (10) and never tuned.
3. No alerting on connection pool utilization.
4. The marketing campaign was not communicated to the platform team.

## What Went Well
- Alert fired within 1 minute of user impact.
- On-call engineer correctly diagnosed connection pool exhaustion quickly.
- Runbook had a section on killing blocking queries.

## Where We Got Lucky
- The blocking query was a SELECT, not a write operation. A long-running
  write holding locks would have caused cascading failures in other services.

## Action Items
| ID | Action | Priority | Owner | Deadline |
|---|---|---|---|---|
| PM-051-1 | Add index on products.campaign_id | P1 | @bob | 2026-02-12 |
| PM-051-2 | Increase connection pool to 50, add pool utilization metric | P1 | @carol | 2026-02-17 |
| PM-051-3 | Add connection pool exhaustion alert (>80% for 2min) | P1 | @carol | 2026-02-17 |
| PM-051-4 | Add query performance regression test for migrations | P2 | @dave | 2026-03-01 |
| PM-051-5 | Establish process for marketing to notify platform before campaigns | P3 | @eng-mgr | 2026-03-15 |

This example demonstrates the complete post-mortem structure applied to a realistic incident. The root cause is technical and specific. The contributing factors identify multiple improvement opportunities. The action items are concrete, assigned, and deadlined. The “Where We Got Lucky” section flags a risk that did not materialize but should be addressed.
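
For completeness, a sketch of what PM-051-2 and PM-051-3 might translate to in code, assuming the service uses SQLAlchemy for its connection pool and prometheus_client for metrics; the post-mortem does not name the stack, so both libraries, the connection string, and the metric name are assumptions:

from prometheus_client import Gauge
from sqlalchemy import create_engine

# PM-051-2: raise the pool size from the default and expose utilization.
engine = create_engine(
    "postgresql://app@db-host/products",  # hypothetical connection string
    pool_size=50,      # was the untuned default of 10 during the incident
    max_overflow=10,   # short bursts above pool_size before requests queue
    pool_timeout=5,    # fail fast instead of letting requests pile up
)

# PM-051-3: the alert (>80% for 2min) needs a utilization metric to fire on.
pool_in_use = Gauge(
    "db_pool_connections_in_use",
    "Database connections currently checked out of the pool",
)

def record_pool_utilization() -> None:
    """Call periodically (for example from a scrape hook) to feed the alert."""
    pool_in_use.set(engine.pool.checkedout())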