Why Regular Reviews Matter#

Reliability does not improve by accident. Without a structured review cadence, teams operate on vibes – “things feel okay” or “we’ve been having a lot of incidents lately.” Reliability reviews replace gut feelings with data. They surface slow-burning problems before they become outages, hold teams accountable for improvement actions, and create a shared understanding of system health across engineering and leadership.

Weekly Reliability Review#

The weekly review is a 30-minute tactical meeting focused on what happened this week and what needs attention next week. Attendees: on-call engineers, team leads, SRE. Keep it tight.

Agenda Template#

## Weekly Reliability Review - [Date]
Duration: 30 minutes

### 1. On-Call Summary (10 min)
- Total pages this week: ___
- SEV-1/SEV-2 incidents: ___
- Notable alerts (new, recurring, false positives): ___
- On-call engineer's assessment: Quiet / Normal / Painful

### 2. Error Budget Status (5 min)
Review the error budget dashboard for all Tier-1 services.
Flag any service below 50% remaining budget.

| Service      | Budget Remaining | Trend  | Action Needed? |
|--------------|------------------|--------|----------------|
| payment-api  | 62%              | Stable | No             |
| search-api   | 18%              | Down   | Yes            |
| auth-service | 88%              | Stable | No             |

### 3. Open Action Items (10 min)
Review items from previous reviews and incidents.
Update status. Escalate blocked items.

| Item                              | Owner  | Status      | Due        |
|-----------------------------------|--------|-------------|------------|
| Fix connection pool sizing        | Carol  | In Progress | 2026-02-28 |
| Add circuit breaker to search-svc | Dave   | Blocked     | 2026-02-25 |
| Update failover runbook           | Bob    | Done        | 2026-02-20 |

### 4. Upcoming Risks (5 min)
- Planned deployments this week
- Maintenance windows
- Known upcoming traffic events
- Dependency changes or deprecations

The Dashboard#

Build a single Grafana dashboard that the weekly review opens on screen. It should answer these questions at a glance:

# Grafana dashboard panels for weekly reliability review
panels:
  - title: "SLO Status - All Services"
    type: stat
    query: "slo:error_budget_remaining:ratio"
    thresholds: [0.25, 0.50, 0.75]  # red, orange, yellow, green

  - title: "Pages This Week"
    type: stat
    query: "increase(pagerduty_incidents_total[7d])"

  - title: "Error Rate Trend (7d)"
    type: timeseries
    query: "sum(rate(http_requests_total{status=~'5..'}[1h])) by (service)"

  - title: "P99 Latency Trend (7d)"
    type: timeseries
    query: "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"

  - title: "Deployment Frequency (7d)"
    type: stat
    query: "increase(deployments_total[7d])"

  - title: "Change Failure Rate (7d)"
    type: stat
    query: |
      increase(deployments_rollback_total[7d])
      / increase(deployments_total[7d])
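The first panel's recording rule also feeds the weekly review's "flag any service below 50% budget" step, which is easy to automate. A minimal sketch, assuming a Prometheus API reachable at `prometheus:9090` and a `service` label on the `slo:error_budget_remaining:ratio` series (both assumptions; adapt to your setup):

```python
# Flag Tier-1 services whose remaining error budget is below 50%.
# PROM_URL and the "service" label name are assumptions.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # assumed Prometheus endpoint


def budget_remaining(prom_url: str = PROM_URL) -> dict:
    """Return {service: fraction_of_budget_remaining} from the recording rule."""
    query = urllib.parse.urlencode({"query": "slo:error_budget_remaining:ratio"})
    with urllib.request.urlopen(f"{prom_url}/api/v1/query?{query}") as resp:
        data = json.load(resp)
    return {
        r["metric"].get("service", "unknown"): float(r["value"][1])
        for r in data["data"]["result"]
    }


def flag_low_budget(budgets: dict, threshold: float = 0.50) -> list:
    """Services to call out in section 2 of the weekly review."""
    return sorted(svc for svc, remaining in budgets.items() if remaining < threshold)


# With the table's numbers:
# flag_low_budget({"payment-api": 0.62, "search-api": 0.18, "auth-service": 0.88})
# → ["search-api"]
```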

Monthly Reliability Review#

The monthly review is a 60-minute strategic meeting that looks at trends, systemic issues, and cross-team dependencies. Attendees: engineering leads, SRE, product management, engineering manager.

Agenda Template#

## Monthly Reliability Review - [Month Year]
Duration: 60 minutes

### 1. SLO Performance Summary (10 min)
Review 30-day SLO performance for all services.

| Service       | SLO   | Achieved | Budget Consumed | Incidents |
|---------------|-------|----------|-----------------|-----------|
| payment-api   | 99.9% | 99.92%   | 38%             | 1 SEV-2   |
| search-api    | 99.5% | 99.1%    | 82%             | 3 SEV-3   |
| auth-service  | 99.9% | 99.97%   | 12%             | 0         |
| data-pipeline | 99.5% | 99.8%    | 22%             | 0         |

### 2. Incident Trend Analysis (15 min)
- Total incidents this month vs previous month
- Incidents by severity
- Incidents by service
- Incidents by root cause category
- Repeat incidents (same root cause as previous incident)
- Mean time to detect, acknowledge, mitigate, resolve

### 3. Dependency Risk Assessment (15 min)
Review health of critical dependencies.

| Dependency      | Type     | Risk Level | Issues This Month    |
|-----------------|----------|------------|----------------------|
| AWS RDS         | Infra    | Low        | None                 |
| Stripe API      | External | Medium     | 2 degraded periods   |
| Redis cluster   | Infra    | High       | Memory pressure, OOM |
| search-provider | External | Medium     | Latency spikes       |

### 4. Reliability Project Status (10 min)
- Automation projects in progress
- Toil reduction initiatives
- Capacity planning updates
- Upcoming architecture changes

### 5. Action Items and Priorities (10 min)
- New action items from this review
- Priority stack rank for reliability work next month
- Resource requests or escalations
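The mean-time metrics in section 2 fall straight out of incident timestamps. A sketch, assuming each incident exports ISO-8601 timestamps under these (assumed) field names; adapt to your incident tracker's export format:

```python
# Mean time to detect / acknowledge / mitigate / resolve across a set
# of incidents. Field names (started, detected, acknowledged, mitigated,
# resolved) are assumptions about the export format.
from datetime import datetime
from statistics import mean


def _minutes(later: datetime, earlier: datetime) -> float:
    return (later - earlier).total_seconds() / 60


def mean_times(incidents: list) -> dict:
    """Each incident is a dict of ISO-8601 timestamp strings."""
    parse = datetime.fromisoformat
    return {
        # detection lag: incident start until monitoring caught it
        "MTTD": mean(_minutes(parse(i["detected"]), parse(i["started"])) for i in incidents),
        # acknowledgment lag: page fired until a human responded
        "MTTA": mean(_minutes(parse(i["acknowledged"]), parse(i["detected"])) for i in incidents),
        # mitigation lag: detection until user impact stopped
        "MTTM": mean(_minutes(parse(i["mitigated"]), parse(i["detected"])) for i in incidents),
        # full resolution: incident start until closed out
        "MTTR": mean(_minutes(parse(i["resolved"]), parse(i["started"])) for i in incidents),
    }
```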

Incident Trend Analysis#

Tracking individual incidents is necessary. Tracking trends across incidents is where the real insight lives. Categorize every incident by root cause:

Root Cause Categories:
- Configuration change (bad config push, feature flag)
- Code defect (bug in application logic)
- Capacity (traffic spike, resource exhaustion)
- Dependency failure (upstream/downstream service)
- Infrastructure (hardware, cloud provider, network)
- Human error (manual procedure gone wrong)
- Security (attack, vulnerability exploitation)

Plot these monthly:

Month     | Config | Code | Capacity | Dependency | Infra | Human |
----------|--------|------|----------|------------|-------|-------|
November  |   2    |  3   |    1     |     2      |   0   |   1   |
December  |   1    |  2   |    3     |     1      |   1   |   0   |
January   |   4    |  1   |    1     |     3      |   0   |   2   |
February  |   3    |  2   |    0     |     2      |   1   |   1   |

In this example, configuration changes are trending up. That signals a need for better config validation, canary deployments for config changes, or config-as-code review processes. Without the trend view, each individual config incident looks like a one-off. Together, they reveal a systemic problem.
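A monthly tally like the one above can be generated mechanically from an incident export. A sketch, where the `month` and `category` field names are assumptions about that export:

```python
# Tally incidents by root-cause category per month, producing the data
# behind the trend table. The "month"/"category" fields are assumptions.
from collections import Counter, defaultdict

# The categories from the list above.
CATEGORIES = ["Config", "Code", "Capacity", "Dependency", "Infra", "Human", "Security"]


def trend_table(incidents: list) -> dict:
    """Return {month: {category: count}}, with zero-filled categories so
    quiet categories still show up (a zero row is itself a signal)."""
    by_month = defaultdict(Counter)
    for inc in incidents:
        by_month[inc["month"]][inc["category"]] += 1
    return {
        month: {cat: counts.get(cat, 0) for cat in CATEGORIES}
        for month, counts in by_month.items()
    }
```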

Repeat Incident Detection#

Flag any incident that shares a root cause with an incident from the previous 90 days. Repeat incidents indicate that previous postmortem action items either were not completed or did not fix the underlying issue.

-- Query for repeat incidents (assumes incident tracking DB)
SELECT
  i1.id as current_incident,
  i2.id as previous_incident,
  i1.root_cause_category,
  i1.service,
  i1.created_at,
  i2.created_at as previous_date
FROM incidents i1
JOIN incidents i2
  ON i1.root_cause_category = i2.root_cause_category
  AND i1.service = i2.service
  AND i1.created_at > i2.created_at
  AND i1.created_at - i2.created_at < INTERVAL '90 days'
WHERE i1.created_at >= NOW() - INTERVAL '30 days'
ORDER BY i1.created_at DESC;

Dependency Risk Assessment#

Every external and internal dependency is a risk vector. Maintain a dependency risk register and review it monthly.

Score each dependency on three dimensions:

Impact (1-5): How badly does failure of this dependency affect users?
Likelihood (1-5): How often does this dependency have issues?
Mitigation (1-5): How well can we handle a failure? (5 = fully mitigated)

Risk Score = Impact × Likelihood × (6 - Mitigation)

Example:
  Stripe API: Impact=5, Likelihood=2, Mitigation=3 → 5×2×3 = 30
  Redis cache: Impact=3, Likelihood=3, Mitigation=4 → 3×3×2 = 18
  Internal auth: Impact=5, Likelihood=1, Mitigation=2 → 5×1×4 = 20

Any dependency scoring above 40 needs an active mitigation project. Scores from 20 to 40 belong on the reliability roadmap. Below 20, monitor and reassess quarterly.
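The formula and thresholds translate directly to code. A minimal sketch (function names are illustrative, not from any particular library):

```python
# Dependency risk scoring: Risk = Impact x Likelihood x (6 - Mitigation),
# each input on a 1-5 scale. Names here are illustrative.


def risk_score(impact: int, likelihood: int, mitigation: int) -> int:
    for value in (impact, likelihood, mitigation):
        if not 1 <= value <= 5:
            raise ValueError("each score must be between 1 and 5")
    # Higher mitigation lowers risk, so it enters inverted.
    return impact * likelihood * (6 - mitigation)


def risk_band(score: int) -> str:
    """Map a score to the review action described above."""
    if score > 40:
        return "active mitigation project"
    if score >= 20:
        return "reliability roadmap"
    return "monitor, reassess quarterly"


# Stripe API example above: risk_score(5, 2, 3) → 30
```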

Action Item Tracking#

Reliability reviews generate action items. Without rigorous tracking, those items rot. Use a dedicated tracker – not a general project board where they get lost among feature work.

# Reliability action item schema
action_items:
  - id: REL-042
    title: "Add circuit breaker to search-api → search-provider"
    source: "Monthly review 2026-02"
    owner: Carol
    priority: P1
    due_date: 2026-03-15
    status: in_progress
    linked_incidents: [INC-301, INC-307]
    notes: "Using resilience4j. PR in review."

  - id: REL-043
    title: "Increase Redis cluster memory by 50%"
    source: "Weekly review 2026-02-17"
    owner: Dave
    priority: P1
    due_date: 2026-02-28
    status: blocked
    blocker: "Waiting on budget approval from finance"
    escalated_to: engineering-manager

Track completion rate monthly. If fewer than 70% of action items are completed by their due date, either the team is overloaded, the due dates are unrealistic, or reliability work is being deprioritized. All three are problems worth surfacing in the monthly review.
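The completion-rate check is a one-liner worth automating so it never gets skipped. A sketch mirroring the schema above (the `due_date`/`status` field names follow that YAML; the function names are assumptions):

```python
# Monthly action-item completion rate against the 70% threshold.
# Field names mirror the YAML schema above; function names are assumptions.
from datetime import date


def completion_rate(items: list, as_of: date) -> float:
    """Fraction of items due on or before as_of that are done."""
    due = [i for i in items if i["due_date"] <= as_of]
    if not due:
        return 1.0  # nothing was due, so nothing is late
    done = [i for i in due if i["status"] == "done"]
    return len(done) / len(due)


def needs_escalation(items: list, as_of: date, threshold: float = 0.70) -> bool:
    """Flag for the monthly review when completion falls below 70%."""
    return completion_rate(items, as_of) < threshold
```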

The reliability review process works because it creates a regular heartbeat of attention on system health. When reviews happen consistently and action items are tracked to completion, reliability improves measurably quarter over quarter. When reviews are skipped or treated as optional, reliability degrades silently until it surfaces as a major incident.