Why Services Need a Gate Before Production#

Every production outage caused by a service that launched without monitoring, without runbooks, without capacity planning, without anyone knowing who owns it at 3 AM – every one of those was preventable. A production readiness review (PRR) is the gate between “it works on my machine” and “it is ready for real users.” Google formalized this as the PRR process; you do not need Google-scale infrastructure to benefit from it.

The PRR is not a bureaucratic checkbox exercise. It is a structured conversation between the development team and the operations/SRE team that surfaces gaps before they become incidents.

The PRR Checklist#

Organize the review into five domains. Each item is pass/fail with concrete criteria.

Observability#

[ ] Service emits request rate, error rate, and latency metrics (RED)
[ ] Dashboard exists showing SLI metrics with 7-day and 30-day views
[ ] Structured logging is in place (JSON format, correlation IDs)
[ ] Distributed tracing is instrumented for all inbound/outbound calls
[ ] Alerts exist for SLO violations (burn-rate based, not static thresholds)
[ ] Log retention meets compliance requirements (minimum 30 days)
[ ] Health check endpoint exists and is registered with load balancer
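
Several of these items can be made concrete in config. Below is a minimal sketch of the burn-rate alerting item as a Prometheus rule, assuming a 99.9% availability SLO and hypothetical recording rules payment_api:error_ratio:rate5m and payment_api:error_ratio:rate1h; the exact windows and burn factors should come from your own SLO policy.

# Example burn-rate alert for a 99.9% availability SLO (Prometheus rule file).
# Recording rule names, windows, and factors are illustrative, not prescribed.
groups:
  - name: payment-api-slo-alerts
    rules:
      - alert: PaymentApiErrorBudgetBurn
        # Page when the error budget burns ~14x faster than sustainable,
        # confirmed on both a short and a long window to reduce flapping.
        expr: |
          payment_api:error_ratio:rate5m > (14.4 * 0.001)
          and
          payment_api:error_ratio:rate1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "payment-api error budget burn rate is over 14x sustainable"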

Reliability#

[ ] SLOs are defined and documented with error budget policy
[ ] Service handles dependency failures gracefully (timeouts, retries, circuit breakers)
[ ] Graceful degradation path exists for non-critical dependencies
[ ] Deployment rollback takes less than 5 minutes
[ ] Service can be restarted without data loss or corruption
[ ] Database migrations are backward compatible (can roll back one version)
[ ] No single points of failure in the architecture
[ ] Rate limiting is configured for all public endpoints
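
For the dependency-failure items above, the behavior can live in application code, a client library, or the mesh layer. The sketch below assumes Istio and shows timeouts, retries, and outlier detection (Istio's circuit-breaking mechanism) for a hypothetical search-service dependency; the specific numbers are illustrative.

# Example timeout, retry, and circuit-breaker settings for a downstream dependency,
# expressed as Istio config. Hosts and thresholds are illustrative.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: search-service
spec:
  hosts: ["search-service"]
  http:
    - route:
        - destination:
            host: search-service
      timeout: 2s              # total per-request budget seen by callers
      retries:
        attempts: 2
        perTryTimeout: 800ms
        retryOn: 5xx,connect-failure
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: search-service
spec:
  host: search-service
  trafficPolicy:
    outlierDetection:          # eject endpoints that keep failing (circuit breaking)
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 1m
      maxEjectionPercent: 50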

Security#

[ ] Authentication and authorization enforced on all endpoints
[ ] Secrets are managed via vault/secrets manager (not environment variables in code)
[ ] Network policies restrict traffic to required paths only
[ ] Dependencies are scanned for known vulnerabilities (Dependabot/Snyk)
[ ] Input validation on all user-facing endpoints
[ ] TLS enforced for all service-to-service communication
[ ] Audit logging for sensitive operations (data access, config changes)
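
The network-policy item usually means default-deny plus explicit allowances. A minimal sketch, assuming Kubernetes with a CNI that enforces NetworkPolicy; the namespace, labels, and port are illustrative.

# Example: only the API gateway namespace may reach payment-api pods. Labels are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-api-allow-gateway
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-api
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gateway
      ports:
        - protocol: TCP
          port: 8443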

Capacity#

[ ] Load testing completed at 2x expected peak traffic
[ ] Resource requests and limits are set (CPU, memory)
[ ] Horizontal pod autoscaler configured with appropriate thresholds
[ ] Database connection pool sized for expected concurrency
[ ] Storage growth projections documented for next 6 months
[ ] CDN/caching strategy defined for cacheable content
[ ] Dependency capacity confirmed (downstream services can handle your traffic)
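
For the resource and autoscaling items, a minimal sketch assuming Kubernetes; the replica range and the 70% CPU target are illustrative starting points to validate against the 2x load test, not recommendations.

# Example HPA for payment-api (autoscaling/v2). Replica counts and targets are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 3          # enough headroom to survive a zone loss
  maxReplicas: 20         # ceiling confirmed by the 2x peak load test
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70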

Documentation and Operations#

[ ] Architecture diagram exists and is current
[ ] Runbook covers common failure modes and remediation steps
[ ] On-call team is identified and trained on the service
[ ] Escalation path is documented and configured in PagerDuty/OpsGenie
[ ] Data flow diagram shows PII/sensitive data paths
[ ] Deployment pipeline documented (how to deploy, rollback, feature flag)
[ ] Service catalog entry created (owner, dependencies, SLOs, contact)
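
The service catalog item could look like the sketch below if you use Backstage; any catalog that records owner, lifecycle, dependencies, and an escalation hook works equally well. All values are illustrative.

# Example Backstage-style catalog entry for payment-api. Values are illustrative.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-api
  annotations:
    pagerduty.com/service-id: PXXXXXX    # escalation hook for on-call routing
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: checkout
  dependsOn:
    - component:search-service
    - resource:payments-db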

Automated PRR Scoring#

Manual checklists work but do not scale. Build automated checks where possible and score the rest manually. Grading each item 0-3 rather than pass/fail records partial credit (for example, structured logging that still lacks correlation IDs) and makes trends visible across reviews.

# prr-scorecard.yaml
service: payment-api
review_date: 2026-02-15
reviewer: sre-team

scoring:
  observability:
    metrics_instrumented: { score: 3, max: 3, automated: true }
    dashboard_exists: { score: 3, max: 3, automated: true }
    structured_logging: { score: 2, max: 3, note: "Missing correlation IDs" }
    distributed_tracing: { score: 3, max: 3, automated: true }
    slo_alerts: { score: 3, max: 3, automated: true }
    log_retention: { score: 3, max: 3 }
    health_check: { score: 3, max: 3, automated: true }
    subtotal: 20/21

  reliability:
    slos_defined: { score: 3, max: 3 }
    dependency_handling: { score: 2, max: 3, note: "No circuit breaker on search-service" }
    graceful_degradation: { score: 1, max: 3, note: "Hard fails when cache is unavailable" }
    rollback_time: { score: 3, max: 3, automated: true }
    restart_safety: { score: 3, max: 3 }
    migration_compat: { score: 3, max: 3 }
    no_spof: { score: 3, max: 3 }
    rate_limiting: { score: 3, max: 3 }
    subtotal: 21/24

  # ... security, capacity, documentation sections follow same pattern

overall:
  total_score: 87/105
  percentage: 82.9%
  result: "CONDITIONAL PASS"
  required_actions:
    - "Add circuit breaker for search-service dependency (due: 2026-03-01)"
    - "Implement graceful degradation when cache unavailable (due: 2026-03-01)"
    - "Add correlation IDs to structured logging (due: 2026-02-28)"

Scoring thresholds:

90-100%: PASS - Clear to launch
75-89%:  CONDITIONAL PASS - Launch with required action items (due within 30 days)
60-74%:  DEFER - Significant gaps, re-review after remediation
Below 60%: FAIL - Not ready for production traffic
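
If the internal PRR service that the launch gate below queries stores these thresholds as data rather than hard-coding them, they stay reviewable alongside the scorecards. The format here is purely illustrative.

# Example: thresholds as config the PRR service can evaluate. Format is illustrative.
prr_policy:
  thresholds:
    - { min_percent: 90, result: PASS }
    - { min_percent: 75, result: CONDITIONAL_PASS, actions_due_within_days: 30 }
    - { min_percent: 60, result: DEFER }
    - { min_percent: 0,  result: FAIL }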

Launch Gates#

Integrate PRR into your deployment pipeline as an actual gate, not just a meeting.

# GitHub Actions launch gate example
name: Production Launch Gate
on:
  pull_request:
    branches: [main]
    paths: ['deploy/production/**']

jobs:
  prr-gate:
    runs-on: ubuntu-latest
    steps:
      - name: Check PRR status
        run: |
          PRR_STATUS=$(curl -s https://internal-api/prr/payment-api/status)
          if [ "$PRR_STATUS" != "PASS" ] && [ "$PRR_STATUS" != "CONDITIONAL_PASS" ]; then
            echo "PRR status is $PRR_STATUS. Production deploy blocked."
            echo "Complete PRR at: https://wiki/prr/payment-api"
            exit 1
          fi

      - name: Check required actions
        run: |
          OVERDUE=$(curl -s https://internal-api/prr/payment-api/overdue-actions)
          if [ "$OVERDUE" != "0" ]; then
            echo "$OVERDUE overdue PRR action items. Resolve before deploying."
            exit 1
          fi

Graduation Criteria#

Not every service needs the full PRR. Define service tiers based on criticality and apply PRR depth accordingly.

Tier 1 (Critical): Full PRR required. Revenue-impacting, user-facing,
  or data-integrity services. Examples: payment processing, auth,
  primary API gateway.
  - Full checklist, all 5 domains
  - SRE team review required
  - Score >= 90% to pass

Tier 2 (Important): Standard PRR. Internal services with broad
  dependencies. Examples: notification service, search indexer,
  internal dashboards.
  - Full checklist, observability + reliability required
  - Peer review acceptable
  - Score >= 75% to pass

Tier 3 (Supporting): Lightweight PRR. Low-traffic internal tools,
  batch jobs, dev tooling.
  - Observability and documentation sections only
  - Self-assessment with spot checks
  - Score >= 60% to pass
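
Tier definitions are also worth keeping as config so the launch gate and the scorecard tooling can apply the right depth automatically. A sketch, with illustrative field names:

# Example tier definitions as config. Field names are illustrative.
tiers:
  tier-1:
    description: revenue-impacting, user-facing, or data-integrity services
    required_domains: [observability, reliability, security, capacity, documentation]
    reviewer: sre-team
    pass_threshold: 90
  tier-2:
    description: internal services with broad dependencies
    required_domains: [observability, reliability]
    reviewer: peer
    pass_threshold: 75
  tier-3:
    description: low-traffic internal tools, batch jobs, and dev tooling
    required_domains: [observability, documentation]
    reviewer: self-assessment
    pass_threshold: 60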

Adapting for Smaller Teams#

Google’s PRR process assumes dedicated SRE teams reviewing services built by separate dev teams. If your team is 5-15 engineers who both build and operate their services, adapt:

  • Replace the SRE review board with peer review. The engineer who knows the least about a service reviews its PRR. This surfaces documentation gaps.
  • Automate 60% or more of the checklist. Write CI checks that verify metrics endpoints exist, dashboards are provisioned, health checks respond, and resource limits are set (a sketch follows this list). Reserve manual review for the judgment calls.
  • Bundle PRR with sprint milestones. Do not make PRR a separate event. Include it in the “definition of done” for the sprint where a service goes to production.
  • Keep the PRR document alive. It is not a one-time gate. Re-review when the service’s scope changes significantly, when it moves tiers, or annually.
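
A sketch of the automation bullet above, in the same GitHub Actions style as the launch gate earlier. The staging hostname, metric names, Grafana URL, and GRAFANA_TOKEN secret are illustrative assumptions about your environment, not part of the original checklist.

# Example automated checklist checks (illustrative endpoints and metric names)
name: PRR Automated Checks
on:
  pull_request:
    branches: [main]

jobs:
  prr-checks:
    runs-on: ubuntu-latest
    steps:
      - name: Health check responds
        run: |
          curl -sf https://staging.internal/payment-api/healthz > /dev/null

      - name: RED metrics are exported
        run: |
          METRICS=$(curl -sf https://staging.internal/payment-api/metrics)
          echo "$METRICS" | grep -q 'http_requests_total' || { echo "request counter missing"; exit 1; }
          echo "$METRICS" | grep -q 'http_request_duration_seconds' || { echo "latency histogram missing"; exit 1; }

      - name: SLI dashboard is provisioned
        run: |
          curl -sf -H "Authorization: Bearer ${{ secrets.GRAFANA_TOKEN }}" \
            https://grafana.internal/api/dashboards/uid/payment-api > /dev/null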

The PRR process works because it forces teams to think about production concerns before they become production problems. A 2-hour review that catches a missing circuit breaker saves a 4-hour outage and a week of postmortem follow-up.