The SRE Model#
Site Reliability Engineering treats operations as a software engineering problem. Instead of a wall between developers who ship features and operators who keep things running, SRE defines reliability as a feature – one that can be measured, budgeted, and traded against velocity. The core insight is that 100% reliability is the wrong target. Users cannot tell the difference between 99.99% and 100%, but the engineering cost to close that gap is enormous. SRE makes this tradeoff explicit through service level objectives.
SLIs, SLOs, and SLAs#
These three concepts form a hierarchy. Get them right and every conversation about reliability becomes concrete instead of subjective.
Service Level Indicators (SLIs) are the raw measurements. An SLI is a quantitative measure of some aspect of service health. Good SLIs are user-facing and directly observable:
- Availability: The proportion of requests that succeed. Measured as `successful_requests / total_requests` over a time window.
- Latency: The proportion of requests served faster than a threshold. Measured as `requests_under_300ms / total_requests`. Use a distribution, not an average – p50, p95, and p99 tell different stories (see the sketch after this list).
- Throughput: Requests processed per second. Useful for batch systems and data pipelines.
- Error rate: The proportion of requests returning errors. Roughly the complement of availability (1 minus the success rate), but can be measured separately for different error classes (5xx vs 4xx).
- Freshness: For data systems, how old the most recent data is. Measured as time since last successful update.
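The latency point is worth making concrete. Here is a minimal sketch, using made-up request latencies, of why the mean hides what p95 and p99 reveal, alongside the threshold-style SLI from the list above:

```python
# Minimal sketch with made-up latencies: a healthy-looking mean can hide a
# bad tail, which is why latency SLIs use percentiles or thresholds.
from statistics import mean, quantiles

latencies_ms = [120] * 950 + [2500] * 50      # 5% of requests are very slow

percentiles = quantiles(latencies_ms, n=100)  # cut points p1..p99
p50, p95, p99 = percentiles[49], percentiles[94], percentiles[98]
print(f"mean={mean(latencies_ms):.0f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")

# Threshold-style latency SLI: proportion of requests served under 300ms
latency_sli = sum(1 for ms in latencies_ms if ms < 300) / len(latencies_ms)
print(f"latency SLI (<300ms): {latency_sli:.3f}")  # 0.950
```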
Measure SLIs at the point closest to the user. A load balancer’s success rate is a better SLI than an application’s internal health check because it captures network issues, TLS failures, and routing problems the application never sees.
Service Level Objectives (SLOs) are targets set on SLIs. An SLO says “99.9% of requests will succeed over a 30-day rolling window.” The SLO has three parts: the SLI (request success rate), the target (99.9%), and the measurement window (30 days).
Practical SLO examples:
```
# API service SLOs
Availability: 99.9% of requests return non-5xx responses (30-day rolling)
Latency: 95% of requests complete in under 200ms (30-day rolling)
Latency: 99% of requests complete in under 1000ms (30-day rolling)

# Data pipeline SLOs
Freshness: 99.5% of the time, data is less than 5 minutes old (30-day rolling)
Completeness: 99.9% of input records produce output records (per batch)
```

Service Level Agreements (SLAs) are contractual commitments to external customers with financial penalties. An SLA is always less aggressive than the internal SLO. If your SLO is 99.9%, your SLA might be 99.5%. The gap between SLO and SLA is your safety margin. If you only have an SLO and no paying customers demanding contractual guarantees, you do not need an SLA.
Defining SLOs in Practice#
Start by asking: what do users care about? Not what your monitoring measures, but what makes users happy or angry. For a web API, users care that it responds quickly and correctly. For a batch pipeline, users care that the output data is complete and recent.
Use Prometheus to compute the SLIs that feed these SLOs:
```
# Availability SLI: proportion of successful HTTP requests
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

# Latency SLI: proportion of requests under 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
```

Set up a Grafana dashboard that shows current SLO compliance and remaining error budget. The dashboard should answer two questions at a glance: are we meeting the SLO right now, and how much budget do we have left this period?
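The same expressions can also feed a script or a custom dashboard panel through the Prometheus HTTP API. A minimal sketch, assuming a Prometheus server reachable at http://prometheus:9090 and the metric names used in the queries above:

```python
# Minimal sketch: check the availability SLI against its SLO target by
# evaluating a PromQL expression via the Prometheus HTTP API.
# The server URL, metric names, and target are assumptions for illustration.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
AVAILABILITY_SLI = (
    'sum(rate(http_requests_total{status!~"5.."}[30d]))'
    " / "
    "sum(rate(http_requests_total[30d]))"
)
SLO_TARGET = 0.999

resp = requests.get(PROM_URL, params={"query": AVAILABILITY_SLI}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]          # instant vector

if not result:
    raise SystemExit("query returned no data")

sli = float(result[0]["value"][1])              # value is [timestamp, "0.99942"]
print(f"availability SLI: {sli:.5f} (target {SLO_TARGET})")
print("SLO met" if sli >= SLO_TARGET else "SLO violated")
```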
Error Budgets#
The error budget is the complement of the SLO: a 99.9% availability SLO leaves a 0.1% error budget – your service is allowed to fail 0.1% of the time over the measurement window. Over a 30-day window, that is roughly 43 minutes of total downtime.
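A quick back-of-the-envelope check of that figure, as a sketch using the 30-day window and 99.9% target from the example above:

```python
# Downtime implied by a 99.9% availability SLO over a 30-day window.
slo_target = 0.999
window_minutes = 30 * 24 * 60                 # 43,200 minutes in 30 days

error_budget_minutes = (1 - slo_target) * window_minutes
print(f"error budget: {error_budget_minutes:.1f} minutes of downtime")  # ~43.2
```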
Error budgets make the reliability vs velocity tradeoff concrete:
- Budget remaining: Ship features, deploy frequently, take risks.
- Budget nearly exhausted: Slow down releases, focus on reliability work, require extra review for changes.
- Budget exceeded: Freeze feature releases. All engineering effort goes to reliability improvements until the budget recovers.
Write an error budget policy that codifies these rules. Without a written policy, the error budget is just a number on a dashboard that no one acts on.
Example error budget policy:
```
Error Budget Policy for payment-api (SLO: 99.95% availability, 30-day window)

Budget > 50% remaining:
- Normal development velocity
- Standard change review process

Budget 20-50% remaining:
- All deployments require explicit SRE approval
- No experimental features deployed to production
- Post-incident reviews must be completed within 48 hours

Budget < 20% remaining:
- Feature freeze for this service
- All engineering effort directed to reliability improvements
- Daily standup focused on error budget recovery

Budget exhausted (SLO violated):
- Escalate to engineering leadership
- Full reliability sprint until SLO is met for 7 consecutive days
- Retrospective on budget consumption pattern
```

Calculate remaining error budget:
```
allowed_failures = (1 - SLO_target) * total_requests_in_window
consumed_failures = actual_failures_in_window
remaining_budget = allowed_failures - consumed_failures
budget_percentage = (remaining_budget / allowed_failures) * 100
```

Toil Reduction#
Toil is operational work that is manual, repetitive, automatable, tactical, and scales linearly with service growth. Restarting a crashed pod is toil. Manually provisioning a database for each new team is toil. Writing the automation to handle these cases is not toil – it is engineering.
The SRE model sets a target: no more than 50% of an SRE’s time should be spent on toil. The other 50% goes to engineering work that permanently reduces toil or improves reliability.
Identifying toil requires tracking it. For one or two weeks, log every operational task:
- What you did
- How long it took
- Whether it was manual
- Whether a human decision was required
- How often it recurs
Sort by frequency multiplied by time per occurrence. The top items are your automation targets (a small sketch of this ranking follows the list below). Common high-toil activities:
- Certificate rotation
- Access provisioning and deprovisioning
- Capacity adjustments (scaling up before expected traffic)
- Log review for recurring known issues
- Configuration changes across multiple environments
- Deployment rollbacks triggered by the same class of failure
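To turn that log into a priority list, multiply each task's frequency by its duration and sort. A minimal sketch, with hypothetical task names and numbers:

```python
# Minimal sketch: rank logged toil by monthly time cost (frequency x duration).
# Task names and figures are hypothetical examples.
toil_log = [
    # (task, occurrences per month, minutes per occurrence)
    ("restart crashed pods", 25, 10),
    ("rotate TLS certificates", 2, 60),
    ("provision access for new hires", 8, 30),
    ("manual rollback after bad deploy", 3, 45),
]

ranked = sorted(toil_log, key=lambda t: t[1] * t[2], reverse=True)
for task, freq, minutes in ranked:
    print(f"{task}: {freq * minutes} minutes/month")
```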
On-Call Practices#
Effective on-call is structured, bounded, and sustainable.
Rotation structure: Minimum two people on-call at any time – a primary and a secondary. One-week rotations with at least one week off between shifts. Shorter rotations (3-4 days) reduce burnout. Longer rotations build context but increase fatigue.
Alert quality: Every page must be actionable, urgent, and novel. If the on-call engineer wakes up, there must be something only a human can decide. Alerts that can be resolved by running a script should not page – the script should run automatically. Review alert volumes weekly. More than two pages per 12-hour shift means your alerts need tuning.
Escalation paths: Define what happens when the primary does not respond within 5 minutes. Define what happens when the problem is outside the on-call engineer’s expertise. Document the chain clearly and keep contact information current.
Handoff: On-call handoffs should include a summary of active incidents, recent changes, known fragile areas, and any manual interventions still pending. A 15-minute handoff meeting between the outgoing and incoming on-call engineers prevents dropped context.
Compensation: On-call work outside business hours must be compensated, either through time off in lieu, direct payment, or both. Uncompensated on-call leads to attrition.
Production Readiness Reviews#
Before a new service goes to production, a production readiness review (PRR) verifies it meets minimum reliability standards. The review is not a gate to slow teams down – it is a checklist to catch gaps early.
A PRR checklist should cover:
Observability: Does the service emit metrics, structured logs, and traces? Are there dashboards showing the golden signals (latency, traffic, errors, saturation)? Are SLOs defined and measured?
Alerting: Are alerts configured for SLO violations? Do alert runbooks exist? Has the on-call team been trained on the new service?
Failure modes: What happens when each dependency is unavailable? Has the service been tested with dependency failures? Are timeouts and circuit breakers configured?
Capacity: What are the resource requirements? Has load testing been performed? Are autoscaling policies in place?
Security: Are secrets managed through a vault or sealed secrets? Are network policies configured? Has the container image been scanned?
Deployment: Is the deployment automated? Can the service be rolled back? Is there a canary or staged rollout process?
Data: Is data backed up? What is the recovery point objective (RPO) and recovery time objective (RTO)? Has recovery been tested?
Run the PRR as a collaborative meeting, not a pass/fail exam. The goal is to identify gaps and create a plan to address them, not to block launches indefinitely.
Reliability vs Velocity#
SRE exists at the intersection of two legitimate needs: the business needs to ship features quickly, and users need the service to work. Error budgets resolve this tension by making both sides quantifiable.
When reliability is high and budget is plentiful, SRE should actively encourage faster shipping – more deploys per day, shorter code review cycles, bolder experiments. When reliability degrades and budget shrinks, SRE has the data to justify slowing down without resorting to subjective arguments about “feeling unsafe.”
The goal is not maximum reliability. The goal is the right level of reliability for each service, explicitly chosen and continuously measured.