From Theory to Running SLOs#
Every SRE resource explains what SLOs are. Few explain how to actually implement them from scratch – the Prometheus queries, the error budget math, the alerting rules, and the conversations with product managers when the budget runs out. This guide covers all of it.
Step 1: Choose Your SLIs#
SLIs must measure what users experience. Internal metrics like CPU usage or queue depth are useful for debugging but are not SLIs because users do not care about your CPU – they care whether the page loaded.
The Four SLI Types#
Availability: Did the request succeed?
# Availability SLI: ratio of successful requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Latency: Was the request fast enough?
# Latency SLI: ratio of requests under 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
Correctness: Did the response contain the right data? Harder to measure – often requires application-level probes or synthetic checks that verify response content.
Freshness: Is the data recent enough? Critical for data pipelines and caches.
# Freshness SLI: time since last successful pipeline run
time() - pipeline_last_success_timestamp_seconds
Measure SLIs at the edge, not the origin. A load balancer’s view captures network failures, TLS issues, and routing errors your application never sees. If you must measure at the application, ensure you also capture connection-level failures.
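As an illustration, here is a minimal edge-level availability SLI. It assumes the metric names exposed by the standard HAProxy exporter – substitute whatever your load balancer or ingress controller actually exposes:
# Edge availability SLI: non-5xx responses as seen by the load balancer
# (haproxy_exporter metric names assumed; adjust for your edge)
1 - (
sum(rate(haproxy_backend_http_responses_total{code="5xx"}[5m]))
/
sum(rate(haproxy_backend_http_responses_total[5m]))
)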
Step 2: Set SLO Targets#
SLO targets are not aspirational. They represent the level of reliability users actually need. Start with historical data.
# Pull 90 days of availability data
Query: avg_over_time(
(sum(rate(http_requests_total{status!~"5.."}[1h]))
/ sum(rate(http_requests_total[1h])))[90d:1h])
Result: 99.95% historical availability
Set your initial SLO slightly below your historical performance. If you have been running at 99.95%, set 99.9%. This gives you headroom and makes the SLO achievable from day one. You can tighten it later.
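The same exercise works for the latency target. A sketch using the histogram metrics from the latency SLI above, assuming your histogram has a bucket at the threshold you care about (here le="0.5" for 500ms):
# Pull 90 days of latency data: fraction of requests under 500ms
avg_over_time(
(sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
/ sum(rate(http_request_duration_seconds_count[1h])))[90d:1h])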
Common SLO targets by service type:
| Service Type | Availability | Latency (p99) |
|-----------------------|-------------|------------------|
| User-facing API | 99.9% | < 500ms |
| Internal API | 99.5% | < 1000ms |
| Data pipeline | 99.5% | Freshness < 5min |
| Batch processing | 99.0% | Completion < 4hr |
| Static content/CDN | 99.95% | < 100ms |
Step 3: Calculate Error Budgets#
The error budget is the amount of unreliability your SLO permits over a given window.
Error Budget = 1 - SLO target
For a 99.9% SLO over 30 days:
Error budget = 0.1% = 0.001
Total minutes in 30 days: 43,200
Allowed downtime: 43,200 × 0.001 = 43.2 minutes
For a 99.5% SLO over 30 days:
Error budget = 0.5% = 0.005
Allowed downtime: 43,200 × 0.005 = 216 minutes (3.6 hours)
Track error budget consumption as a percentage:
# Error budget remaining (Prometheus recording rule)
- record: slo:error_budget_remaining:ratio
  expr: |
    1 - (
      (1 - (sum(rate(http_requests_total{status!~"5.."}[30d]))
            / sum(rate(http_requests_total[30d]))))
      / (1 - 0.999)
    )
When slo:error_budget_remaining:ratio hits 0, you have consumed your entire error budget for the window.
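Dashboards often read better in time than in ratios. A small sketch that builds on the rule above, using the 43.2-minute budget of a 99.9% / 30-day SLO from the calculation earlier:
# Error budget remaining, expressed as minutes of full downtime (99.9% / 30d = 43.2 min)
- record: slo:error_budget_remaining:minutes
  expr: slo:error_budget_remaining:ratio * 43.2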
Step 4: Define Error Budget Policies#
The error budget policy is what makes SLOs operational. Without a policy, the error budget is just a number on a dashboard that nobody acts on.
## Error Budget Policy: payment-api
**SLO**: 99.9% availability, 30-day rolling window
### Budget > 50% remaining
- Normal development velocity
- Feature work proceeds as planned
- Standard deployment cadence
### Budget 25-50% remaining
- Prioritize reliability work in next sprint
- Increase deployment testing (canary duration from 10min to 30min)
- Review recent incidents for systemic issues
### Budget 5-25% remaining
- Freeze non-critical feature deployments
- All engineering effort shifts to reliability
- Daily error budget review in standup
### Budget < 5% remaining (or exhausted)
- Complete feature freeze
- All deployments require SRE approval
- Incident review for every error budget consumption event
- Escalate to engineering leadership
### Budget exhausted
- Postmortem required identifying systemic causes
- Reliability sprint: minimum 2 weeks focused on fixes
- Feature freeze remains until budget recovers above 25%
The policy must have teeth. If product management can override a feature freeze whenever they want, the error budget policy is fiction.
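One way to give it teeth is to wire the policy thresholds into alerting, so crossing a band notifies the team instead of waiting for someone to glance at a dashboard. A sketch, assuming the slo:error_budget_remaining:ratio recording rule from Step 3; the alert names and severities are illustrative:
# Alerts that mirror the error budget policy bands
groups:
  - name: slo-error-budget-policy
    rules:
      - alert: ErrorBudgetBelow25Percent
        expr: slo:error_budget_remaining:ratio < 0.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "payment-api error budget below 25% - freeze non-critical deployments"
      - alert: ErrorBudgetExhausted
        expr: slo:error_budget_remaining:ratio <= 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "payment-api error budget exhausted - complete feature freeze per policy"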
Step 5: SLO-Based Alerting with Burn Rates#
Threshold-based alerts are noisy. “Error rate > 1%” fires on a brief spike that consumes negligible budget. Burn rate alerting solves this by asking: “At the current rate of errors, when will we exhaust the error budget?”
Burn rate = (actual error rate) / (SLO-permitted error rate)
For a 99.9% SLO:
Permitted error rate = 0.1%
If current error rate = 0.5%
Burn rate = 0.5% / 0.1% = 5x
At 5x burn rate, a 30-day budget is consumed in 6 days.
Implement multi-window burn rate alerts (Google’s recommended approach):
# Prometheus alerting rules for SLO burn rate
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: 14.4x over 1 hour (exhausts budget in ~2 days)
      # Short window for confirmation: 5 minutes
      - alert: SLOHighBurnRate_Critical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "High SLO burn rate - budget exhausted in ~2 days"
      # Slow burn: 3x over 6 hours (exhausts budget in ~10 days)
      - alert: SLOHighBurnRate_Warning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (3 * 0.001)
        labels:
          severity: warning
        annotations:
          summary: "Elevated SLO burn rate - budget exhausted in ~10 days"
The long window catches sustained problems. The short window prevents alerting on issues that have already resolved. This dual-window approach dramatically reduces false positives compared to single-threshold alerts.
Step 6: Stakeholder Communication#
SLOs are useless if only the engineering team knows about them. Product managers, executives, and customer-facing teams need to understand what SLOs mean and how error budgets affect planning.
Weekly SLO Report#
## SLO Status Report - Week of 2026-02-17
| Service | SLO Target | Current | Budget Remaining | Trend |
|-------------|-----------|----------|-----------------|--------|
| payment-api | 99.9% | 99.92% | 62% | Stable |
| search-api | 99.5% | 99.1% | 18% | Down |
| auth-service | 99.9% | 99.97% | 88% | Stable |
### Action Items
- search-api: Error budget below 25%. Reliability sprint started.
Root cause: connection pool exhaustion under peak load (JIRA-4601).
Feature deployments paused until budget recovers above 50%.
Frame error budgets as a shared resource. Product managers should think of the error budget like a spending account: deploying a risky feature costs some budget; a planned maintenance window costs some budget. Running the budget to zero means no more risk-taking until it recovers. This turns reliability from an abstract concern into a concrete resource that competes fairly with feature work.