From Theory to Running SLOs#

Every SRE resource explains what SLOs are. Few explain how to actually implement them from scratch – the Prometheus queries, the error budget math, the alerting rules, and the conversations with product managers when the budget runs out. This guide covers all of it.

Step 1: Choose Your SLIs#

SLIs must measure what users experience. Internal metrics like CPU usage or queue depth are useful for debugging but are not SLIs because users do not care about your CPU – they care whether the page loaded.

The Four SLI Types#

Availability: Did the request succeed?

# Availability SLI: ratio of successful requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

Latency: Was the request fast enough?

# Latency SLI: ratio of requests under 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

Correctness: Did the response contain the right data? Harder to measure – often requires application-level probes or synthetic checks that verify response content.
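
A minimal sketch of what that can look like, assuming a synthetic prober (hypothetical here) that replays known-good requests and exports a counter synthetic_check_total with a result label:

# Correctness SLI: ratio of synthetic checks whose response content was correct
# (assumes a prober exporting synthetic_check_total{result="pass"|"fail"} - adjust to your checker)
sum(rate(synthetic_check_total{result="pass"}[5m]))
/
sum(rate(synthetic_check_total[5m]))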

Freshness: Is the data recent enough? Critical for data pipelines and caches.

# Freshness SLI: time since last successful pipeline run
time() - pipeline_last_success_timestamp_seconds

Measure SLIs at the edge, not the origin. A load balancer’s view captures network failures, TLS issues, and routing errors your application never sees. If you must measure at the application, ensure you also capture connection-level failures.
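
The query shape does not change, only where the data comes from. A sketch assuming an edge proxy that exports a request counter edge_http_requests_total with a code label (placeholder names; substitute your load balancer's actual series):

# Availability measured at the edge (placeholder metric; use your LB's real series)
sum(rate(edge_http_requests_total{code!~"5.."}[5m]))
/
sum(rate(edge_http_requests_total[5m]))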

Step 2: Set SLO Targets#

SLO targets are not aspirational. They represent the level of reliability users actually need. Start with historical data.

# Pull 90 days of availability data
Query: avg_over_time(
  (sum(rate(http_requests_total{status!~"5.."}[1h]))
   / sum(rate(http_requests_total[1h])))[90d:1h])

Result: 99.95% historical availability

Set your initial SLO slightly below your historical performance. If you have been running at 99.95%, set 99.9%. This gives you headroom and makes the SLO achievable from day one. You can tighten it later.
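
A quick check of how much headroom that buys over a 30-day window:

SLO target 99.9%   -> permitted error rate: 0.1%
Historical 99.95%  -> typical error rate:   0.05%
Typical budget consumption: 0.05% / 0.1% = 50% per window

Running at your historical level leaves roughly half of every window's budget for deployments, experiments, and bad luck.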

Common SLO targets by service type:

| Service Type       | Availability | Latency / Timeliness |
|---------------------|--------------|----------------------|
| User-facing API     | 99.9%        | p99 < 500ms          |
| Internal API        | 99.5%        | p99 < 1000ms         |
| Data pipeline       | 99.5%        | Freshness < 5min     |
| Batch processing    | 99.0%        | Completion < 4hr     |
| Static content/CDN  | 99.95%       | p99 < 100ms          |

Step 3: Calculate Error Budgets#

The error budget is the amount of unreliability your SLO permits over a given window.

Error Budget = 1 - SLO target

For a 99.9% SLO over 30 days:
  Error budget = 0.1% = 0.001
  Total minutes in 30 days: 43,200
  Allowed downtime: 43,200 × 0.001 = 43.2 minutes

For a 99.5% SLO over 30 days:
  Error budget = 0.5% = 0.005
  Allowed downtime: 43,200 × 0.005 = 216 minutes (3.6 hours)

Track error budget consumption as a percentage:

# Error budget remaining (Prometheus recording rule)
- record: slo:error_budget_remaining:ratio
  expr: |
    1 - (
      (1 - (sum(rate(http_requests_total{status!~"5.."}[30d]))
            / sum(rate(http_requests_total[30d]))))
      / (1 - 0.999)
    )

When slo:error_budget_remaining:ratio reaches 0, you have consumed the entire error budget for the window; if it goes negative, you are operating over budget.
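
A worked example makes the expression easier to trust. With the 99.9% SLO and a measured 30-day availability of 99.96%:

Error rate over the window: 1 - 0.9996 = 0.0004
Budget consumed:            0.0004 / (1 - 0.999) = 0.40  (40%)
slo:error_budget_remaining:ratio = 1 - 0.40 = 0.60  (60% remaining)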

Step 4: Define Error Budget Policies#

The error budget policy is what makes SLOs operational. Without a policy, the error budget is just a number on a dashboard that nobody acts on.

## Error Budget Policy: payment-api

**SLO**: 99.9% availability, 30-day rolling window

### Budget > 50% remaining
- Normal development velocity
- Feature work proceeds as planned
- Standard deployment cadence

### Budget 25-50% remaining
- Prioritize reliability work in next sprint
- Increase deployment testing (canary duration from 10min to 30min)
- Review recent incidents for systemic issues

### Budget 5-25% remaining
- Freeze non-critical feature deployments
- All engineering effort shifts to reliability
- Daily error budget review in standup

### Budget < 5% remaining
- Complete feature freeze
- All deployments require SRE approval
- Incident review for every error budget consumption event
- Escalate to engineering leadership

### Budget exhausted
- Postmortem required identifying systemic causes
- Reliability sprint: minimum 2 weeks focused on fixes
- Feature freeze remains until budget recovers above 25%

The policy must have teeth. If product management can override a feature freeze whenever they want, the error budget policy is fiction.
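
One way to give it operational teeth is to page on the policy thresholds themselves, using the recording rule from Step 3. A sketch (thresholds mirror the policy above; severities and routing are yours to choose):

# Alerts on policy thresholds, built on the Step 3 recording rule (sketch)
- alert: ErrorBudgetLow
  expr: slo:error_budget_remaining:ratio < 0.25
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "payment-api error budget below 25% - non-critical deployments freeze per policy"

- alert: ErrorBudgetExhausted
  expr: slo:error_budget_remaining:ratio <= 0
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "payment-api error budget exhausted - full feature freeze per policy"

These complement, rather than replace, the burn rate alerts in Step 5: burn rate tells you something is going wrong right now, while threshold alerts tell you where you stand against the policy.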

Step 5: SLO-Based Alerting with Burn Rates#

Threshold-based alerts are noisy. “Error rate > 1%” fires on a brief spike that consumes negligible budget. Burn rate alerting solves this by asking: “At the current rate of errors, when will we exhaust the error budget?”

Burn rate = (actual error rate) / (SLO-permitted error rate)

For a 99.9% SLO:
  Permitted error rate = 0.1%
  If current error rate = 0.5%
  Burn rate = 0.5% / 0.1% = 5x

  At 5x burn rate, a 30-day budget is consumed in 6 days.
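
The window multipliers used in the rules below come straight from this relationship: decide how long you are willing to let a given burn rate continue, then solve for the rate.

Time to exhaustion = window length / burn rate

30-day window = 720 hours
14.4x burn -> 720 / 14.4 = 50 hours  (~2 days)  -> page immediately
3x burn    -> 720 / 3    = 240 hours (~10 days) -> ticket / next business day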

Implement multi-window burn rate alerts (Google’s recommended approach):

# Prometheus alerting rules for SLO burn rate
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: 14.4x over 1 hour (exhausts budget in ~2 days)
      # Short window for confirmation: 5 minutes
      - alert: SLOHighBurnRate_Critical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "High SLO burn rate - budget exhausted in ~2 days"

      # Slow burn: 3x over 6 hours (exhausts budget in ~10 days)
      - alert: SLOHighBurnRate_Warning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (3 * 0.001)
        labels:
          severity: warning
        annotations:
          summary: "Elevated SLO burn rate - budget exhausted in ~10 days"

The long window catches sustained problems. The short window prevents alerting on issues that have already resolved. This dual-window approach dramatically reduces false positives compared to single-threshold alerts.
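
If the repeated ratio expressions become hard to maintain, a common refinement (optional, sketched here with rule names of my choosing) is to precompute per-window error ratios as recording rules and compare those in the alerts:

# Optional: precompute error ratios so alert expressions stay short (sketch)
- record: slo:http_error_ratio:rate1h
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
- record: slo:http_error_ratio:rate5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))

# The critical alert expression then reads:
#   slo:http_error_ratio:rate1h > (14.4 * 0.001)
#   and slo:http_error_ratio:rate5m > (14.4 * 0.001)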

Step 6: Stakeholder Communication#

SLOs are useless if only the engineering team knows about them. Product managers, executives, and customer-facing teams need to understand what SLOs mean and how error budgets affect planning.

Weekly SLO Report#

## SLO Status Report - Week of 2026-02-17

| Service      | SLO Target | 30-day Actual | Budget Remaining | Trend  |
|--------------|------------|---------------|------------------|--------|
| payment-api  | 99.9%      | 99.96%        | 60%              | Stable |
| search-api   | 99.5%      | 99.59%        | 18%              | Down   |
| auth-service | 99.9%      | 99.99%        | 90%              | Stable |

### Action Items
- search-api: Error budget below 25%. Reliability sprint started.
  Root cause: connection pool exhaustion under peak load (JIRA-4601).
  Feature deployments paused until budget recovers above 50%.
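
If your services share the request metric with a service label (an assumption about your labeling; adjust to your schema), the availability column can be produced by a single query:

# 30-day availability per service (assumes a `service` label on http_requests_total)
sum by (service) (rate(http_requests_total{status!~"5.."}[30d]))
/
sum by (service) (rate(http_requests_total[30d]))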

Frame error budgets as a shared resource. Product managers should think of error budget like a spending account: deploying a risky feature costs some budget. A planned maintenance window costs some budget. Running the budget to zero means no more risk-taking until it recovers. This turns reliability from an abstract concern into a concrete resource that competes fairly with feature work.