SLI, SLO, and SLA – What They Actually Mean#
An SLI (Service Level Indicator) is a quantitative measurement of service quality – a number computed from your metrics. Examples: the proportion of successful HTTP requests, the proportion of requests faster than 500ms, the proportion of jobs completing within their deadline.
An SLO (Service Level Objective) is a target value for an SLI. It is an internal engineering commitment: “99.9% of requests will succeed over a 30-day rolling window.”
An SLA (Service Level Agreement) is a business contract with consequences – typically service credits if the agreed service level is not met. An SLA should be less aggressive than the internal SLO behind it. If your SLO is 99.9%, your SLA might be 99.5%, giving you a buffer before contractual obligations kick in.
Choosing SLIs#
Good SLIs are user-facing measurements. Internal metrics like CPU usage or queue depth are useful for debugging but poor SLIs because they do not directly represent user experience.
Availability: the ratio of successful requests to total requests. Define “successful” precisely – typically non-5xx, but exclude 429 (rate limiting is intentional).
Latency: the proportion of requests faster than a threshold. Never use average latency – it hides tail latency. Use a percentile-at-threshold: “99% of requests under 500ms.”
Freshness: for data pipelines, the age of the most recent successfully processed record.
Throughput: for batch systems, the proportion of jobs completing within their scheduled window (see the sketch after this list).
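Freshness and throughput SLIs are less standardized than availability and latency. A minimal PromQL sketch, assuming hypothetical metrics (pipeline_last_success_timestamp_seconds exporting the timestamp of the last processed record, and a batch_jobs_total counter with an on_time status) rather than anything your services necessarily expose:

# Freshness: seconds since the last successfully processed record
time() - pipeline_last_success_timestamp_seconds{job="etl"}

# Freshness as a ratio SLI: share of the last 30 days where data was
# fresher than 15 minutes (900 seconds)
avg_over_time(
  ((time() - pipeline_last_success_timestamp_seconds{job="etl"}) < bool 900)[30d:1m]
)

# Throughput: proportion of batch jobs finishing inside their scheduled window
sum(rate(batch_jobs_total{status="on_time"}[30d]))
/ sum(rate(batch_jobs_total[30d]))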
The Error Budget#
If your SLO is 99.9% availability over 30 days, your error budget is 0.1%. In concrete terms:
30 days * 24 hours * 60 minutes = 43,200 minutes
0.1% of 43,200 = 43.2 minutes of allowed downtime

The error budget reframes the reliability conversation. Instead of “should we deploy on Friday?” the question becomes “do we have budget remaining to absorb a potential incident?” When budget remains, deploy aggressively. When it is exhausted, focus on reliability.
Implementing Availability SLI in PromQL#
The fundamental availability SLI query:
# 30-day availability ratio for a service
sum(rate(http_requests_total{job="api", code!~"5.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))

This works but is expensive to evaluate – it loads 30 days of raw data. In practice, you layer recording rules:
groups:
  - name: sli_availability
    interval: 30s
    rules:
      # Layer 1: short-window error ratios (used by burn-rate alerts)
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio_rate30m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[30m]))
          / sum by (job) (rate(http_requests_total[30m]))
      - record: job:http_errors:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1h]))
          / sum by (job) (rate(http_requests_total[1h]))
      - record: job:http_errors:ratio_rate6h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[6h]))
          / sum by (job) (rate(http_requests_total[6h]))
      # Windows referenced by the day-scale (ticket-tier) alerts below
      - record: job:http_errors:ratio_rate2h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[2h]))
          / sum by (job) (rate(http_requests_total[2h]))
      - record: job:http_errors:ratio_rate1d
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1d]))
          / sum by (job) (rate(http_requests_total[1d]))
      - record: job:http_errors:ratio_rate3d
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[3d]))
          / sum by (job) (rate(http_requests_total[3d]))

Implementing Latency SLI with Histograms#
For a latency SLI of “99% of requests complete in under 500ms”:
# Proportion of requests faster than 500ms over 30 days
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))

The le="0.5" bucket contains all observations less than or equal to 500ms. Dividing by the total count gives the proportion within the threshold.
Critical requirement: your histogram must have a bucket boundary at or near your SLO threshold. If your buckets are [0.1, 0.25, 1.0, 5.0] and your SLO threshold is 500ms, there is no le="0.5" bucket. You would have to use le="1.0", which overstates compliance. Configure bucket boundaries to match your SLO thresholds.
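To see which boundaries an existing histogram actually exposes before committing to a threshold, a quick check (the selector is illustrative – point it at your own metric):

# Lists one series per bucket boundary; the le values are your usable thresholds
count by (le) (http_request_duration_seconds_bucket{job="api"})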
Recording rules for latency SLI follow the same pattern as availability:
groups:
  - name: sli_latency
    interval: 30s
    rules:
      - record: job:http_latency_below_threshold:ratio_rate5m
        expr: |
          sum by (job) (rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
          / sum by (job) (rate(http_request_duration_seconds_count[5m]))

Multi-Window Multi-Burn-Rate Alerting#
Why Simple Threshold Alerts Fail for SLOs#
A naive alert like error_ratio > 0.001 (targeting 99.9%) fires on any momentary spike, even ones that consume negligible budget. Setting a for: 1h duration avoids noise but means you do not get paged until an hour into a major incident. You need alerts that are sensitive to severe incidents and tolerant of minor blips.
The Burn Rate Concept#
Burn rate is how fast you are consuming error budget relative to a uniform consumption rate. A burn rate of 1 means you are consuming budget at exactly the rate that would exhaust it at the end of the SLO window. A burn rate of 10 means you would exhaust the budget in 1/10th of the window.
For a 99.9% SLO (0.1% error budget) over 30 days:
Burn rate 1: 0.1% error rate -- budget exhausted in 30 days (this is your baseline)
Burn rate 2: 0.2% error rate -- budget exhausted in 15 days
Burn rate 10: 1.0% error rate -- budget exhausted in 3 days
Burn rate 14: 1.4% error rate -- budget exhausted in ~2 days
Burn rate 36: 3.6% error rate -- budget exhausted in ~20 hours

The Four-Window Approach#
Google’s SRE workbook recommends four alert windows, combining a short window for detection speed and a long window for significance:
Page-worthy (immediate response required):
| Severity | Burn Rate | Long Window | Short Window | Budget Consumed |
|---|---|---|---|---|
| Critical | 14.4x | 1h | 5m | 2% in 1 hour |
| Critical | 6x | 6h | 30m | 5% in 6 hours |
Ticket-worthy (next business day):
| Severity | Burn Rate | Long Window | Short Window | Budget Consumed |
|---|---|---|---|---|
| Warning | 3x | 1d | 2h | 10% in 1 day |
| Warning | 1x | 3d | 6h | 10% in 3 days |
The short window prevents the alert from firing when the problem has already resolved; both windows must exceed the threshold for the alert to fire. The budget-consumed figures follow from burn rate * (long window / SLO period): 14.4 * 1h/720h ≈ 2%, 6 * 6h/720h = 5%, and so on.
Implementation as Alerting Rules#
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page: 2% budget consumed in 1 hour
      - alert: SLOHighBurnRate_Critical_1h
        expr: |
          job:http_errors:ratio_rate5m{job="api"} > (14.4 * 0.001)
          and
          job:http_errors:ratio_rate1h{job="api"} > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: "api-availability"
          window: "1h"
        annotations:
          summary: "API error budget burning at 14.4x -- 2% consumed in 1 hour"
          budget_consumed: "2%"
      # Page: 5% budget consumed in 6 hours
      - alert: SLOHighBurnRate_Critical_6h
        expr: |
          job:http_errors:ratio_rate30m{job="api"} > (6 * 0.001)
          and
          job:http_errors:ratio_rate6h{job="api"} > (6 * 0.001)
        for: 5m
        labels:
          severity: critical
          slo: "api-availability"
          window: "6h"
        annotations:
          summary: "API error budget burning at 6x -- 5% consumed in 6 hours"
          budget_consumed: "5%"
      # Ticket: 10% budget consumed in 1 day
      - alert: SLOHighBurnRate_Warning_1d
        expr: |
          job:http_errors:ratio_rate2h{job="api"} > (3 * 0.001)
          and
          job:http_errors:ratio_rate1d{job="api"} > (3 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: "api-availability"
          window: "1d"
        annotations:
          summary: "API error budget burning at 3x -- 10% consumed in 1 day"
      # Ticket: 10% budget consumed in 3 days
      - alert: SLOHighBurnRate_Warning_3d
        expr: |
          job:http_errors:ratio_rate6h{job="api"} > (1 * 0.001)
          and
          job:http_errors:ratio_rate3d{job="api"} > (1 * 0.001)
        for: 30m
        labels:
          severity: warning
          slo: "api-availability"
          window: "3d"
        annotations:
          summary: "API error budget burning at 1x -- will exhaust in 30 days at this rate"

The 0.001 in each expression is the error budget (1 - 0.999). Multiply it by the burn rate to get the threshold error ratio.
Error Budget Dashboard#
A Grafana dashboard for error budget tracking needs these panels:
Budget remaining gauge:
# Remaining error budget as a percentage (0-100)
(
  1 - (
    sum(rate(http_requests_total{job="api", code=~"5.."}[30d]))
    / sum(rate(http_requests_total{job="api"}[30d]))
  ) / 0.001
) * 100

Simplify with recording rules. Color thresholds: green above 50%, yellow 20-50%, red below 20%.
Current burn rate:
# Current burn rate (1 = sustainable, >1 = consuming budget too fast)
job:http_errors:ratio_rate1h{job="api"} / 0.001

Time until budget exhaustion at current rate:
# Hours until the remaining budget is exhausted at the current 1h error rate
# (negative means the budget is already spent)
(
  0.001
  - sum by (job) (rate(http_requests_total{job="api", code=~"5.."}[30d]))
    / sum by (job) (rate(http_requests_total{job="api"}[30d]))
)
/ job:http_errors:ratio_rate1h{job="api"} * 720

Where 720 = 30 days * 24 hours. The numerator is the budget not yet spent over the trailing 30 days; dividing by the current hourly error ratio estimates how many hours remain at that rate.
Budget consumption by error type (requires a label distinguishing error categories):
sum by (code) (rate(http_requests_total{job="api", code=~"5.."}[30d]))
  / ignoring(code) group_left
sum(rate(http_requests_total{job="api"}[30d]))

This reveals whether budget is consumed by 502s (upstream failures), 503s (overload), or 500s (application bugs).
Error Budget Policy#
Without a written policy, error budgets are just numbers on a dashboard. A practical policy, with a query sketch for its thresholds after the list:
Above 50%: Normal operations. Deploy freely. Run chaos experiments.
20-50%: Increased caution. Deployments require extra review. Investigate ongoing error sources.
Below 20%: Feature deployments paused unless they improve reliability. Post-incident reviews mandatory.
Exhausted: Feature freeze. Only bug fixes and reliability work. Freeze lifts when the rolling window recovers.
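To make the tiers enforceable rather than advisory, the budget-remaining expression from the dashboard section can be compared against the same thresholds. A sketch – the deploy-gate use is a suggestion, not part of the policy above:

# Returns 1 while more than 20% of the budget remains, 0 otherwise --
# usable as a simple deploy-gate or notification signal
(
  1 - (
    sum(rate(http_requests_total{job="api", code=~"5.."}[30d]))
    / sum(rate(http_requests_total{job="api"}[30d]))
  ) / 0.001
) > bool 0.2

Swap 0.2 for 0.5 to implement the increased-caution tier the same way.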
Multi-Tier Application Example#
For a 3-tier application (API gateway, worker service, PostgreSQL database), define SLOs per tier:
API gateway: 99.9% availability (non-5xx), 99% of requests under 500ms.
Worker service: 99.9% of jobs complete successfully, 99% of jobs complete within 60 seconds.
Database: 99.95% availability (connection success rate), 99% of queries under 100ms.
Each tier gets its own set of recording rules and burn-rate alerts. The API gateway SLO is the most user-facing and the most important – backend issues that do not cause API errors do not consume the API’s error budget.
groups:
  - name: sli_api
    rules:
      - record: sli:api_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api-gateway", code!~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api-gateway"}[5m]))
  - name: sli_worker
    rules:
      - record: sli:worker_success:ratio_rate5m
        expr: |
          sum(rate(jobs_completed_total{job="worker", status="success"}[5m]))
          / sum(rate(jobs_completed_total{job="worker"}[5m]))
  - name: sli_database
    rules:
      - record: sli:db_availability:ratio_rate5m
        expr: |
          sum(rate(pg_connections_total{job="postgres", status="success"}[5m]))
          / sum(rate(pg_connections_total{job="postgres"}[5m]))

Pyrra and Sloth#
Writing recording rules and burn-rate alerts by hand is tedious and error-prone. Two tools automate this.
Sloth takes an SLO definition in YAML and generates all the recording rules and multi-window burn-rate alerts:
# sloth.yml
version: "prometheus/v1"
service: "api-gateway"
labels:
  team: "platform"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "API availability"
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="api-gateway", code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="api-gateway"}[{{.window}}]))
    alerting:
      name: APIHighErrorRate
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

Run sloth generate -i sloth.yml and it outputs a complete PrometheusRule with all the recording rules and four-window burn-rate alerts.
Pyrra provides similar functionality but also includes a web UI that displays SLO compliance, error budget status, and burn rate. It runs as a Kubernetes operator that watches SLO custom resources and generates PrometheusRule resources automatically.
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-availability
  namespace: monitoring
spec:
  target: "99.9"
  window: 30d
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="api-gateway", code=~"5.."}
      total:
        metric: http_requests_total{job="api-gateway"}

Both tools follow the Google SRE multi-window multi-burn-rate approach. Sloth is simpler (CLI tool, generates YAML). Pyrra is more integrated (Kubernetes operator, web dashboard). Either eliminates manual work and reduces miscalculated thresholds.
Common Pitfalls#
SLOs too tight: A 99.99% SLO gives 4.3 minutes of budget per month. A single rollback can consume half of it. Match your SLO to your deployment pipeline’s capabilities.
Measuring the wrong thing: Status-code-only SLIs miss slow-but-successful requests. Combine availability and latency SLIs. Always exclude health check endpoints from SLI calculations (a query sketch follows this list).
Ignoring partial failures: A 200 response with empty results or stale data looks healthy to a status-code SLI. Use application-level success signals when possible.
No error budget policy: Without documented, agreed-upon consequences, budgets are ignored when inconvenient. Get leadership buy-in before the budget runs out.
Calendar vs rolling window: Calendar-month SLOs reset on the first of the month, creating perverse incentives – a major incident on the 30th is forgiven a day later when the budget resets. A 30-day rolling window provides consistent pressure and is strongly preferred.
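For the health-check exclusion mentioned above, a sketch assuming the request metric carries a handler label (both the label and the /health path pattern are assumptions about your instrumentation):

# Availability SLI with health-check traffic excluded from both sides
sum(rate(http_requests_total{job="api", code!~"5..", handler!~"/health.*"}[5m]))
/ sum(rate(http_requests_total{job="api", handler!~"/health.*"}[5m]))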