SLI, SLO, and SLA – What They Actually Mean#
An SLI (Service Level Indicator) is a quantitative measurement of service quality – a number computed from your metrics. Examples: the proportion of successful HTTP requests, the proportion of requests faster than 500ms, the proportion of jobs completing within their deadline.
An SLO (Service Level Objective) is a target value for an SLI. It is an internal engineering commitment: “99.9% of requests will succeed over a 30-day rolling window.”
An SLA (Service Level Agreement) is a business contract with consequences – typically service credits if the agreed service level is not met. An SLA should be less aggressive than the internal SLO behind it. If your SLO is 99.9%, your SLA might be 99.5%, giving you a buffer before contractual obligations kick in.
Choosing SLIs#
Good SLIs are user-facing measurements. Internal metrics like CPU usage or queue depth are useful for debugging but poor SLIs because they do not directly represent user experience.
Availability: the ratio of successful requests to total requests. Define “successful” precisely – typically non-5xx, but exclude 429 (rate limiting is intentional).
Latency: the proportion of requests faster than a threshold. Never use average latency – it hides tail latency. Use a percentile-at-threshold: “99% of requests under 500ms.”
Freshness: for data pipelines, the age of the most recent successfully processed record.
Throughput: for batch systems, the proportion of jobs completing within their scheduled window (see the sketch after this list).
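Freshness and throughput SLIs are less standardized than availability and latency. A minimal PromQL sketch, assuming hypothetical metrics (pipeline_last_success_timestamp_seconds exporting the timestamp of the last processed record, and a batch_jobs_total counter with an on_time status) rather than anything your services necessarily expose:

# Freshness: seconds since the last successfully processed record
time() - pipeline_last_success_timestamp_seconds{job="etl"}

# Freshness as a ratio SLI: share of the last 30 days where data was
# fresher than 15 minutes (900 seconds)
avg_over_time(
  ((time() - pipeline_last_success_timestamp_seconds{job="etl"}) < bool 900)[30d:1m]
)

# Throughput: proportion of batch jobs finishing inside their scheduled window
sum(rate(batch_jobs_total{status="on_time"}[30d]))
/ sum(rate(batch_jobs_total[30d]))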
The Error Budget#
If your SLO is 99.9% availability over 30 days, your error budget is 0.1%. In concrete terms:
30 days * 24 hours * 60 minutes = 43,200 minutes
0.1% of 43,200 = 43.2 minutes of allowed downtime

The error budget reframes the reliability conversation. Instead of “should we deploy on Friday?” the question becomes “do we have budget remaining to absorb a potential incident?” When budget remains, deploy aggressively. When it is exhausted, focus on reliability.
Implementing Availability SLI in PromQL#
The fundamental availability SLI query:
# 30-day availability ratio for a service
sum(rate(http_requests_total{job="api", code!~"5.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))

This works but is expensive to evaluate – it loads 30 days of raw data. In practice, you layer recording rules:
groups:
  - name: sli_availability
    interval: 30s
    rules:
      # Layer 1: short-window error ratios (used by burn-rate alerts)
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio_rate30m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[30m]))
          / sum by (job) (rate(http_requests_total[30m]))
      - record: job:http_errors:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1h]))
          / sum by (job) (rate(http_requests_total[1h]))
      - record: job:http_errors:ratio_rate6h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[6h]))
          / sum by (job) (rate(http_requests_total[6h]))
      # Windows referenced by the day-scale (ticket-tier) alerts below
      - record: job:http_errors:ratio_rate2h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[2h]))
          / sum by (job) (rate(http_requests_total[2h]))
      - record: job:http_errors:ratio_rate1d
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1d]))
          / sum by (job) (rate(http_requests_total[1d]))
      - record: job:http_errors:ratio_rate3d
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[3d]))
          / sum by (job) (rate(http_requests_total[3d]))

Implementing Latency SLI with Histograms#
For a latency SLI of “99% of requests complete in under 500ms”:
# Proportion of requests faster than 500ms over 30 days
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))

The le="0.5" bucket contains all observations less than or equal to 500ms. Dividing by the total count gives the proportion within the threshold.
Critical requirement: your histogram must have a bucket boundary at or near your SLO threshold. If your buckets are [0.1, 0.25, 1.0, 5.0] and your SLO threshold is 500ms, there is no le="0.5" bucket. You would have to use le="1.0", which overstates compliance. Configure bucket boundaries to match your SLO thresholds.
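To see which boundaries an existing histogram actually exposes before committing to a threshold, a quick check (the selector is illustrative – point it at your own metric):

# Lists one series per bucket boundary; the le values are your usable thresholds
count by (le) (http_request_duration_seconds_bucket{job="api"})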
Recording rules for latency SLI follow the same pattern as availability:
groups:
  - name: sli_latency
    interval: 30s
    rules:
      - record: job:http_latency_below_threshold:ratio_rate5m
        expr: |
          sum by (job) (rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
          / sum by (job) (rate(http_request_duration_seconds_count[5m]))

Multi-Window Multi-Burn-Rate Alerting#
Why Simple Threshold Alerts Fail for SLOs#
A naive alert like error_ratio > 0.001 (targeting 99.9%) fires on any momentary spike, even ones that consume negligible budget. Setting a for: 1h duration avoids noise but means you do not get paged until an hour into a major incident. You need alerts that are sensitive to severe incidents and tolerant of minor blips.
The Burn Rate Concept#
Burn rate is how fast you are consuming error budget relative to a uniform consumption rate. A burn rate of 1 means you are consuming budget at exactly the rate that would exhaust it at the end of the SLO window. A burn rate of 10 means you would exhaust the budget in 1/10th of the window.
For a 99.9% SLO (0.1% error budget) over 30 days:
Burn rate 1: 0.1% error rate -- budget exhausted in 30 days (this is your baseline)
Burn rate 2: 0.2% error rate -- budget exhausted in 15 days
Burn rate 10: 1.0% error rate -- budget exhausted in 3 days
Burn rate 14: 1.4% error rate -- budget exhausted in ~2 days
Burn rate 36: 3.6% error rate -- budget exhausted in ~20 hours

The Four-Window Approach#
Google’s SRE workbook recommends four alert windows, combining a short window for detection speed and a long window for significance:
Page-worthy (immediate response required):
| Severity | Burn Rate | Long Window | Short Window | Budget Consumed |
|---|---|---|---|---|
| Critical | 14.4x | 1h | 5m | 2% in 1 hour |
| Critical | 6x | 6h | 30m | 5% in 6 hours |
Ticket-worthy (next business day):
| Severity | Burn Rate | Long Window | Short Window | Budget Consumed |
|---|---|---|---|---|
| Warning | 3x | 1d | 2h | 10% in 1 day |
| Warning | 1x | 3d | 6h | 10% in 3 days |
The short window prevents the alert from firing when the problem has already resolved; both windows must exceed the threshold for the alert to fire. The budget-consumed figures follow from burn rate * (long window / SLO period): 14.4 * 1h/720h ≈ 2%, 6 * 6h/720h = 5%, and so on.
Implementation as Alerting Rules#
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page: 2% budget consumed in 1 hour
      - alert: SLOHighBurnRate_Critical_1h
        expr: |
          job:http_errors:ratio_rate5m{job="api"} > (14.4 * 0.001)
          and
          job:http_errors:ratio_rate1h{job="api"} > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: "api-availability"
          window: "1h"
        annotations:
          summary: "API error budget burning at 14.4x -- 2% consumed in 1 hour"
          budget_consumed: "2%"
      # Page: 5% budget consumed in 6 hours
      - alert: SLOHighBurnRate_Critical_6h
        expr: |
          job:http_errors:ratio_rate30m{job="api"} > (6 * 0.001)
          and
          job:http_errors:ratio_rate6h{job="api"} > (6 * 0.001)
        for: 5m
        labels:
          severity: critical
          slo: "api-availability"
          window: "6h"
        annotations:
          summary: "API error budget burning at 6x -- 5% consumed in 6 hours"
          budget_consumed: "5%"
      # Ticket: 10% budget consumed in 1 day
      - alert: SLOHighBurnRate_Warning_1d
        expr: |
          job:http_errors:ratio_rate2h{job="api"} > (3 * 0.001)
          and
          job:http_errors:ratio_rate1d{job="api"} > (3 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: "api-availability"
          window: "1d"
        annotations:
          summary: "API error budget burning at 3x -- 10% consumed in 1 day"
      # Ticket: 10% budget consumed in 3 days
      - alert: SLOHighBurnRate_Warning_3d
        expr: |
          job:http_errors:ratio_rate6h{job="api"} > (1 * 0.001)
          and
          job:http_errors:ratio_rate3d{job="api"} > (1 * 0.001)
        for: 30m
        labels:
          severity: warning
          slo: "api-availability"
          window: "3d"
        annotations:
          summary: "API error budget burning at 1x -- will exhaust in 30 days at this rate"

The 0.001 in each expression is the error budget (1 - 0.999). Multiply it by the burn rate to get the threshold error ratio.
Error Budget Dashboard#
A Grafana dashboard for error budget tracking needs these panels:
Budget remaining gauge:
# Remaining error budget as a percentage (0-100)
(
  1 - (
    sum(rate(http_requests_total{job="api", code=~"5.."}[30d]))
    / sum(rate(http_requests_total{job="api"}[30d]))
  ) / 0.001
) * 100

Simplify with recording rules. Color thresholds: green above 50%, yellow 20-50%, red below 20%.
Current burn rate:
# Current burn rate (1 = sustainable, >1 = consuming budget too fast)
job:http_errors:ratio_rate1h{job="api"} / 0.001

Time until budget exhaustion at current rate:
# Hours until the remaining budget is exhausted at the current 1h error rate
# (negative means the budget is already spent)
(
  0.001
  - sum by (job) (rate(http_requests_total{job="api", code=~"5.."}[30d]))
    / sum by (job) (rate(http_requests_total{job="api"}[30d]))
)
/ job:http_errors:ratio_rate1h{job="api"} * 720

Where 720 = 30 days * 24 hours. The numerator is the budget not yet spent over the trailing 30 days; dividing by the current hourly error ratio estimates how many hours remain at that rate.
Budget consumption by error type (requires a label distinguishing error categories):
sum by (code) (rate(http_requests_total{job="api", code=~"5.."}[30d]))
  / ignoring(code) group_left
sum(rate(http_requests_total{job="api"}[30d]))

This reveals whether budget is consumed by 502s (upstream failures), 503s (overload), or 500s (application bugs).
Error Budget Policy#
Without a written policy, error budgets are just numbers on a dashboard. A practical policy, with a query sketch for its thresholds after the list:
Above 50%: Normal operations. Deploy freely. Run chaos experiments.
20-50%: Increased caution. Deployments require extra review. Investigate ongoing error sources.
Below 20%: Feature deployments paused unless they improve reliability. Post-incident reviews mandatory.
Exhausted: Feature freeze. Only bug fixes and reliability work. Freeze lifts when the rolling window recovers.
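To make the tiers enforceable rather than advisory, the budget-remaining expression from the dashboard section can be compared against the same thresholds. A sketch – the deploy-gate use is a suggestion, not part of the policy above:

# Returns 1 while more than 20% of the budget remains, 0 otherwise --
# usable as a simple deploy-gate or notification signal
(
  1 - (
    sum(rate(http_requests_total{job="api", code=~"5.."}[30d]))
    / sum(rate(http_requests_total{job="api"}[30d]))
  ) / 0.001
) > bool 0.2

Swap 0.2 for 0.5 to implement the increased-caution tier the same way.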
Multi-Tier Application Example#
For a 3-tier application (API gateway, worker service, PostgreSQL database), define SLOs per tier:
API gateway: 99.9% availability (non-5xx), 99% of requests under 500ms.
Worker service: 99.9% of jobs complete successfully, 99% of jobs complete within 60 seconds.
Database: 99.95% availability (connection success rate), 99% of queries under 100ms.
Each tier gets its own set of recording rules and burn-rate alerts. The API gateway SLO is the most user-facing and the most important – backend issues that do not cause API errors do not consume the API’s error budget.
groups:
  - name: sli_api
    rules:
      - record: sli:api_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api-gateway", code!~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api-gateway"}[5m]))
  - name: sli_worker
    rules:
      - record: sli:worker_success:ratio_rate5m
        expr: |
          sum(rate(jobs_completed_total{job="worker", status="success"}[5m]))
          / sum(rate(jobs_completed_total{job="worker"}[5m]))
  - name: sli_database
    rules:
      - record: sli:db_availability:ratio_rate5m
        expr: |
          sum(rate(pg_connections_total{job="postgres", status="success"}[5m]))
          / sum(rate(pg_connections_total{job="postgres"}[5m]))

Pyrra and Sloth#
Writing recording rules and burn-rate alerts by hand is tedious and error-prone. Two tools automate this.
Sloth takes an SLO definition in YAML and generates all the recording rules and multi-window burn-rate alerts:
# sloth.yml
version: "prometheus/v1"
service: "api-gateway"
labels:
  team: "platform"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "API availability"
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="api-gateway", code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="api-gateway"}[{{.window}}]))
    alerting:
      name: APIHighErrorRate
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning

Run sloth generate -i sloth.yml and it outputs a complete PrometheusRule with all the recording rules and four-window burn-rate alerts.
Pyrra provides similar functionality but also includes a web UI that displays SLO compliance, error budget status, and burn rate. It runs as a Kubernetes operator that watches SLO custom resources and generates PrometheusRule resources automatically.
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-availability
  namespace: monitoring
spec:
  target: "99.9"
  window: 30d
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="api-gateway", code=~"5.."}
      total:
        metric: http_requests_total{job="api-gateway"}

Both tools follow the Google SRE multi-window multi-burn-rate approach. Sloth is simpler (CLI tool, generates YAML). Pyrra is more integrated (Kubernetes operator, web dashboard). Either eliminates manual work and reduces miscalculated thresholds.
Common Pitfalls#
SLOs too tight: A 99.99% SLO gives 4.3 minutes of budget per month. A single rollback can consume half of it. Match your SLO to your deployment pipeline’s capabilities.
Measuring the wrong thing: Status-code-only SLIs miss slow-but-successful requests. Combine availability and latency SLIs. Always exclude health check endpoints from SLI calculations (a query sketch follows this list).
Ignoring partial failures: A 200 response with empty results or stale data looks healthy to a status-code SLI. Use application-level success signals when possible.
No error budget policy: Without documented, agreed-upon consequences, budgets are ignored when inconvenient. Get leadership buy-in before the budget runs out.
Calendar vs rolling window: Calendar-month SLOs reset on the first of the month, creating perverse incentives – a major incident on the 30th is forgiven a day later when the budget resets. A 30-day rolling window provides consistent pressure and is strongly preferred.
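For the health-check exclusion mentioned above, a sketch assuming the request metric carries a handler label (both the label and the /health path pattern are assumptions about your instrumentation):

# Availability SLI with health-check traffic excluded from both sides
sum(rate(http_requests_total{job="api", code!~"5..", handler!~"/health.*"}[5m]))
/ sum(rate(http_requests_total{job="api", handler!~"/health.*"}[5m]))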