What Toil Actually Is#

Toil is work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth. Not all operational work is toil. Capacity planning requires judgment. Postmortem analysis produces lasting improvements. Writing automation code is engineering. Toil is the opposite: it is the work that a machine could do but currently a human is doing, over and over, without making the system any better.

Google’s SRE book provides the canonical definition, but teams misapply it constantly. The most common mistake is labeling all ops work as toil. On-call is not inherently toil. Responding to a novel incident, diagnosing a complex failure, and writing a fix is engineering. Manually restarting a service for the third time this week because nobody has fixed the memory leak – that is toil.

Six properties define toil:

  • Manual: A human has to perform the work, even if parts of it are scripted; kicking off a script by hand still counts.
  • Repetitive: It happens more than once. One-time migration tasks are not toil.
  • Automatable: A machine could do this if someone built the automation.
  • Tactical: It is reactive, interrupt-driven, not strategic.
  • No enduring value: The service is not improved after the work. Restarting a crashed pod keeps things running but fixes nothing permanently.
  • Scales with service growth: More users, more requests, more manual work. If your manual certificate rotation scales with the number of services, that is toil.

Measuring Toil#

You cannot reduce what you do not measure. Toil measurement starts with a simple inventory and evolves into continuous tracking.

The Toil Survey#

Run a toil survey every quarter. Each team member logs their activities for one representative week, categorizing every task as engineering work or toil. Use a simple spreadsheet:

| Task                           | Category | Time per occurrence (hrs) | Frequency | Automatable? |
|--------------------------------|----------|---------------------------|-----------|--------------|
| Deploy staging environment     | Toil     | 1.5                       | 3x/week   | Yes          |
| Rotate API keys manually       | Toil     | 0.5                       | Monthly   | Yes          |
| Investigate false alert pages  | Toil     | 2.0                       | 5x/week   | Partially    |
| Write capacity plan for Q3     | Eng      | 4.0                       | Quarterly | No           |
| Manually approve DB migrations | Toil     | 0.75                      | 2x/week   | Yes          |
| Debug novel latency spike      | Eng      | 3.0                       | 1x/week   | No           |

Calculating Toil Percentage#

Aggregate across the team:

Toil Percentage = (Total Toil Hours / Total Working Hours) × 100

Example:
  Team of 5 engineers, 40-hour weeks
  Total working hours: 200 hours/week
  Total toil hours: 90 hours/week
  Toil percentage: 45%
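
This aggregation is easy to script against the survey spreadsheet. The sketch below is a minimal example, assuming the survey is exported as survey.csv with the column names from the table above and one row per team member per task; the file name, the frequency parser, and the 5-engineer/40-hour baseline are assumptions, not standard tooling.

# toil_percentage.py - a minimal sketch, not a finished tool.
# Assumes survey.csv has "Category", "Time per occurrence (hrs)" and
# "Frequency" columns, with one row per team member per task.
import csv

TEAM_SIZE = 5
HOURS_PER_WEEK = 40

def occurrences_per_week(freq: str) -> float:
    """Convert entries like '3x/week', 'Monthly', 'Quarterly' to a weekly rate."""
    freq = freq.strip().lower()
    if freq.endswith("x/week"):
        return float(freq.split("x")[0])
    return {"weekly": 1.0, "monthly": 12 / 52, "quarterly": 4 / 52}.get(freq, 0.0)

def toil_percentage(path: str) -> float:
    toil_hours = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["Category"].strip().lower() == "toil":
                hours = float(row["Time per occurrence (hrs)"])
                toil_hours += hours * occurrences_per_week(row["Frequency"])
    return toil_hours / (TEAM_SIZE * HOURS_PER_WEEK) * 100

print(f"Toil percentage: {toil_percentage('survey.csv'):.1f}%")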

Toil Metrics to Track#

Build a Grafana dashboard with these signals:

# Prometheus metrics for toil tracking
- metric: toil_task_duration_seconds
  labels: [team, task_type, automatable]
  type: histogram
  description: "Time spent on identified toil tasks"

- metric: toil_task_count_total
  labels: [team, task_type]
  type: counter
  description: "Number of toil tasks performed"

- metric: toil_percentage_gauge
  labels: [team]
  type: gauge
  description: "Current toil percentage for team"

Track toil percentage per team per quarter. Plot the trend line. If it is going up, you are losing ground.
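
A minimal instrumentation sketch using the Python prometheus_client library is shown below. The metric names mirror the definitions above; the record_toil_task helper, the team and task names, and the port are illustrative assumptions.

import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TOIL_DURATION = Histogram(
    "toil_task_duration_seconds", "Time spent on identified toil tasks",
    ["team", "task_type", "automatable"],
)
# prometheus_client appends "_total" on exposure, matching toil_task_count_total
TOIL_COUNT = Counter(
    "toil_task_count", "Number of toil tasks performed", ["team", "task_type"],
)
TOIL_PERCENTAGE = Gauge(
    "toil_percentage_gauge", "Current toil percentage for team", ["team"],
)

def record_toil_task(team: str, task_type: str, automatable: bool, seconds: float) -> None:
    TOIL_DURATION.labels(team, task_type, str(automatable).lower()).observe(seconds)
    TOIL_COUNT.labels(team, task_type).inc()

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    start = time.monotonic()
    # ... perform the manual task, e.g. restarting the leaky service ...
    record_toil_task("payments-sre", "manual_restart", True, time.monotonic() - start)
    TOIL_PERCENTAGE.labels("payments-sre").set(45.0)  # from the quarterly survey

A Grafana panel for the trend can then be driven by a query such as sum by (team) (increase(toil_task_duration_seconds_sum[13w])) / 3600, which gives toil hours per team over roughly one quarter.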

The Toil Budget#

Google targets a maximum of 50% toil for any SRE team. Below 50%, the team has enough engineering time to build automation that further reduces toil. Above 50%, the team enters a death spiral where all time goes to firefighting and manual work with no capacity to automate.

Set explicit toil budgets per team:

Toil Budget Policy:
- Target: < 33% of engineering time spent on toil
- Warning threshold: 40%
- Critical threshold: 50%
- If a team exceeds 50% toil for 2 consecutive quarters:
    - Freeze new feature work on their services
    - Allocate dedicated automation sprint
    - Escalate to engineering leadership for staffing review

The budget is a forcing function. When a team approaches its budget, it must either automate existing toil or push back on new manual processes being introduced.
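
A small sketch of that check, assuming quarterly toil percentages are already collected per team; the thresholds mirror the policy above and the action strings are illustrative.

TARGET, WARNING, CRITICAL = 33.0, 40.0, 50.0

def budget_action(history: list[float]) -> str:
    """history: quarterly toil percentages for one team, most recent last."""
    current = history[-1]
    if len(history) >= 2 and min(history[-2:]) > CRITICAL:
        return "freeze feature work, run automation sprint, escalate for staffing review"
    if current > CRITICAL:
        return "critical: above 50%, plan an automation sprint now"
    if current > WARNING:
        return "warning: above 40%, pull top-ROI automation items forward"
    if current > TARGET:
        return "above target: review the toil inventory"
    return "within budget"

print(budget_action([38.0, 44.0]))  # warning: above 40%, ...
print(budget_action([52.0, 55.0]))  # freeze feature work, ...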

Automation Prioritization Matrix#

Not all toil is equally worth automating. Use a 2x2 matrix of time saved versus automation effort:

                    Low Effort to Automate    High Effort to Automate
                    ─────────────────────     ──────────────────────
High Time Saved  │  DO FIRST (Quick Wins)   │  PLAN (Strategic Projects)
                 │  - Manual deployments     │  - Self-healing systems
                 │  - Certificate rotation   │  - Auto-capacity scaling
                 │  - Log cleanup scripts    │  - Intelligent alerting
                 │                           │
Low Time Saved   │  DO NEXT (Easy Wins)      │  SKIP (Low ROI)
                 │  - Status page updates    │  - Rare one-off tasks
                 │  - Report generation      │  - Complex edge cases
                 │  - Config file templating │  - Monthly manual reviews

Score each toil item with a concrete ROI calculation:

ROI Score = (Time_per_occurrence × Frequency_per_year × Years_of_value) / Automation_effort

Example: Manual deploy to staging
  Time per occurrence: 45 minutes (0.75 hours)
  Frequency: 150 times/year (3x/week × 50 weeks)
  Value horizon: 3 years
  Automation effort: 16 hours to build CI/CD pipeline

  ROI = (0.75 × 150 × 3) / 16 = 21.1x return

Example: Quarterly compliance report
  Time per occurrence: 4 hours
  Frequency: 4 times/year
  Value horizon: 3 years
  Automation effort: 40 hours

  ROI = (4 × 4 × 3) / 40 = 1.2x return

Anything above 5x ROI is a clear win. Between 2x and 5x, evaluate based on team pain and secondary benefits like reduced error rates. Below 2x, defer unless there are compliance or risk reasons.
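
The scoring is straightforward to codify. The sketch below runs the two worked examples through the formula, using the 5x and 2x cut-offs from the guidance above; the ToilItem class is an illustrative helper, not an existing tool.

from dataclasses import dataclass

@dataclass
class ToilItem:
    name: str
    hours_per_occurrence: float
    occurrences_per_year: int
    years_of_value: int
    automation_effort_hours: float

    @property
    def roi(self) -> float:
        saved = self.hours_per_occurrence * self.occurrences_per_year * self.years_of_value
        return saved / self.automation_effort_hours

    @property
    def verdict(self) -> str:
        if self.roi >= 5:
            return "clear win: automate"
        if self.roi >= 2:
            return "evaluate team pain and secondary benefits"
        return "defer unless compliance or risk requires it"

for item in [
    ToilItem("Manual deploy to staging", 0.75, 150, 3, 16),
    ToilItem("Quarterly compliance report", 4.0, 4, 3, 40),
]:
    print(f"{item.name}: {item.roi:.1f}x -> {item.verdict}")
# Manual deploy to staging: 21.1x -> clear win: automate
# Quarterly compliance report: 1.2x -> defer unless compliance or risk requires it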

Tracking Toil Reduction Over Time#

Measurement without follow-through is theater. Build toil reduction into your team’s regular cadence.

Quarterly Toil Review#

Agenda:
1. Review current toil percentage vs budget (10 min)
2. Review toil inventory changes since last quarter (15 min)
3. Status update on in-progress automation projects (10 min)
4. Identify new toil introduced since last review (10 min)
5. Prioritize next quarter's automation targets (15 min)

Outputs:
- Updated toil inventory spreadsheet
- Automation backlog for next quarter (top 3-5 items)
- Toil percentage trend chart
- Escalations if budget exceeded

Jira/Linear Integration#

Create a dedicated toil label or project. Every identified toil item becomes a ticket with:

Title: [TOIL] Manual certificate rotation for payment-api
Labels: toil, automation-candidate
Priority: Based on ROI score
Custom fields:
  - Time per occurrence: 30 min
  - Frequency: Monthly per service (12 services = 12x/month)
  - Annual time cost: 72 hours
  - Estimated automation effort: 8 hours
  - ROI score: 27x
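
Filing these tickets can itself be automated. The sketch below uses Jira's standard create-issue endpoint (POST /rest/api/2/issue); the base URL, credentials, and project key are placeholders, and the ROI details go into the description rather than custom fields, whose IDs vary per Jira instance.

import requests

JIRA_URL = "https://example.atlassian.net"    # placeholder instance
AUTH = ("bot@example.com", "api-token-here")  # placeholder credentials

def file_toil_ticket(service: str, task: str, minutes_per_occurrence: float,
                     occurrences_per_year: int, effort_hours: float,
                     years_of_value: int = 3) -> str:
    annual_hours = minutes_per_occurrence / 60 * occurrences_per_year
    roi = annual_hours * years_of_value / effort_hours
    payload = {
        "fields": {
            "project": {"key": "SRE"},      # assumed project key
            "issuetype": {"name": "Task"},
            "summary": f"[TOIL] {task} for {service}",
            "labels": ["toil", "automation-candidate"],
            "description": (
                f"Time per occurrence: {minutes_per_occurrence:.0f} min\n"
                f"Annual time cost: {annual_hours:.0f} hours\n"
                f"Estimated automation effort: {effort_hours:.0f} hours\n"
                f"ROI score: {roi:.0f}x"
            ),
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["key"]

# 30 min x 12 services monthly = 144 occurrences/year, 8h to automate -> ROI 27x
print(file_toil_ticket("payment-api", "Manual certificate rotation", 30, 144, 8))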

Preventing New Toil#

The hardest part is not reducing existing toil but preventing new toil from being created. Require toil assessments for any new manual process:

  • Every new runbook must include an automation plan with a target date.
  • Any manual process expected to run more than 5 times must have a ticket to automate it.
  • Production readiness reviews must evaluate whether a new service introduces toil for the on-call team.

Track “toil created” alongside “toil eliminated” each quarter. If you are eliminating 10 hours of toil but creating 12, you are moving backward. The goal is a consistently downward trend in toil percentage over multiple quarters, giving your team more time for engineering work that makes the system genuinely better.
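
A minimal sketch of that bookkeeping, assuming eliminated and created toil are both logged in hours per quarter; the numbers are illustrative.

quarters = {
    "2024-Q1": {"eliminated": 10, "created": 12},  # illustrative figures
    "2024-Q2": {"eliminated": 18, "created": 6},
}
for quarter, hours in quarters.items():
    net = hours["created"] - hours["eliminated"]
    trend = "moving backward" if net > 0 else "net reduction"
    print(f"{quarter}: {net:+d} toil hours ({trend})")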