## The Manual-to-Automated Progression
Not every runbook should be automated, and automation does not happen in a single jump. The progression builds confidence at each stage.
- **Level 0 – Tribal Knowledge:** The procedure exists only in someone’s head. Invisible risk.
- **Level 1 – Documented Runbook:** Step-by-step instructions a human follows, including commands, expected outputs, and decision points. Every runbook starts here.
- **Level 2 – Scripted Runbook:** Manual steps encoded in a script that a human triggers and monitors. The script handles the tedious parts; the human handles judgment calls.
- **Level 3 – Semi-Automated:** Runs automatically when triggered but pauses at key decision points for human approval. The sweet spot for most operational procedures.
- **Level 4 – Fully Automated:** Runs end-to-end without human intervention. Appropriate only for well-understood, low-risk, high-frequency operations with comprehensive safety checks.
## When to Automate vs. Keep Manual

### Automate When
- Frequency is high. Run more than once a week – automation pays for itself quickly.
- Steps are deterministic. Every step has a clear input, action, and expected output.
- Time sensitivity matters. Procedure must complete in minutes; human execution is a bottleneck.
- Manual errors are common. Typos, missed steps, wrong order.
- The procedure is stable. Has not changed significantly in 3 months.
### Keep Manual When
- Complex judgment is required. Each execution requires evaluating novel conditions.
- Blast radius is catastrophic and irreversible. Data deletion, production schema changes.
- Frequency is very low. Run once a year – automation rots faster than it saves time.
- The environment is unstable. Target systems change frequently; automation needs constant maintenance.
|               | High Frequency            | Low Frequency             |
|---------------|---------------------------|---------------------------|
| **High Risk** | Semi-automated (Level 3)  | Manual with detailed docs |
| **Low Risk**  | Fully automated (Level 4) | Scripted (Level 2)        |

## Automation Tools
### Rundeck
A job scheduler and runbook automation platform that provides a web UI for executing pre-defined procedures under role-based access control (RBAC).
Best for: Centralizing operations where on-call engineers click “Run” on pre-built procedures. Good for transitioning from manual to automated because it wraps existing scripts with a UI, access controls, and logging.
```yaml
# Rundeck job definition
- id: restart-service
  name: "Restart Kubernetes Deployment"
  sequence:
    commands:
      - description: "Pre-check: verify deployment is healthy"
        script: |
          kubectl rollout status deployment/${option.deployment} \
            -n ${option.namespace} --timeout=30s
      - description: "Execute rolling restart"
        script: |
          kubectl rollout restart deployment/${option.deployment} \
            -n ${option.namespace}
      - description: "Wait for rollout"
        script: |
          kubectl rollout status deployment/${option.deployment} \
            -n ${option.namespace} --timeout=300s
```

Strengths: Web UI, RBAC, audit logging, LDAP integration, approval workflows. Weaknesses: Java-based (resource-heavy), limited orchestration logic.
### Ansible AWX
Open-source upstream of Red Hat Ansible Automation Platform. Web UI, REST API, and RBAC around Ansible playbooks.
Best for: Teams already using Ansible who want to expose playbooks as self-service operations with inventory management.
```yaml
- name: Rotate TLS certificates
  hosts: "{{ target_hosts }}"
  tasks:
    - name: Backup existing certificate
      copy:
        src: /etc/ssl/certs/service.crt
        dest: "/etc/ssl/backup/service.crt.{{ ansible_date_time.iso8601_basic }}"
        remote_src: true

    - name: Generate new certificate
      community.crypto.x509_certificate:
        path: /etc/ssl/certs/service.crt
        provider: ownca
        ownca_path: /etc/ssl/certs/ca.crt
        ownca_not_after: "+365d"

    - name: Verify new certificate is served
      uri:
        url: "https://{{ inventory_hostname }}:{{ service_port }}/health"
        validate_certs: true
      retries: 3
      delay: 5
```

Strengths: Massive module ecosystem, agentless, declarative and idempotent. Weaknesses: Slow at scale (SSH per host), AWX has significant infrastructure overhead.
### StackStorm
Event-driven automation platform connecting triggers (alerts, webhooks) to actions through rules and workflows.
Best for: Alert-driven remediation – “if this alert fires, run that procedure.”
```yaml
# StackStorm rule: auto-remediate high memory
name: remediate_high_memory
trigger:
  type: prometheus.webhook
  parameters:
    alert_name: "PodMemoryHigh"
criteria:
  trigger.labels.severity:
    type: equals
    pattern: "warning"
action:
  ref: kubernetes.restart_pod
  parameters:
    namespace: "{{ trigger.labels.namespace }}"
    pod_name: "{{ trigger.labels.pod }}"
    approval_required: true
```

Strengths: Event-driven, large integration ecosystem, complex workflow support. Weaknesses: Complex setup, smaller community, requires dedicated infrastructure.
### Custom Scripts
For simple, targeted automation, a well-written Bash or Python script may be sufficient.
Strengths: Simple, no additional infrastructure, easy to version control. Weaknesses: No built-in RBAC or UI, audit trail requires external logging, no approval workflows without wrapping.
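A minimal sketch of what such a script can look like in Python, assuming the pre-check/execute/post-check shape this guide uses elsewhere; each step body is a placeholder to replace with real commands (for example, `subprocess` calls to `kubectl`):

```python
"""Minimal Level 2 runbook skeleton: a human triggers it, the script executes."""
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("runbook")

def pre_check() -> bool:
    # Placeholder: verify the target exists and is in the expected state.
    return True

def execute() -> bool:
    # Placeholder: the impactful step (restart, rotation, rebuild).
    return True

def post_check() -> bool:
    # Placeholder: confirm the system is healthy after the change.
    return True

def run() -> int:
    """Run each phase in order; stop at the first failure and report per step."""
    for name, step in [("pre-check", pre_check),
                       ("execute", execute),
                       ("post-check", post_check)]:
        if not step():
            log.error("step %s: failed, aborting", name)
            return 1
        log.info("step %s: pass", name)
    return 0
```

Even at this size the script gives you step-level logging and a non-zero exit code on failure, which is what external audit tooling needs to hook into.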
## Safety Checks and Guardrails
Automated runbooks execute faster than humans can intervene. A script with a bug causes more damage in 10 seconds than a human could in 10 minutes.
### Pre-Execution
- Target validation. Confirm the target exists and is in the expected state.
- Environment confirmation. Verify you are operating in the intended environment.
- Dependency health. Check that monitoring, logging, and backup systems are available.
- Concurrency guard. Ensure another instance is not already running against the same target.
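Two of these checks can be sketched in a few lines of Python; the lock path and the `RUNBOOK_ENV` variable are hypothetical names for illustration, not a standard:

```python
import os

LOCK_PATH = "/tmp/runbook-cache-rebuild.lock"  # hypothetical lock location
EXPECTED_ENV = "staging"                       # hypothetical target environment

def acquire_lock(path: str = LOCK_PATH) -> bool:
    """Concurrency guard: atomically create a lock file; fail if it exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another instance is already running against this target
    os.write(fd, str(os.getpid()).encode())  # record who holds the lock
    os.close(fd)
    return True

def confirm_environment(expected: str = EXPECTED_ENV) -> bool:
    """Environment confirmation: refuse to run unless RUNBOOK_ENV matches."""
    return os.environ.get("RUNBOOK_ENV") == expected
```

The `O_CREAT | O_EXCL` combination makes lock creation atomic, so two concurrent executions cannot both believe they hold the lock.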
### During Execution
- Step validation. After each step, verify the expected outcome before proceeding.
- Rate limiting. Process multiple targets in batches with pauses between them.
- Timeout enforcement. Every step has a timeout. Hanging steps fail the runbook, not block it.
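Rate limiting and timeout enforcement together might look like the following sketch; the batch size, pause, and timeout values are illustrative, and `restart_target` is a placeholder for the real per-target command:

```python
import subprocess
import time

BATCH_SIZE = 5        # illustrative: targets per batch
PAUSE_SECONDS = 30    # illustrative: settle time between batches
STEP_TIMEOUT = 60     # illustrative: per-step timeout in seconds

def restart_target(target: str) -> None:
    # Placeholder command; the timeout turns a hanging step into a failure
    # (subprocess.TimeoutExpired) instead of blocking the runbook forever.
    subprocess.run(["echo", f"restart {target}"], check=True,
                   timeout=STEP_TIMEOUT, stdout=subprocess.DEVNULL)

def process_in_batches(targets: list,
                       batch_size: int = BATCH_SIZE,
                       pause: float = PAUSE_SECONDS) -> int:
    """Rate limiting: act on targets in batches, pausing between them."""
    processed = 0
    for i in range(0, len(targets), batch_size):
        for target in targets[i:i + batch_size]:
            restart_target(target)
            processed += 1
        if i + batch_size < len(targets):
            time.sleep(pause)  # let the system settle before the next batch
    return processed
```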
### Post-Execution
- Health verification. Verify the system is healthy after completion.
- Metric comparison. Compare key metrics against the pre-execution baseline.
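Metric comparison can be as simple as checking each post-execution value against the baseline with a tolerance; this sketch assumes lower-is-better metrics and an illustrative 10% threshold:

```python
# Pre-execution baseline (illustrative values for lower-is-better metrics).
BASELINE = {"error_rate": 0.01, "p99_latency_ms": 250.0}
TOLERANCE = 0.10  # allow up to a 10% regression before flagging

def metrics_regressed(baseline: dict, current: dict,
                      tolerance: float = TOLERANCE) -> list:
    """Return the names of metrics that worsened beyond the tolerance."""
    regressed = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is not None and value > base * (1 + tolerance):
            regressed.append(name)
    return regressed
```

A non-empty result should fail the runbook and trigger rollback rather than be logged and ignored.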
## Approval Workflows
Semi-automated runbooks pause for human approval at critical points.
**Pre-execution approval:** The runbook waits for a human to approve before taking any action. Good for scheduled maintenance where an engineer reviews conditions first.

**Mid-execution approval:** Preparatory steps (gathering data, backups, pre-checks) run automatically, then the runbook pauses before the impactful step. The human sees the pre-check results and decides.

**Escalating approval:** Low-risk steps execute automatically. If unexpected conditions arise or a higher-risk action is needed, the runbook pauses for senior approval.
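The mid-execution pattern can be sketched as a generic gate; `prepare`, `approve`, and `execute` are placeholders for the actual pre-checks, approval mechanism (chat prompt, ticket, UI button), and impactful step:

```python
from typing import Callable

def run_with_approval(prepare: Callable[[], dict],
                      approve: Callable[[dict], bool],
                      execute: Callable[[], None]) -> bool:
    """Mid-execution approval: prep runs automatically; the impactful step
    only runs after a human decision based on the prep results."""
    precheck_results = prepare()       # gather data, take backups, run pre-checks
    if not approve(precheck_results):  # human reviews the results and decides
        return False                   # declined: nothing impactful has run
    execute()                          # approved: perform the impactful step
    return True
```

The key property is that a declined approval leaves the system untouched: everything before the gate is read-only or reversible by construction.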
## Audit Trails
Every execution must produce an audit trail answering: who triggered it, when, what it did, and what the outcome was.
```json
{
  "runbook": "cache-rebuild",
  "execution_id": "exec-2026-0222-143052",
  "triggered_by": "alert:CacheHitRateLow",
  "triggered_at": "2026-02-22T14:30:52Z",
  "approved_by": "oncall-sre@company.com",
  "target": "api-server/production",
  "steps_executed": [
    {"step": "pre-check", "status": "pass", "timestamp": "2026-02-22T14:31:16Z"},
    {"step": "cache-rebuild", "status": "pass", "timestamp": "2026-02-22T14:31:45Z"},
    {"step": "post-check", "status": "pass", "timestamp": "2026-02-22T14:32:48Z"}
  ],
  "outcome": "success",
  "baseline_metrics": {"cache_hit_rate": 0.45},
  "post_metrics": {"cache_hit_rate": 0.92}
}
```

Store audit records in a durable, append-only system. The automation system should not be able to modify or delete its own logs.
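One minimal way to produce such records is JSON Lines appended to a log the runner only ever opens in append mode; the path here is hypothetical, and true tamper-resistance still requires shipping the file to an external store the runner cannot write to:

```python
import json
import time

AUDIT_LOG = "/var/log/runbooks/audit.jsonl"  # hypothetical destination

def append_audit_record(record: dict, path: str = AUDIT_LOG) -> None:
    """Write one JSON record per line; the runner only opens the log in append mode."""
    record.setdefault("recorded_at",
                      time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()))
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
```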
## Agent Operational Notes
- Start at Level 2. Script the steps but keep human triggering. Move to Level 3 only after multiple successful scripted executions.
- Never skip pre-checks. A pre-check that prevents one bad execution justifies its existence for every future run.
- Make rollback the default. If any step fails or any post-check deviates from expected results, roll back rather than continue.
- Log at the step level. “Steps 1-5 succeeded, step 6 skipped (condition not met), step 7 succeeded” is useful. “The runbook succeeded” is not.
- Treat automation code as production code. Version control, code review, testing, and staged rollouts apply to runbook automation.
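The rollback-by-default note can be expressed as a small wrapper; this is a sketch, with `step` and `rollback` standing in for real operations:

```python
from typing import Callable

def run_step_with_rollback(step: Callable[[], bool],
                           rollback: Callable[[], None]) -> bool:
    """Rollback is the default: a False result or an exception triggers rollback."""
    try:
        if step():
            return True
    except Exception:
        pass  # treat an unexpected error the same as an explicit failure
    rollback()
    return False
```

Wrapping every impactful step this way means "continue despite failure" requires a deliberate decision rather than being the accidental default.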