## The Manual-to-Automated Progression
Not every runbook should be automated, and automation does not happen in a single jump. The progression builds confidence at each stage.
- **Level 0 – Tribal Knowledge:** The procedure exists only in someone’s head. Invisible risk.
- **Level 1 – Documented Runbook:** Step-by-step instructions a human follows, including commands, expected outputs, and decision points. Every runbook starts here.
- **Level 2 – Scripted Runbook:** Manual steps encoded in a script that a human triggers and monitors. The script handles the tedious parts; the human handles judgment calls.
- **Level 3 – Semi-Automated:** Runs automatically when triggered but pauses at key decision points for human approval. The sweet spot for most operational procedures.
- **Level 4 – Fully Automated:** Runs end-to-end without human intervention. Appropriate only for well-understood, low-risk, high-frequency operations with comprehensive safety checks.
## When to Automate vs. Keep Manual

### Automate When
- Frequency is high. Run more than once a week – automation pays for itself quickly.
- Steps are deterministic. Every step has a clear input, action, and expected output.
- Time sensitivity matters. Procedure must complete in minutes; human execution is a bottleneck.
- Manual errors are common. Typos, missed steps, wrong order.
- The procedure is stable. Has not changed significantly in 3 months.
### Keep Manual When
- Complex judgment is required. Each execution requires evaluating novel conditions.
- Blast radius is catastrophic and irreversible. Data deletion, production schema changes.
- Frequency is very low. Run once a year – automation rots faster than it saves time.
- The environment is unstable. Target systems change frequently; automation needs constant maintenance.
|               | High Frequency            | Low Frequency             |
|---------------|---------------------------|---------------------------|
| **High Risk** | Semi-automated (Level 3)  | Manual with detailed docs |
| **Low Risk**  | Fully automated (Level 4) | Scripted (Level 2)        |

## Automation Tools
### Rundeck
A job scheduler and runbook automation platform that provides a web UI for executing pre-defined procedures under role-based access control (RBAC).
Best for: Centralizing operations where on-call engineers click “Run” on pre-built procedures. Good for transitioning from manual to automated because it wraps existing scripts with a UI, access controls, and logging.
```yaml
# Rundeck job definition
- id: restart-service
  name: "Restart Kubernetes Deployment"
  sequence:
    commands:
      - description: "Pre-check: verify deployment is healthy"
        script: |
          kubectl rollout status deployment/${option.deployment} \
            -n ${option.namespace} --timeout=30s
      - description: "Execute rolling restart"
        script: |
          kubectl rollout restart deployment/${option.deployment} \
            -n ${option.namespace}
      - description: "Wait for rollout"
        script: |
          kubectl rollout status deployment/${option.deployment} \
            -n ${option.namespace} --timeout=300s
```

Strengths: Web UI, RBAC, audit logging, LDAP integration, approval workflows. Weaknesses: Java-based (resource-heavy), limited orchestration logic.
### Ansible AWX
Open-source upstream of Red Hat Ansible Automation Platform. Web UI, REST API, and RBAC around Ansible playbooks.
Best for: Teams already using Ansible who want to expose playbooks as self-service operations with inventory management.
```yaml
- name: Rotate TLS certificates
  hosts: "{{ target_hosts }}"
  tasks:
    - name: Backup existing certificate
      copy:
        src: /etc/ssl/certs/service.crt
        dest: "/etc/ssl/backup/service.crt.{{ ansible_date_time.iso8601_basic }}"
        remote_src: true

    - name: Generate new certificate
      community.crypto.x509_certificate:
        path: /etc/ssl/certs/service.crt
        provider: ownca
        ownca_path: /etc/ssl/certs/ca.crt
        ownca_not_after: "+365d"

    - name: Verify new certificate is served
      uri:
        url: "https://{{ inventory_hostname }}:{{ service_port }}/health"
        validate_certs: true
      retries: 3
      delay: 5
```

Strengths: Massive module ecosystem, agentless, declarative and idempotent. Weaknesses: Slow at scale (SSH per host), AWX has significant infrastructure overhead.
### StackStorm
Event-driven automation platform connecting triggers (alerts, webhooks) to actions through rules and workflows.
Best for: Alert-driven remediation – “if this alert fires, run that procedure.”
```yaml
# StackStorm rule: auto-remediate high memory
name: remediate_high_memory
trigger:
  type: prometheus.webhook
  parameters:
    alert_name: "PodMemoryHigh"
criteria:
  trigger.labels.severity:
    type: equals
    pattern: "warning"
action:
  ref: kubernetes.restart_pod
  parameters:
    namespace: "{{ trigger.labels.namespace }}"
    pod_name: "{{ trigger.labels.pod }}"
    approval_required: true
```

Strengths: Event-driven, large integration ecosystem, complex workflow support. Weaknesses: Complex setup, smaller community, requires dedicated infrastructure.
### Custom Scripts
For simple, targeted automation, a well-written Bash or Python script may be sufficient.
Strengths: Simple, no additional infrastructure, easy to version control. Weaknesses: No built-in RBAC or UI, audit trail requires external logging, no approval workflows without wrapping.
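A minimal sketch of what such a script can look like in Python, assuming the pre-check/execute/post-check shape this guide uses elsewhere; each step body is a placeholder to replace with real commands (for example, `subprocess` calls to `kubectl`):

```python
"""Minimal Level 2 runbook skeleton: a human triggers it, the script executes."""
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("runbook")

def pre_check() -> bool:
    # Placeholder: verify the target exists and is in the expected state.
    return True

def execute() -> bool:
    # Placeholder: the impactful step (restart, rotation, rebuild).
    return True

def post_check() -> bool:
    # Placeholder: confirm the system is healthy after the change.
    return True

def run() -> int:
    """Run each phase in order; stop at the first failure and report per step."""
    for name, step in [("pre-check", pre_check),
                       ("execute", execute),
                       ("post-check", post_check)]:
        if not step():
            log.error("step %s: failed, aborting", name)
            return 1
        log.info("step %s: pass", name)
    return 0
```

Even at this size the script gives you step-level logging and a non-zero exit code on failure, which is what external audit tooling needs to hook into.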
## Safety Checks and Guardrails
Automated runbooks execute faster than humans can intervene. A script with a bug causes more damage in 10 seconds than a human could in 10 minutes.
### Pre-Execution
- Target validation. Confirm the target exists and is in the expected state.
- Environment confirmation. Verify you are operating in the intended environment.
- Dependency health. Check that monitoring, logging, and backup systems are available.
- Concurrency guard. Ensure another instance is not already running against the same target.
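Two of these checks can be sketched in a few lines of Python; the lock path and the `RUNBOOK_ENV` variable are hypothetical names for illustration, not a standard:

```python
import os

LOCK_PATH = "/tmp/runbook-cache-rebuild.lock"  # hypothetical lock location
EXPECTED_ENV = "staging"                       # hypothetical target environment

def acquire_lock(path: str = LOCK_PATH) -> bool:
    """Concurrency guard: atomically create a lock file; fail if it exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another instance is already running against this target
    os.write(fd, str(os.getpid()).encode())  # record who holds the lock
    os.close(fd)
    return True

def confirm_environment(expected: str = EXPECTED_ENV) -> bool:
    """Environment confirmation: refuse to run unless RUNBOOK_ENV matches."""
    return os.environ.get("RUNBOOK_ENV") == expected
```

The `O_CREAT | O_EXCL` combination makes lock creation atomic, so two concurrent executions cannot both believe they hold the lock.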
### During Execution
- Step validation. After each step, verify the expected outcome before proceeding.
- Rate limiting. Process multiple targets in batches with pauses between them.
- Timeout enforcement. Every step has a timeout. Hanging steps fail the runbook, not block it.
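Rate limiting and timeout enforcement together might look like the following sketch; the batch size, pause, and timeout values are illustrative, and `restart_target` is a placeholder for the real per-target command:

```python
import subprocess
import time

BATCH_SIZE = 5        # illustrative: targets per batch
PAUSE_SECONDS = 30    # illustrative: settle time between batches
STEP_TIMEOUT = 60     # illustrative: per-step timeout in seconds

def restart_target(target: str) -> None:
    # Placeholder command; the timeout turns a hanging step into a failure
    # (subprocess.TimeoutExpired) instead of blocking the runbook forever.
    subprocess.run(["echo", f"restart {target}"], check=True,
                   timeout=STEP_TIMEOUT, stdout=subprocess.DEVNULL)

def process_in_batches(targets: list,
                       batch_size: int = BATCH_SIZE,
                       pause: float = PAUSE_SECONDS) -> int:
    """Rate limiting: act on targets in batches, pausing between them."""
    processed = 0
    for i in range(0, len(targets), batch_size):
        for target in targets[i:i + batch_size]:
            restart_target(target)
            processed += 1
        if i + batch_size < len(targets):
            time.sleep(pause)  # let the system settle before the next batch
    return processed
```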
### Post-Execution
- Health verification. Verify the system is healthy after completion.
- Metric comparison. Compare key metrics against the pre-execution baseline.
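Metric comparison can be as simple as checking each post-execution value against the baseline with a tolerance; this sketch assumes lower-is-better metrics and an illustrative 10% threshold:

```python
# Pre-execution baseline (illustrative values for lower-is-better metrics).
BASELINE = {"error_rate": 0.01, "p99_latency_ms": 250.0}
TOLERANCE = 0.10  # allow up to a 10% regression before flagging

def metrics_regressed(baseline: dict, current: dict,
                      tolerance: float = TOLERANCE) -> list:
    """Return the names of metrics that worsened beyond the tolerance."""
    regressed = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is not None and value > base * (1 + tolerance):
            regressed.append(name)
    return regressed
```

A non-empty result should fail the runbook and trigger rollback rather than be logged and ignored.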
## Approval Workflows
Semi-automated runbooks pause for human approval at critical points.
**Pre-execution approval:** The runbook waits for a human to approve before taking any action. Good for scheduled maintenance where an engineer reviews conditions first.

**Mid-execution approval:** Preparatory steps (gathering data, backups, pre-checks) run automatically, then the runbook pauses before the impactful step. The human sees the pre-check results and decides.

**Escalating approval:** Low-risk steps execute automatically. If unexpected conditions arise or a higher-risk action is needed, the runbook pauses for senior approval.
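The mid-execution pattern can be sketched as a generic gate; `prepare`, `approve`, and `execute` are placeholders for the actual pre-checks, approval mechanism (chat prompt, ticket, UI button), and impactful step:

```python
from typing import Callable

def run_with_approval(prepare: Callable[[], dict],
                      approve: Callable[[dict], bool],
                      execute: Callable[[], None]) -> bool:
    """Mid-execution approval: prep runs automatically; the impactful step
    only runs after a human decision based on the prep results."""
    precheck_results = prepare()       # gather data, take backups, run pre-checks
    if not approve(precheck_results):  # human reviews the results and decides
        return False                   # declined: nothing impactful has run
    execute()                          # approved: perform the impactful step
    return True
```

The key property is that a declined approval leaves the system untouched: everything before the gate is read-only or reversible by construction.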
## Audit Trails
Every execution must produce an audit trail answering: who triggered it, when, what it did, and what the outcome was.
```json
{
  "runbook": "cache-rebuild",
  "execution_id": "exec-2026-0222-143052",
  "triggered_by": "alert:CacheHitRateLow",
  "triggered_at": "2026-02-22T14:30:52Z",
  "approved_by": "oncall-sre@company.com",
  "target": "api-server/production",
  "steps_executed": [
    {"step": "pre-check", "status": "pass", "timestamp": "2026-02-22T14:31:16Z"},
    {"step": "cache-rebuild", "status": "pass", "timestamp": "2026-02-22T14:31:45Z"},
    {"step": "post-check", "status": "pass", "timestamp": "2026-02-22T14:32:48Z"}
  ],
  "outcome": "success",
  "baseline_metrics": {"cache_hit_rate": 0.45},
  "post_metrics": {"cache_hit_rate": 0.92}
}
```

Store audit records in a durable, append-only system. The automation system should not be able to modify or delete its own logs.
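One minimal way to produce such records is JSON Lines appended to a log the runner only ever opens in append mode; the path here is hypothetical, and true tamper-resistance still requires shipping the file to an external store the runner cannot write to:

```python
import json
import time

AUDIT_LOG = "/var/log/runbooks/audit.jsonl"  # hypothetical destination

def append_audit_record(record: dict, path: str = AUDIT_LOG) -> None:
    """Write one JSON record per line; the runner only opens the log in append mode."""
    record.setdefault("recorded_at",
                      time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()))
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
```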
## Agent Operational Notes
- Start at Level 2. Script the steps but keep human triggering. Move to Level 3 only after multiple successful scripted executions.
- Never skip pre-checks. A pre-check that prevents one bad execution justifies its existence for every future run.
- Make rollback the default. If any step fails or any post-check deviates from expected results, roll back rather than continue.
- Log at the step level. “Steps 1-5 succeeded, step 6 skipped (condition not met), step 7 succeeded” is useful. “The runbook succeeded” is not.
- Treat automation code as production code. Version control, code review, testing, and staged rollouts apply to runbook automation.
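The rollback-by-default note can be expressed as a small wrapper; this is a sketch, with `step` and `rollback` standing in for real operations:

```python
from typing import Callable

def run_step_with_rollback(step: Callable[[], bool],
                           rollback: Callable[[], None]) -> bool:
    """Rollback is the default: a False result or an exception triggers rollback."""
    try:
        if step():
            return True
    except Exception:
        pass  # treat an unexpected error the same as an explicit failure
    rollback()
    return False
```

Wrapping every impactful step this way means "continue despite failure" requires a deliberate decision rather than being the accidental default.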