Agent Runbook Generation#

An agent that says “you should probably add a readiness probe to your deployment” is giving advice. An agent that hands you a tested manifest with the readiness probe configured, verified against a real cluster, with rollback steps in case the probe is misconfigured – that agent is producing a deliverable. The difference matters.

The core thesis of infrastructure agent work is that the output is always a deliverable – a runbook, playbook, tested manifest, or validated configuration – never a direct action on someone else’s systems. This article covers the complete workflow for generating those deliverables: understanding requirements, planning steps, executing in a sandbox, capturing what worked, and packaging the result.

Why Deliverables Beat Advice#

When an agent responds to “help me deploy Redis on Kubernetes” with a block of YAML and some explanation, the human gets something that might work. They paste it into a file, apply it, and discover that the storage class does not exist in their cluster, or the resource requests exceed their node capacity, or the password was not base64-encoded correctly.

A tested deliverable eliminates that gap. The agent deployed Redis in a sandbox, confirmed it started, verified the connection, and documented every step. The human receives output that has been proven to work in at least one real environment, along with clear instructions for adapting it to their own.

Three properties make deliverables superior to advice:

Reproducibility. Every step was executed and recorded. The human can replay the exact sequence.

Verified correctness. The deliverable was tested, not theorized. If step 3 depends on step 2 completing successfully, that dependency was exercised.

Error resolution included. When something failed during sandbox testing, the agent fixed it. That fix – and the error that triggered it – is documented in the runbook. The human benefits from problems already solved.

The Runbook Generation Workflow#

Runbook generation follows a five-phase sequence. Each phase produces artifacts that feed the next.

Phase 1: Understand Requirements#

Before writing any configuration, the agent must understand what the human actually needs. This means asking explicit questions when information is missing and never assuming defaults that could be wrong.

Key inputs to gather:

  • Target environment. Kubernetes version, cloud provider, cluster type (minikube, EKS, AKS, GKE, bare metal). The same Helm chart behaves differently on EKS versus minikube.
  • Architecture. CPU architecture of the target nodes. ARM64 versus x86_64 changes which container images are available.
  • Existing infrastructure. What is already running? Does a database exist, or does one need to be created? Is there an ingress controller?
  • Constraints. Resource limits, namespace policies, network policies, security requirements, compliance needs.
  • Success criteria. What does “working” mean? A pod in Running state? A successful HTTP health check? Data persisted across restarts?
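
Several of these inputs can be read straight off the target cluster instead of asked for. A few illustrative probes, assuming kubectl access and jq installed:

# Illustrative environment probes: read inputs off the cluster where possible.
kubectl version -o json | jq -r '.serverVersion.gitVersion'    # Kubernetes version
kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}'    # amd64 vs arm64
kubectl get storageclass     # which storage classes actually exist
kubectl get ingressclass     # is an ingress controller installed?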

A requirements document for a Helm-deployed PostgreSQL with an application might look like:

# requirements.yaml
target:
  kubernetes_version: "1.29"
  cloud_provider: "aws"
  cluster_type: "eks"
  architecture: "amd64"
  namespace: "production"

dependencies:
  existing:
    - ingress_controller: "nginx"
    - cert_manager: true
  to_create:
    - postgresql: "15"
    - application: "myapp:2.1.0"

constraints:
  max_cpu_per_pod: "2"
  max_memory_per_pod: "4Gi"
  storage_class: "gp3"
  network_policy: "deny-all-default"

success_criteria:
  - "PostgreSQL pod is Running and accepting connections"
  - "Application pod passes readiness probe"
  - "Application can read/write to PostgreSQL"
  - "Ingress routes traffic to application"

Phase 2: Plan Steps#

With requirements understood, the agent plans the deployment sequence. This is not just listing the kubectl apply commands – it is ordering operations by their dependencies and identifying where verification checks are needed.

A good plan identifies:

  • Dependency order. The database must be running before the application starts. The namespace must exist before anything is deployed into it.
  • Verification points. After each major step, what should the agent check before proceeding? A pod in Running state, a service with endpoints, a successful connection test.
  • Known risks. What could go wrong at each step? PostgreSQL 15+ permission changes, storage class availability, image pull failures on private registries.
  • Rollback boundaries. If step 4 fails, what needs to be undone from steps 1-3?

Plan: Deploy myapp with PostgreSQL on EKS

1. Create namespace "production"
   Verify: namespace exists and is Active

2. Deploy PostgreSQL via Bitnami Helm chart
   Verify: pod Running, accepting connections on port 5432
   Risk: PostgreSQL 15 schema permissions (OWNER fix needed)

3. Run database initialization (schema + seed data)
   Verify: tables exist, seed data queryable
   Risk: init script fails silently if using \c metacommand

4. Deploy application with database connection string
   Verify: pod Running, readiness probe passing
   Risk: connection refused if database not ready

5. Configure ingress for external access
   Verify: ingress has assigned address, HTTP 200 on health endpoint

Rollback: helm uninstall in reverse order (app, then postgres), delete namespace

Phase 3: Execute in Sandbox#

This is where the plan becomes reality. The agent executes each step in a sandbox environment, recording commands, outputs, timing, and any errors encountered.

The sandbox execution produces a raw log – every command run, every output received, every error hit and resolved. This raw material becomes the runbook.

# Step 1: Create namespace
$ kubectl create namespace production
namespace/production created

# Step 1 verification
$ kubectl get namespace production -o jsonpath='{.status.phase}'
Active

# Step 2: Deploy PostgreSQL
$ helm install db bitnami/postgresql \
    --namespace production \
    --set auth.postgresPassword=adminpass \
    --set auth.database=myapp \
    --set auth.username=myapp \
    --set auth.password=apppass \
    --set primary.persistence.storageClass=gp3 \
    --set primary.resources.requests.memory=512Mi \
    --set primary.resources.requests.cpu=250m
NAME: db
NAMESPACE: production
STATUS: deployed

# Step 2 verification (with wait)
$ kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=postgresql \
    --namespace production --timeout=120s
pod/db-postgresql-0 condition met

# Step 2 connection test
$ kubectl exec -it db-postgresql-0 -n production -- \
    psql -U myapp -d myapp -c "SELECT 1 AS connection_test;"
 connection_test
-----------------
               1
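
How the raw log gets captured is up to the harness. One lightweight approach is a wrapper that records each command, its output, its exit code, and a timestamp; a minimal sketch (run_step and the log path are illustrative names, not standard tooling):

#!/usr/bin/env bash
# Minimal session recorder: every step is echoed, executed, and appended
# to a raw log that later becomes the runbook.
SESSION_LOG="sandbox-session.log"

run_step() {
    local description="$1"; shift
    {
        echo "# $description"
        echo "\$ $*"
        "$@" 2>&1            # run the command, capturing stdout and stderr
        echo "exit_code=$? at $(date -u +%FT%TZ)"
        echo
    } | tee -a "$SESSION_LOG"
}

run_step "Step 1: Create namespace" kubectl create namespace production
run_step "Step 1 verification" kubectl get namespace production -o jsonpath='{.status.phase}'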

Phase 4: Capture What Worked (and What Failed)#

During sandbox execution, some steps will fail. Those failures are valuable – they are problems the human would have hit. The agent fixes them in the sandbox and records both the error and the resolution.

This is the most important phase for deliverable quality. An agent that hits the PostgreSQL 15 permissions issue, solves it, and documents the fix in the runbook has produced something genuinely useful. An agent that happens to test against PostgreSQL 14 and never encounters the issue has produced a runbook with a latent bug.

Example of a captured error resolution:

## Step 3: Database Initialization

### Initial Attempt (Failed)

Command:

kubectl exec -it db-postgresql-0 -n production -- \
    psql -U myapp -d myapp -c "CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT);"


Error:

ERROR: permission denied for schema public


### Root Cause

PostgreSQL 15 changed default permissions on the public schema. The `myapp` user has
CONNECT and USAGE privileges but cannot create objects in the public schema unless
it owns the schema.

### Resolution

kubectl exec -it db-postgresql-0 -n production -- \
    psql -U postgres -d myapp -c "ALTER SCHEMA public OWNER TO myapp;"


After this fix, the CREATE TABLE command succeeds. This ownership change should be
included in the Helm values as an initdb script to run automatically on first boot.
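
A hedged sketch of what that values file could look like, mirroring the initdbScripts key used in the Step 2 example later in this article (verify the exact key path against your chart version's values schema):

# Write postgres-values.yaml carrying the ownership fix as an initdb script.
cat > postgres-values.yaml <<'EOF'
initdbScripts:
  fix-permissions.sh: |
    #!/bin/bash
    PGPASSWORD="$POSTGRES_PASSWORD" psql -U postgres -d myapp \
      -c "ALTER SCHEMA public OWNER TO myapp;"
EOF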

Phase 5: Package as Deliverable#

The raw sandbox session becomes a polished runbook. The agent organizes the output into a format humans can follow and machines can parse.

Runbook Output Formats#

A good runbook ships in multiple formats because different consumers need different things.

Markdown for Humans#

The primary format. Readable, printable, reviewable in a pull request.

# Runbook: Deploy myapp with PostgreSQL on EKS

## Prerequisites
- EKS cluster running Kubernetes 1.29+
- kubectl configured for target cluster
- Helm 3.x installed
- nginx ingress controller running
- gp3 StorageClass available

## Steps

### 1. Create Namespace

kubectl create namespace production

**Verify:** `kubectl get namespace production` shows status Active.

### 2. Deploy PostgreSQL

helm install db bitnami/postgresql \
    --namespace production \
    --values postgres-values.yaml

**Verify:** `kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=postgresql -n production --timeout=120s`

**Known issue:** PostgreSQL 15+ requires schema ownership fix. The included
postgres-values.yaml handles this via an initdb script. If deploying to
PostgreSQL 14 or earlier, the initdb script is harmless but unnecessary.

### Rollback

helm uninstall myapp -n production
helm uninstall db -n production
kubectl delete namespace production


### What Success Looks Like
- `kubectl get pods -n production` shows all pods Running, 1/1 Ready
- `curl -f https://myapp.example.com/health` returns HTTP 200
- Application logs show successful database connection

Structured YAML for Agents#

When another agent consumes the runbook, it needs structured data, not prose.

runbook:
  name: "deploy-myapp-postgresql-eks"
  version: "1.0.0"
  tested_on:
    kubernetes: "1.29"
    helm: "3.14"
    postgresql_chart: "14.2.3"
    cloud: "aws-eks"

  prerequisites:
    - tool: "kubectl"
      min_version: "1.28"
    - tool: "helm"
      min_version: "3.12"
    - resource: "storageclass/gp3"
    - resource: "ingressclass/nginx"

  steps:
    - id: "create-namespace"
      command: "kubectl create namespace production"
      verify:
        command: "kubectl get namespace production -o jsonpath='{.status.phase}'"
        expect: "Active"
      rollback: "kubectl delete namespace production"

    - id: "deploy-postgresql"
      command: "helm install db bitnami/postgresql -n production -f postgres-values.yaml"
      verify:
        command: "kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=postgresql -n production --timeout=120s"
        expect_exit_code: 0
      rollback: "helm uninstall db -n production"
      known_issues:
        - id: "pg15-schema-permissions"
          description: "PostgreSQL 15+ requires ALTER SCHEMA public OWNER TO for DDL"
          resolved_in: "postgres-values.yaml initdb script"

  success_criteria:
    - command: "kubectl get pods -n production -o jsonpath='{.items[*].status.phase}'"
      expect_all: "Running"
    - command: "curl -sf https://myapp.example.com/health"
      expect_exit_code: 0
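
A consuming agent can walk this structure mechanically: run each step, run its verify command, compare against the expectation, and unwind completed steps on failure. A minimal sketch of such an executor, assuming the YAML above is saved as runbook.yaml and yq v4 is installed (not part of any standard tooling):

#!/usr/bin/env bash
# Illustrative runbook executor: execute steps in order, verify each one,
# roll back completed steps in reverse order on any failure.
set -uo pipefail
RUNBOOK="runbook.yaml"
COMPLETED=()

rollback() {
    for ((i=${#COMPLETED[@]}-1; i>=0; i--)); do
        local id="${COMPLETED[$i]}"
        local cmd
        cmd=$(yq ".runbook.steps[] | select(.id == \"$id\") | .rollback" "$RUNBOOK")
        echo "Rolling back $id"
        eval "$cmd" || echo "WARNING: rollback of $id failed; manual cleanup needed"
    done
}

count=$(yq '.runbook.steps | length' "$RUNBOOK")
for ((n=0; n<count; n++)); do
    id=$(yq ".runbook.steps[$n].id" "$RUNBOOK")
    cmd=$(yq ".runbook.steps[$n].command" "$RUNBOOK")
    verify=$(yq ".runbook.steps[$n].verify.command" "$RUNBOOK")
    expect=$(yq ".runbook.steps[$n].verify.expect // \"\"" "$RUNBOOK")

    echo "=== $id ==="
    eval "$cmd" || { echo "Step $id failed"; rollback; exit 1; }

    out=$(eval "$verify") || { echo "Verification for $id failed"; rollback; exit 1; }
    if [[ -n "$expect" && "$out" != "$expect" ]]; then
        echo "Expected '$expect', got '$out'"; rollback; exit 1
    fi
    COMPLETED+=("$id")
done
echo "All steps completed and verified."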

Executable Shell Script#

For automation: every step from the runbook in executable form, with verification built in.

#!/usr/bin/env bash
set -euo pipefail

# Runbook: Deploy myapp with PostgreSQL on EKS
# Generated: 2026-02-22
# Tested on: Kubernetes 1.29, Helm 3.14, EKS

NAMESPACE="production"
TIMEOUT="120s"

echo "=== Step 1: Create namespace ==="
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
kubectl get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' | grep -q "Active"
echo "Namespace verified."

echo "=== Step 2: Deploy PostgreSQL ==="
helm upgrade --install db bitnami/postgresql \
    --namespace "$NAMESPACE" \
    --values postgres-values.yaml \
    --wait --timeout "$TIMEOUT"
echo "PostgreSQL deployed and ready."

echo "=== Step 3: Deploy application ==="
helm upgrade --install myapp ./charts/myapp \
    --namespace "$NAMESPACE" \
    --values myapp-values.yaml \
    --wait --timeout "$TIMEOUT"
echo "Application deployed and ready."

echo "=== Verification ==="
kubectl get pods -n "$NAMESPACE"
echo ""
echo "All steps completed. Verify application health at your ingress endpoint."

What Makes a Good Runbook#

After hundreds of generated runbooks, patterns emerge that separate useful ones from decorative ones.

Prerequisites are explicit and verifiable. “You need Helm” is not a prerequisite. “Helm 3.12+ installed (verify: helm version --short)” is a prerequisite. Each prerequisite should include a command the human can run to confirm they meet it.
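
A preflight block that mechanizes this, as a sketch (the version regex is illustrative; adjust to your actual minimums):

# Fail fast if the prerequisites are not met, before touching the cluster.
command -v kubectl >/dev/null 2>&1 || { echo "MISSING: kubectl"; exit 1; }
command -v helm >/dev/null 2>&1 || { echo "MISSING: helm"; exit 1; }
helm version --short | grep -Eq '^v3\.(1[2-9]|[2-9][0-9])' \
    || { echo "NEED: Helm 3.12+ (found: $(helm version --short))"; exit 1; }
echo "Preflight checks passed."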

Every step has a verification check. The human should never wonder “did that work?” after running a command. The runbook tells them exactly what to check and what the expected output looks like.

Rollback procedures are specific. “Undo the changes” is not a rollback procedure. “Run `helm uninstall db -n production` then `kubectl delete pvc data-db-postgresql-0 -n production`” is a rollback procedure. Note the PVC deletion – Helm uninstall does not remove PVCs by default, and leaving orphaned storage is a common post-rollback surprise.
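
After any uninstall, it is worth checking for exactly this kind of leftover; an illustrative check using the article's example names:

# List anything left behind after the uninstall, then remove the orphaned claim.
kubectl get pvc -n production
kubectl delete pvc data-db-postgresql-0 -n production    # helm uninstall leaves this in place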

Error resolutions are documented inline. When the agent hit an error during sandbox testing and resolved it, that resolution appears next to the step where it occurred, not in an appendix. The human encounters the same context the agent did.

“What success looks like” is concrete. Not “the application should be working” but “all pods show Running 1/1, the health endpoint returns HTTP 200, and the application logs contain ‘Connected to database’ within the first 30 seconds of startup.”
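
Those criteria are concrete precisely because they are checkable. A sketch of a combined success check (the app=myapp label and health URL are this article's example values):

# Verify the three success criteria in one pass.
set -e
kubectl get pods -n production --no-headers | \
    awk '$2 != "1/1" || $3 != "Running" { print "NOT READY: " $0; exit 1 }'
curl -sf https://myapp.example.com/health >/dev/null && echo "Health endpoint: HTTP 200"
kubectl logs -l app=myapp -n production --tail=100 | grep -q "Connected to database" \
    && echo "Database connection confirmed in logs"   # presence check only; the 30-second window needs timestamps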

Versioning Runbooks#

Infrastructure changes underneath runbooks. A runbook tested against Kubernetes 1.29 may not work on 1.31 if deprecated APIs were removed. A runbook using Bitnami PostgreSQL chart 14.x may break on 15.x if the values schema changed.

Every runbook should record the versions it was tested against:

tested_versions:
  kubernetes: "1.29.2"
  helm: "3.14.0"
  charts:
    bitnami/postgresql: "14.2.3"
    app_chart: "2.1.0"
  images:
    postgresql: "docker.io/bitnami/postgresql:15.6.0-debian-12-r5"
    application: "myregistry/myapp:2.1.0"
  cloud_provider:
    name: "aws"
    service: "eks"
    region: "us-east-1"

When a human uses the runbook against different versions, they know which components might behave differently. The Kubernetes version matters most – API deprecations between minor versions regularly break manifests. Chart versions matter next – Helm chart maintainers rename values keys, change default behaviors, and restructure templates between major versions.
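
The version block does not need to be written by hand. An illustrative helper that records it from the live sandbox session (assumes jq is installed):

# Capture tested versions from the session that actually ran.
{
    echo "tested_versions:"
    echo "  kubernetes: \"$(kubectl version -o json | jq -r '.serverVersion.gitVersion')\""
    echo "  helm: \"$(helm version --template '{{.Version}}')\""
    echo "  charts:"
    helm list -n production -o json | jq -r '.[] | "    - \(.chart)"'
} > tested-versions.yaml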

Practical Example: Complete Runbook from a Sandbox Session#

Here is a condensed but complete runbook generated from a sandbox session deploying a Helm chart with a database dependency. This shows the full flow from sandbox execution to packaged deliverable.

# Runbook: Deploy Notification Service with PostgreSQL

**Generated:** 2026-02-22
**Tested on:** Kubernetes 1.29.2 (EKS), Helm 3.14.0
**Sandbox result:** All steps passed, 2 issues encountered and resolved

## Prerequisites

| Requirement | Verify Command | Expected |
|---|---|---|
| kubectl 1.28+ | `kubectl version --client` | v1.28+ |
| Helm 3.12+ | `helm version --short` | v3.12+ |
| EKS cluster access | `kubectl get nodes` | Node list returned |
| gp3 StorageClass | `kubectl get sc gp3` | StorageClass found |

## Step 1: Create Namespace

kubectl create namespace notifications

**Verify:** `kubectl get ns notifications -o jsonpath='{.status.phase}'` returns `Active`

## Step 2: Deploy PostgreSQL

helm install notif-db bitnami/postgresql \
    --namespace notifications \
    --set auth.postgresPassword=CHANGE_ME \
    --set auth.database=notifications \
    --set auth.username=notif_user \
    --set auth.password=CHANGE_ME \
    --set primary.persistence.storageClass=gp3 \
    --set primary.persistence.size=20Gi \
    --set primary.resources.requests.memory=512Mi \
    --set primary.resources.requests.cpu=250m \
    --set initdbScripts."fix-permissions\.sh"="#!/bin/bash
PGPASSWORD=\"\$POSTGRES_PASSWORD\" psql -U postgres -d notifications -c \"ALTER SCHEMA public OWNER TO notif_user;\"
PGPASSWORD=\"\$POSTGRES_PASSWORD\" psql -U postgres -d notifications -c \"ALTER DATABASE notifications OWNER TO notif_user;\""

**Verify:** `kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=postgresql -n notifications --timeout=120s`

**Issue encountered during testing:** Without the initdbScripts block, the application
failed with `permission denied for schema public` when running its first migration.
This is a PostgreSQL 15+ behavior change. The initdb script above resolves it by
transferring schema ownership on first boot.

## Step 3: Deploy Notification Service

helm install notif-app ./charts/notification-service \
    --namespace notifications \
    --set database.host=notif-db-postgresql \
    --set database.port=5432 \
    --set database.name=notifications \
    --set database.user=notif_user \
    --set database.password=CHANGE_ME \
    --set resources.requests.memory=256Mi \
    --set resources.requests.cpu=100m

**Verify:**
- `kubectl wait --for=condition=ready pod -l app=notification-service -n notifications --timeout=120s`
- `kubectl logs -l app=notification-service -n notifications | grep "Connected to database"` shows connection success

**Issue encountered during testing:** The application crashed on startup with
`connection refused` when deployed simultaneously with PostgreSQL. Adding
`--wait` to the PostgreSQL helm install (Step 2) or using an init container
that waits for the database resolves this. The chart includes an init container
by default when `database.waitForReady=true`.

## Rollback

Execute in reverse order:
helm uninstall notif-app -n notifications
helm uninstall notif-db -n notifications
kubectl delete pvc data-notif-db-postgresql-0 -n notifications
kubectl delete namespace notifications

Note: The PVC must be deleted explicitly. Helm uninstall does not remove
PersistentVolumeClaims. Leaving the PVC orphaned wastes storage and can
cause conflicts if you reinstall with the same release name.

## What Success Looks Like

`kubectl get pods -n notifications` should show:

NAME                                    READY   STATUS    RESTARTS   AGE
notif-app-notification-service-xxx      1/1     Running   0          2m
notif-db-postgresql-0                   1/1     Running   0          4m

The notification service health endpoint returns HTTP 200.
Application logs show database migrations completed successfully.

This runbook captures two real issues the agent encountered and resolved during sandbox testing. A human following it will not hit those issues because the fixes are built into the steps. That is the value of tested deliverables over advice.