Validation Playbook Format

A validation playbook is a structured procedure that tells an agent exactly how to validate a specific type of infrastructure change. The key problem it solves: the same validation (for example, “verify this Helm chart works”) requires different commands depending on whether the agent has access to kind, minikube, a cloud cluster, or nothing but a linter. A playbook encodes all path variants in one document so the agent picks the right commands for its environment.

This article defines the playbook format, explains how agents should read and execute playbooks, covers graceful degradation when the recommended path is unavailable, and provides three complete example playbooks.

The Playbook Format

Every playbook follows this structure:

playbook:
  name: "short-identifier"
  description: "What this playbook validates and why"
  recommended_path: 2  # The path level that gives best results
  minimum_path: 0      # The lowest path that provides any value

  prerequisites:
    tools:
      - name: "helm"
        required_for_paths: [0, 1, 2, 3, 4]
      - name: "kubectl"
        required_for_paths: [1, 2, 3, 4]
      - name: "kind"
        required_for_paths: [1, 4]
      - name: "minikube"
        required_for_paths: [2]
    files:
      - path: "Chart.yaml"
        description: "Helm chart metadata"

  steps:
    - name: "step-identifier"
      description: "What this step does"
      variants:
        path_0:
          commands:
            - "helm lint ./my-chart"
          expected_output: "0 chart(s) linted, 0 chart(s) failed"
          validates: "Chart structure and syntax"
        path_1:
          commands:
            - "kind create cluster --name test"
            - "helm install test ./my-chart --wait"
          expected_output: "STATUS: deployed"
          validates: "Chart installs and pods start"
          teardown:
            - "kind delete cluster --name test"
        path_2:
          commands:
            - "minikube start"
            - "helm install test ./my-chart --wait"
          expected_output: "STATUS: deployed"
          validates: "Chart installs with full addon support"
          teardown:
            - "minikube delete"

  verification:
    - check: "description of what to verify"
      command: "command to run"
      success_criteria: "what output indicates success"
      failure_action: "what to do if this check fails"

  outputs:
    success: "Summary statement when all checks pass"
    partial: "Summary when some checks pass with noted gaps"
    failure: "Summary when critical checks fail"

Format Fields Explained

name: A short, dash-separated identifier. Used in logs and reports to reference the playbook.

description: One to two sentences explaining what this playbook validates and when an agent should use it.

recommended_path: The validation path (0-4) that provides the best fidelity for this playbook. This is what an agent should use if resources are available.

minimum_path: The lowest path that still provides meaningful validation. Below this, the playbook cannot validate anything useful for its stated purpose.

prerequisites.tools: Each tool required, annotated with which paths need it. An agent checks its available tools against this list to determine which paths it can execute.

prerequisites.files: Files that must exist before the playbook runs. The agent verifies these before starting.

steps: The ordered sequence of validation actions. Each step has variants for different paths. An agent selects the variant matching its chosen path.

steps.variants.commands: Shell commands to execute, in order. If any command fails, the step fails.

steps.variants.expected_output: A string or pattern the agent should look for in command output to confirm success.

steps.variants.validates: A human-readable statement of what this variant actually proves. This becomes part of the final report.

steps.variants.teardown: Cleanup commands for resources the step creates. The agent runs them regardless of success or failure, deferring them until no later step or verification check still needs those resources. These are required for paths that create resources.

verification: Post-deployment checks that run after all steps complete. These are path-independent where possible, though some checks only work on certain paths.

outputs: Template statements for the final validation report.

How Agents Execute Playbooks

An agent encountering a validation playbook should follow this procedure:

Step 1: Assess Available Resources

Before looking at the playbook’s steps, the agent determines what it has access to.

Check: Is Docker running?         -> Paths 1, 2, 3 possible
Check: Is kind installed?         -> Path 1 possible
Check: Is minikube installed?     -> Path 2 possible
Check: Are cloud CLIs configured? -> Path 3 possible
Check: Is internet available?     -> Path 4 possible
Always available:                 -> Path 0

This assessment produces a list of available paths. The agent does not need to install tools – it works with what is already present.
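
A minimal sketch of this assessment in shell. The probe commands (docker info, aws sts get-caller-identity, a plain HTTPS request) are illustrative assumptions, not a fixed contract:

#!/usr/bin/env bash
# Sketch: detect which validation paths this environment supports.
available_paths=(0)                                   # Path 0 (static tools) is always available

if docker info > /dev/null 2>&1; then                 # Docker daemon is running
  command -v kind     > /dev/null 2>&1 && available_paths+=(1)
  command -v minikube > /dev/null 2>&1 && available_paths+=(2)
fi

# Path 3: a configured cloud CLI (AWS shown as an example; other providers need their own probe)
if command -v aws > /dev/null 2>&1 && aws sts get-caller-identity > /dev/null 2>&1; then
  available_paths+=(3)
fi

# Path 4: outbound internet access
if curl -fsS --max-time 5 https://example.com > /dev/null 2>&1; then
  available_paths+=(4)
fi

echo "Available paths: ${available_paths[*]}"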

Step 2: Select a Path

Compare available paths against the playbook’s recommended_path and minimum_path.

If recommended_path is available:
  Use it.
Else if any path between minimum_path and recommended_path is available:
  Use the highest available path.
Else if minimum_path is available:
  Use it, but flag reduced fidelity in the report.
Else:
  Report that validation cannot be performed.
  List what tools would be needed for minimum_path.
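
The same logic in shell, assuming the available_paths array from Step 1 and recommended_path / minimum_path values already read out of the playbook (for example with yq). A sketch, not a prescribed implementation:

# Sketch: choose the highest available path between minimum_path and recommended_path.
select_path() {
  local recommended=$1 minimum=$2 p avail
  for ((p = recommended; p >= minimum; p--)); do
    for avail in "${available_paths[@]}"; do
      [[ "$avail" -eq "$p" ]] && { echo "$p"; return 0; }
    done
  done
  return 1
}

if selected=$(select_path "$recommended_path" "$minimum_path"); then
  if [[ "$selected" -lt "$recommended_path" ]]; then
    echo "NOTE: using path ${selected}; fidelity reduced relative to recommended path ${recommended_path}"
  fi
else
  echo "Validation cannot be performed. Tools required for path ${minimum_path} are missing."
fi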

Step 3: Verify Prerequisites

Check that all tools listed for the selected path are available and that all prerequisite files exist.

# Example prerequisite check
for tool in helm kubectl kind; do
  if ! command -v "${tool}" &> /dev/null; then
    echo "MISSING: ${tool}"
    exit 1
  fi
done

if [ ! -f "Chart.yaml" ]; then
  echo "MISSING: Chart.yaml"
  exit 1
fi

Step 4: Execute Steps

For each step in the playbook, execute the variant matching the selected path. If a step has no variant for the selected path, skip it and note it in the report.

The agent must:

  • Execute commands in order within each step.
  • Check output against expected_output after each command.
  • Run teardown commands even if the step fails (see the sketch after this list).
  • Stop execution if a step fails (unless the step is marked as optional).
  • Capture all command output for the report.
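
A minimal sketch of one step's execution. The teardown guarantee is the part most easily broken, so it is registered up front via a shell trap; STEP_COMMANDS, STEP_TEARDOWN, and EXPECTED_OUTPUT are illustrative names assumed to be populated from the selected variant:

# Sketch: run one step's commands; its teardown runs later even if the step fails.
declare -a TEARDOWN_QUEUE=()
trap 'for t in "${TEARDOWN_QUEUE[@]}"; do bash -c "$t" || true; done' EXIT

run_step() {
  local failed=0 output combined=""
  TEARDOWN_QUEUE+=("${STEP_TEARDOWN[@]}")     # register teardown before running anything
  for cmd in "${STEP_COMMANDS[@]}"; do
    output=$(bash -c "$cmd" 2>&1) || failed=1
    combined+="$output"$'\n'
    [[ "$failed" -eq 1 ]] && break            # stop the step at the first failed command
  done
  # Simplification: match expected_output against the step's combined output.
  if [[ "$failed" -eq 0 && -n "${EXPECTED_OUTPUT:-}" ]] && ! grep -Eq "$EXPECTED_OUTPUT" <<< "$combined"; then
    failed=1
  fi
  printf '%s' "$combined" >> step.log         # capture all output for the report
  return "$failed"
}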

Step 5: Run Verification Checks

After all steps complete successfully, run each verification check. These provide the final confirmation that the validation target is working correctly.

Step 6: Generate Report

The report must include:

  • Which path was used.
  • Which steps were executed and their results.
  • Which steps were skipped (and why).
  • What the selected path validates versus what it cannot validate.
  • The appropriate output statement (success, partial, or failure).
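
A sketch of emitting that report, assuming the variables shown were collected during execution; all names are illustrative, and OUTCOME_STATEMENT holds the appropriate success, partial, or failure line from the playbook's outputs block:

# Sketch: write the validation report. PLAYBOOK_NAME, SELECTED_PATH, etc. are illustrative.
cat <<EOF > validation-report.txt
Validation: ${PLAYBOOK_NAME}
Path used: ${SELECTED_PATH} (recommended: ${RECOMMENDED_PATH})

Steps executed:
${EXECUTED_STEPS}

Steps skipped (and why):
${SKIPPED_STEPS:-none}

Validated on this path:
${VALIDATED_LIST}

Not validated on this path:
${NOT_VALIDATED_LIST}

Result: ${OUTCOME_STATEMENT}
EOF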

Graceful Degradation

When the recommended path is not available, the agent degrades gracefully. This means running what it can and being explicit about what it could not verify.

Degradation Strategies

Drop runtime checks, keep static checks. If kind/minikube/cloud are unavailable, every static validation step still runs. The agent validates syntax, schema, policy compliance, and template rendering. It then reports: “Static validation passed. Runtime validation was not performed because [reason]. The following could not be verified: [list].”

Substitute lighter-weight runtime checks. If minikube is unavailable but Docker is, substitute Docker Compose or kind for steps that test container startup and service connectivity. The fidelity is lower, but it still catches container build failures and basic connectivity issues.

Simulate where possible. For cloud-specific checks (IAM policies, storage classes), use dry-run or simulation modes. terraform plan simulates without creating resources. aws iam simulate-principal-policy tests IAM policies without assuming roles. These are not full validation but are better than nothing.
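
For instance, a sketch of the two simulations mentioned above; the module directory, role ARN, and action names are placeholders:

# Plan infrastructure changes without creating anything
terraform -chdir=infra plan -out=tfplan

# Check whether a principal would be allowed to perform specific actions,
# without assuming the role or touching the resources
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/app-role \
  --action-names s3:GetObject dynamodb:PutItem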

Explicitly list gaps. The degradation report must itemize what was not tested. Generic statements like “some checks were skipped” are not useful. Specific statements like “Network policy enforcement was not tested because kind’s default CNI does not support it” tell the reader exactly what risk remains.

Degradation Example

A playbook recommends Path 2 (minikube) for validating a Helm chart with ingress. The agent only has Path 0 (static tools).

What the agent does:

  • Runs helm lint – validates chart structure.
  • Runs helm template | kubeconform – validates generated manifests against schemas.
  • Runs conftest test – validates against policies.
  • Checks that the Ingress resource has correct annotations and host rules by inspecting the rendered template (these checks are sketched together below).
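
Strung together, that static-only pass looks roughly like this. The chart path, policy directory, and the annotation prefix being grepped for are assumptions:

# Path 0 sketch: static-only validation of a chart with ingress
helm lint ./chart

# Render once, then validate the rendered manifests against schemas and policies
helm template test-release ./chart --values values.yaml > rendered.yaml
kubeconform -strict -summary rendered.yaml
conftest test rendered.yaml --policy policy/

# Inspect the rendered Ingress for expected annotations and host rules
grep -A20 "kind: Ingress" rendered.yaml | grep -E "nginx.ingress.kubernetes.io|host:"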

What the agent reports:

Validation: validate-helm-chart
Path used: 0 (static)
Recommended path: 2 (minikube)

Passed:
- Chart lint: no errors
- Schema validation: all manifests valid
- Policy compliance: all policies passed
- Ingress resource: annotations present, host rules defined

Not validated (requires Path 2+):
- Pods actually start and pass health checks
- Ingress controller routes traffic to the service
- TLS termination works with the configured certificate
- Service discovery resolves correctly between pods

Risk: Medium. Static checks confirm the chart is structurally correct,
but runtime behavior (especially ingress routing) is unverified.

Example Playbook 1: Validate Helm Chart

Works on all five paths with different fidelity at each level.

playbook:
  name: "validate-helm-chart"
  description: "Validates a Helm chart from syntax through runtime behavior"
  recommended_path: 2
  minimum_path: 0

  prerequisites:
    tools:
      - name: "helm"
        required_for_paths: [0, 1, 2, 3, 4]
      - name: "kubeconform"
        required_for_paths: [0, 1, 2, 3, 4]
      - name: "kubectl"
        required_for_paths: [1, 2, 3, 4]
      - name: "kind"
        required_for_paths: [1, 4]
      - name: "minikube"
        required_for_paths: [2]
    files:
      - path: "Chart.yaml"
        description: "Helm chart metadata file"
      - path: "values.yaml"
        description: "Default values file"

  steps:
    - name: "lint-chart"
      description: "Run Helm linter to catch structural issues"
      variants:
        path_0: &lint
          commands:
            - "helm lint ./chart"
          expected_output: "0 chart(s) failed"
          validates: "Chart structure, required files, values schema"
        path_1: *lint
        path_2: *lint
        path_3: *lint
        path_4: *lint

    - name: "validate-templates"
      description: "Render templates and validate against Kubernetes schemas"
      variants:
        path_0: &template_validate
          commands:
            - "helm template test-release ./chart --values values.yaml > rendered.yaml"
            - "kubeconform -strict -summary rendered.yaml"
          expected_output: "Summary.*0 invalid"
          validates: "All rendered manifests conform to Kubernetes API schemas"
        path_1: *template_validate
        path_2: *template_validate
        path_3: *template_validate
        path_4: *template_validate

    - name: "dry-run-install"
      description: "Server-side dry run against a real API server"
      variants:
        path_1:
          commands:
            - "kind create cluster --name helm-validate --wait 60s"
            - "helm install test-release ./chart --values values.yaml --dry-run=server"
          expected_output: "STATUS: pending-install"
          validates: "API server accepts the rendered manifests (catches CRDs, admission webhooks)"
          teardown:
            - "kind delete cluster --name helm-validate"
        path_2:
          commands:
            - "minikube start --memory=4096"
            - "helm install test-release ./chart --values values.yaml --dry-run=server"
          expected_output: "STATUS: pending-install"
          validates: "API server accepts the rendered manifests"
          teardown:
            - "minikube delete"

    - name: "full-install"
      description: "Install the chart and verify pods are running"
      variants:
        path_1:
          commands:
            - "kind create cluster --name helm-validate --wait 60s"
            - "helm install test-release ./chart --values values.yaml --wait --timeout 180s"
            - "kubectl get pods -o wide"
            - "kubectl wait --for=condition=ready pods --all --timeout=120s"
          expected_output: "condition met"
          validates: "Chart installs, pods schedule, containers start, readiness probes pass"
          teardown:
            - "kind delete cluster --name helm-validate"
        path_2:
          commands:
            - "minikube start --cpus=4 --memory=8192 --addons=ingress,metrics-server"
            - "helm install test-release ./chart --values values.yaml --wait --timeout 180s"
            - "kubectl get pods,svc,ingress -o wide"
            - "kubectl wait --for=condition=ready pods --all --timeout=120s"
          expected_output: "condition met"
          validates: "Full install including ingress routing and metrics collection"
          teardown:
            - "minikube delete"
        path_3:
          commands:
            - "terraform -chdir=infra/validation init"
            - "terraform -chdir=infra/validation apply -auto-approve"
            - "aws eks update-kubeconfig --name validation-cluster"
            - "helm install test-release ./chart --values values.yaml --wait --timeout 300s"
            - "kubectl get pods,svc,ingress -o wide"
          expected_output: "condition met"
          validates: "Full cloud install with real LBs, storage classes, and IAM"
          teardown:
            - "terraform -chdir=infra/validation destroy -auto-approve"

    - name: "helm-test"
      description: "Run Helm test hooks if defined"
      variants:
        path_1:
          commands:
            - "helm test test-release --timeout 60s"
          expected_output: "PASSED"
          validates: "Chart-defined test hooks pass"
        path_2:
          commands:
            - "helm test test-release --timeout 60s"
          expected_output: "PASSED"
          validates: "Chart-defined test hooks pass"

  verification:
    - check: "No pods in CrashLoopBackOff"
      command: "kubectl get pods --field-selector=status.phase!=Running,status.phase!=Succeeded"
      success_criteria: "No resources found"
      failure_action: "Run kubectl describe pod and kubectl logs on failing pods"
    - check: "No pending PVCs"
      command: "kubectl get pvc --field-selector=status.phase=Pending"
      success_criteria: "No resources found"
      failure_action: "Check storage class availability and PV provisioner"
    - check: "Services have endpoints"
      command: "kubectl get endpoints -o custom-columns=NAME:.metadata.name,ENDPOINTS:.subsets[*].addresses[*].ip"
      success_criteria: "All services show at least one endpoint IP"
      failure_action: "Check selector labels match between Service and Deployment"

  outputs:
    success: "Helm chart validated successfully. All pods running, services have endpoints, tests passed."
    partial: "Helm chart partially validated. Static checks passed. Runtime checks limited by available path."
    failure: "Helm chart validation failed. See step results for details."

Example Playbook 2: Test Database Migration

Requires at least Path 1 (Docker) to be useful. Running migrations requires a real database.

playbook:
  name: "test-database-migration"
  description: "Validates database migrations run cleanly against target PostgreSQL versions"
  recommended_path: 1
  minimum_path: 1

  prerequisites:
    tools:
      - name: "docker"
        required_for_paths: [1, 2, 3, 4]
      - name: "psql"
        required_for_paths: [1, 2, 3, 4]
    files:
      - path: "migrations/"
        description: "Directory containing .sql migration files"

  steps:
    - name: "start-databases"
      description: "Start PostgreSQL containers for each target version"
      variants:
        path_1:
          commands:
            - "docker run -d --name pg15-test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=migtest -p 5415:5432 postgres:15-alpine"
            - "docker run -d --name pg16-test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=migtest -p 5416:5432 postgres:16-alpine"
            - "sleep 5"
            - "docker exec pg15-test pg_isready -U postgres"
            - "docker exec pg16-test pg_isready -U postgres"
          expected_output: "accepting connections"
          validates: "Database containers start and accept connections"
          teardown:
            - "docker rm -f pg15-test pg16-test 2>/dev/null || true"

    - name: "run-migrations"
      description: "Apply all migrations in order to each database"
      variants:
        path_1:
          commands:
            - |
              for port in 5415 5416; do
                echo "=== PostgreSQL on port ${port} ==="
                for f in $(ls migrations/*.sql | sort); do
                  echo "Applying: $(basename $f)"
                  PGPASSWORD=test psql -h localhost -p ${port} -U postgres -d migtest -f "$f" -v ON_ERROR_STOP=1
                done
              done
          expected_output: "no errors"
          validates: "All migrations apply cleanly on PG 15 and PG 16"

    - name: "verify-schema"
      description: "Confirm the final schema matches expectations"
      variants:
        path_1:
          commands:
            - |
              for port in 5415 5416; do
                echo "=== Schema on port ${port} ==="
                PGPASSWORD=test psql -h localhost -p ${port} -U postgres -d migtest \
                  -c "SELECT tablename FROM pg_tables WHERE schemaname='public' ORDER BY tablename;"
              done
          expected_output: "expected table names"
          validates: "Final schema is consistent across PostgreSQL versions"

    - name: "test-rollback"
      description: "If down migrations exist, verify they execute cleanly"
      variants:
        path_1:
          commands:
            - |
              if ls migrations/*down*.sql 1>/dev/null 2>&1; then
                for port in 5415 5416; do
                  for f in $(ls migrations/*down*.sql | sort -r); do
                    PGPASSWORD=test psql -h localhost -p ${port} -U postgres -d migtest -f "$f" -v ON_ERROR_STOP=1
                  done
                done
              else
                echo "No down migrations found, skipping rollback test"
              fi
          validates: "Rollback migrations execute without errors"

  verification:
    - check: "Schema identical across versions"
      command: |
        diff <(PGPASSWORD=test psql -h localhost -p 5415 -U postgres -d migtest -c "\\d+" 2>/dev/null) \
             <(PGPASSWORD=test psql -h localhost -p 5416 -U postgres -d migtest -c "\\d+" 2>/dev/null)
      success_criteria: "No differences in output"
      failure_action: "Review migration SQL for version-specific syntax"

  outputs:
    success: "Migrations validated on PostgreSQL 15 and 16. Schema consistent across versions."
    partial: "Migrations applied but schema differences detected between versions."
    failure: "Migration failed on one or more PostgreSQL versions."

Example Playbook 3: Verify Network Policies

Requires at least Path 2 (minikube with Calico) because kind’s default CNI does not enforce NetworkPolicy resources. Testing network policies without enforcement is meaningless.

playbook:
  name: "verify-network-policies"
  description: "Validates that NetworkPolicy resources correctly restrict traffic between pods"
  recommended_path: 2
  minimum_path: 2

  prerequisites:
    tools:
      - name: "minikube"
        required_for_paths: [2]
      - name: "kubectl"
        required_for_paths: [2, 3]
      - name: "helm"
        required_for_paths: [2, 3]
      - name: "kubeconform"
        required_for_paths: [0]
    files:
      - path: "network-policies/"
        description: "Directory containing NetworkPolicy YAML manifests"

  steps:
    - name: "validate-syntax"
      description: "Static validation of NetworkPolicy manifests"
      variants:
        path_0:
          commands:
            - "kubeconform -strict network-policies/*.yaml"
          expected_output: "0 invalid"
          validates: "NetworkPolicy manifests are syntactically valid"
        path_2: &np_syntax
          commands:
            - "kubeconform -strict network-policies/*.yaml"
          expected_output: "0 invalid"
          validates: "NetworkPolicy manifests are syntactically valid"
        path_3: *np_syntax

    - name: "create-cluster-with-cni"
      description: "Start a cluster with a CNI that enforces network policies"
      variants:
        path_2:
          commands:
            - "minikube start --cpus=4 --memory=8192 --cni=calico"
            - "kubectl wait --for=condition=ready pods -l k8s-app=calico-node -n kube-system --timeout=120s"
          expected_output: "condition met"
          validates: "Calico CNI is running and ready to enforce policies"
          teardown:
            - "minikube delete"
        path_3:
          commands:
            - "terraform -chdir=infra/validation init"
            - "terraform -chdir=infra/validation apply -auto-approve"
          validates: "Cloud cluster with network policy support is running"
          teardown:
            - "terraform -chdir=infra/validation destroy -auto-approve"

    - name: "deploy-test-workloads"
      description: "Deploy pods that will be subjects of network policy testing"
      variants:
        path_2: &deploy_workloads
          commands:
            - |
              kubectl create namespace policy-test
              kubectl -n policy-test run frontend --image=nginx:alpine --labels="role=frontend" --expose --port=80
              kubectl -n policy-test run backend --image=nginx:alpine --labels="role=backend" --expose --port=80
              kubectl -n policy-test run database --image=nginx:alpine --labels="role=database" --expose --port=80
              kubectl -n policy-test wait --for=condition=ready pods --all --timeout=60s
          expected_output: "condition met"
          validates: "Test pods are running with correct labels"
        path_3: *deploy_workloads

    - name: "verify-default-connectivity"
      description: "Confirm all pods can reach each other before policies are applied"
      variants:
        path_2: &verify_open
          commands:
            - |
              kubectl -n policy-test exec frontend -- wget -qO- --timeout=5 http://backend 2>&1 || true
              kubectl -n policy-test exec frontend -- wget -qO- --timeout=5 http://database 2>&1 || true
              kubectl -n policy-test exec backend -- wget -qO- --timeout=5 http://frontend 2>&1 || true
              kubectl -n policy-test exec backend -- wget -qO- --timeout=5 http://database 2>&1 || true
          expected_output: "nginx"
          validates: "Without policies, all pods can communicate freely"
        path_3: *verify_open

    - name: "apply-policies"
      description: "Apply the NetworkPolicy manifests under test"
      variants:
        path_2: &apply_policies
          commands:
            - "kubectl -n policy-test apply -f network-policies/"
            - "sleep 5"
          validates: "Network policies applied. Calico needs a few seconds to program iptables."
        path_3: *apply_policies

    - name: "verify-allowed-traffic"
      description: "Confirm traffic that should be allowed still works"
      variants:
        path_2: &verify_allowed
          commands:
            - |
              echo "Testing: frontend -> backend (should be ALLOWED)"
              kubectl -n policy-test exec frontend -- wget -qO- --timeout=5 http://backend && echo "PASS: frontend->backend allowed" || echo "FAIL: frontend->backend blocked"
              echo "Testing: backend -> database (should be ALLOWED)"
              kubectl -n policy-test exec backend -- wget -qO- --timeout=5 http://database && echo "PASS: backend->database allowed" || echo "FAIL: backend->database blocked"
          expected_output: "PASS"
          validates: "Explicitly allowed traffic paths are functional"
        path_3: *verify_allowed

    - name: "verify-denied-traffic"
      description: "Confirm traffic that should be blocked is actually blocked"
      variants:
        path_2: &verify_denied
          commands:
            - |
              echo "Testing: frontend -> database (should be DENIED)"
              if kubectl -n policy-test exec frontend -- wget -qO- --timeout=5 http://database 2>/dev/null; then
                echo "FAIL: frontend->database was allowed (should be denied)"
                exit 1
              else
                echo "PASS: frontend->database correctly denied"
              fi
          expected_output: "PASS.*correctly denied"
          validates: "Denied traffic paths are actually blocked by the CNI"
        path_3: *verify_denied

  verification:
    - check: "Network policies are active"
      command: "kubectl -n policy-test get networkpolicies"
      success_criteria: "All expected policies listed"
      failure_action: "Check that policies were applied to the correct namespace"
    - check: "No pods restarted during testing"
      command: "kubectl -n policy-test get pods -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount"
      success_criteria: "All restart counts are 0"
      failure_action: "Network policy changes should not cause pod restarts. Investigate if they did."

  outputs:
    success: "Network policies validated. Allowed traffic flows correctly, denied traffic is blocked."
    partial: "Network policies applied but some traffic tests produced unexpected results."
    failure: "Network policy validation failed. Traffic was not restricted as expected."

Why Path 0 is Insufficient for Network Policies

Path 0 can validate that a NetworkPolicy YAML file is syntactically correct and conforms to the Kubernetes schema. It cannot tell you whether the policy actually blocks traffic. A NetworkPolicy that passes kubeconform might still have a selector that matches nothing, an ingress rule with the wrong port, or a policyTypes field that omits egress when egress rules are present. These are logic errors that only manifest at runtime with an enforcing CNI.

If an agent only has Path 0 available for network policy validation, it should run the syntax check, report what it verified, and explicitly state: “Network policy enforcement was not tested. A cluster with a CNI that supports NetworkPolicy (Calico, Cilium, or a cloud provider CNI) is required to verify that traffic is actually restricted.”

Building Custom Playbooks

When none of the provided playbooks match a validation scenario, agents can construct custom playbooks following the format above. The process is:

  1. Define what needs validation. Be specific: “Verify that the Redis Sentinel failover works when the master pod is deleted” is a good target. “Test Redis” is too vague.

  2. Determine the minimum path. What is the lightest environment that can answer the question? If you need a running Redis cluster, Path 1 (Docker/kind) is the minimum. If you need to test cloud-specific Redis (ElastiCache), Path 3 is minimum.

  3. Write steps from bottom up. Start with the simplest path variant and add complexity for higher paths. Path 0 always starts with static validation. Path 1 adds container-level checks. Path 2 adds Kubernetes-specific checks. Path 3 adds cloud-specific checks.

  4. Define verification checks that are path-independent where possible. A check like “Redis responds to PING” works on any path that has a running Redis instance. A check like “ElastiCache encryption at rest is enabled” only works on Path 3.

  5. Write explicit output statements. The success, partial, and failure messages should tell a human exactly what was and was not verified without needing to read the full playbook.

The playbook format is deliberately simple. It is YAML that an agent can parse, but it is also readable by a human reviewing what the agent validated. Resist the urge to add complexity. If a validation scenario requires conditionals, loops, or dynamic behavior beyond path selection, it belongs in a script called from a playbook step, not in the playbook format itself.