Sandbox to Production#

An agent that produces infrastructure deliverables works in a sandbox. It does not touch production. It does not reach into someone else’s cluster, database, or cloud account. It works in an isolated environment, tests its work, captures the results, and hands the human a verified deliverable they can execute on their own infrastructure.

This is not a limitation – it is a design choice. The output is always a deliverable, never a direct action on someone else’s systems. This boundary is what makes the approach safe enough for production infrastructure work and trustworthy enough for enterprise change management.

This article covers the complete workflow from sandbox testing to production deployment: choosing the right sandbox, running the test-validate-document cycle, adapting sandbox results for production, accounting for sandbox limitations, and executing the handoff.

Why Sandbox Testing Matters#

Infrastructure has a surface area problem. A Helm chart that works on a local kind cluster might fail on EKS because of IAM roles, fail on AKS because of network policy defaults, or fail on GKE because of Workload Identity requirements. The only way to discover these failures is to test against a realistic environment.

Without sandbox testing, an agent produces deliverables based on documentation and training data. Documentation is often incomplete, outdated, or wrong for specific versions. Training data represents the state of infrastructure at training time, not the state of the tools and APIs today. A Kubernetes API removed in 1.29 will appear perfectly valid to an agent trained on 1.28-era content.

Sandbox testing catches three categories of problems:

Cloud-specific gotchas. AWS EKS requires the AWS Load Balancer Controller for Ingress resources. AKS uses a different annotation scheme for internal load balancers. GKE handles node auto-provisioning differently. These are not documented in generic Kubernetes guides – they only surface when you test on the actual cloud.

Version-specific behavior changes. PostgreSQL 15 revoked the default CREATE permission on the public schema. Kubernetes 1.25 removed PodSecurityPolicy. Helm 3.13 changed OCI registry behavior. These changes are well-documented individually but easy to miss when an agent is composing multiple tools at specific versions.

Integration failures. Individual components work fine in isolation. The database starts, the application starts, the ingress controller starts. But the application cannot reach the database because of a network policy, or the ingress controller cannot route to the application because the service selector is wrong, or the TLS certificate does not match the hostname. Integration failures only appear when everything runs together.

Sandbox Environment Selection#

Not all sandbox environments are equal. The right choice depends on what the agent is testing and how long the testing takes.

Ephemeral Sandboxes#

- **Lifetime:** Minutes. Destroyed when the task completes.
- **Use case:** Quick validation of a single operation.

An ephemeral sandbox spins up, runs one test, and disappears. Use it when the agent needs to verify that a Helm chart installs cleanly, that a Docker image builds successfully, or that a Terraform plan generates the expected resources.

Agent task: "Validate this Helm chart installs on Kubernetes 1.29"

1. Spin up ephemeral sandbox with k8s 1.29
2. helm install --dry-run --debug (syntax validation)
3. helm install (actual deployment)
4. kubectl wait for pods to be ready
5. Capture results
6. Sandbox destroyed

Ephemeral sandboxes are cheap and fast. They are the default choice for single-step validation.
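
A minimal sketch of this flow, assuming a local kind install; the release name `myapp`, the chart path `./chart`, and the node image tag are illustrative, not taken from a real deliverable.

# Ephemeral validation sketch -- the cluster exists only for the duration of the script
set -euo pipefail

kind create cluster --name ephemeral-test --image kindest/node:v1.29.2

# Syntax and template validation without creating resources
helm install myapp ./chart --dry-run --debug

# Actual deployment, then positive confirmation that pods become ready
helm install myapp ./chart --wait --timeout 120s
kubectl wait --for=condition=Ready pods -l app.kubernetes.io/name=myapp --timeout=120s

# Capture results, then destroy the sandbox
kubectl get pods -o wide > sandbox-results.txt
kind delete cluster --name ephemeral-test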

Session Sandboxes#

- **Lifetime:** Hours. Persists across multiple agent interactions within a session.
- **Use case:** Multi-step debugging, iterative development, integration testing.

A session sandbox stays alive while the agent works through a complex task. The agent deploys a database, tests the connection, deploys an application, debugs a configuration issue, fixes it, and re-tests – all in the same environment. State accumulates across steps.

Agent task: "Set up the complete notification stack (app + database + cache + queue)"

1. Spin up session sandbox with k8s 1.29 + pre-installed tools
2. Deploy PostgreSQL -> verify -> fix permissions issue -> verify again
3. Deploy Redis -> verify
4. Deploy RabbitMQ -> verify
5. Deploy application -> verify connections to all dependencies
6. Debug: application cannot reach RabbitMQ (network policy)
7. Fix network policy -> verify
8. Run integration test suite
9. Capture complete runbook from session
10. Sandbox destroyed when session ends

Session sandboxes cost more than ephemeral ones because they run longer, but they enable the iterative problem-solving that produces the best runbooks. The errors the agent encounters and resolves during the session become documented solutions in the final deliverable.
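
As a hedged illustration of steps 6 and 7, the connectivity check might look like the snippet below; the deployment name, namespace, and the assumption that the application image ships `nc` are all illustrative.

# Step 6: reproduce the failure from inside the application pod
kubectl exec -n myns deploy/notification-app -- nc -zv -w 2 rabbitmq 5672
# fails while the default-deny policy blocks the traffic

# Step 7: apply a NetworkPolicy allowing the app to reach rabbitmq:5672, then re-check
kubectl exec -n myns deploy/notification-app -- nc -zv -w 2 rabbitmq 5672
# succeeds -- the fix and this check both go into the runbook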

Persistent Environments#

- **Lifetime:** Days to weeks. Survives across sessions.
- **Use case:** Ongoing staging, long-running tests, performance benchmarks.

A persistent environment acts as a staging ground that multiple sessions can use. An agent deploys version 1.0 in one session, then deploys version 1.1 in a later session and tests the upgrade path. The environment retains state between sessions.

Agent task: "Test the upgrade path from v2.0 to v2.1"

Session 1 (day 1):
  1. Deploy v2.0 with production-like data
  2. Run smoke tests -> all pass
  3. Session ends, environment persists

Session 2 (day 3):
  1. Resume existing environment
  2. Apply v2.1 upgrade
  3. Run migration verification
  4. Test backward compatibility
  5. Capture upgrade runbook
  6. Environment destroyed or kept for further testing

Persistent environments are the most expensive option but the only choice for testing upgrades, data migrations, and long-running performance scenarios.
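
A hedged sketch of the second session against the persistent environment; the context name, release name, chart reference, namespace, and migration Job name are all placeholders.

# Session 2: resume the existing environment rather than creating a new one
kubectl config use-context persistent-staging

# Apply the v2.1 upgrade and wait for it to settle
helm upgrade myapp myrepo/myapp --version 2.1.0 -n staging --wait --timeout 10m

# Verify the schema migration Job shipped with the release completed
kubectl wait --for=condition=complete job/myapp-migrate -n staging --timeout=10m

# Run chart-defined smoke tests (if the chart provides them) for backward compatibility
helm test myapp -n staging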

Choosing the Right Environment#

| Scenario | Environment Type | Why |
| --- | --- | --- |
| Validate a single manifest | Ephemeral | One step, no state needed |
| Test a Helm chart with dependencies | Session | Multiple steps, dependency order matters |
| Debug a deployment failure | Session | Iterative, needs state from previous attempts |
| Test an upgrade from v1 to v2 | Persistent | Needs v1 state before starting v2 upgrade |
| Performance benchmark | Persistent | Needs sustained load over time |
| Quick “does this image build?” | Ephemeral | Single command, result is pass/fail |

The Test-Validate-Document Cycle#

The core workflow inside a sandbox is a cycle: deploy something, validate it works, document the result. This cycle repeats for each step in the runbook.

Deploy#

Execute the step exactly as it will appear in the runbook. Use the same commands, the same flags, the same values. If the runbook says `helm install db bitnami/postgresql --set auth.database=myapp`, that is the exact command the agent runs in the sandbox.

Do not use shortcuts in the sandbox that the human will not have. Do not skip `--wait` because you know the sandbox is fast. Do not hardcode sandbox-specific values that need to change for production. The sandbox execution should be as close to the production execution as possible.

Validate#

After each deployment step, run explicit verification. Verification is not “it did not error.” Verification is positive confirmation that the expected result occurred.

# Weak verification: command succeeded
helm install db bitnami/postgresql --namespace myns
# Exit code 0 -- but is the pod actually running?

# Strong verification: positive confirmation
helm install db bitnami/postgresql --namespace myns --wait --timeout 120s
kubectl get pods -n myns -l app.kubernetes.io/name=postgresql -o jsonpath='{.items[0].status.phase}'
# "Running"

kubectl exec -it db-postgresql-0 -n myns -- psql -U postgres -c "SELECT 1;"
# 1 -- database is accepting connections

kubectl exec -it db-postgresql-0 -n myns -- psql -U myapp -d myapp -c "CREATE TABLE _test (id int); DROP TABLE _test;"
# No error -- application user has DDL permissions

The validation commands become part of the runbook. The human running the runbook in production will execute the same checks.

Document#

Record what happened: the command, the output, the time it took, and any issues encountered. If the step succeeded, record the expected output so the human knows what to look for. If the step failed, record the error, the root cause, and the fix.

### Step 2: Deploy PostgreSQL

**Command:**
helm install db bitnami/postgresql -n myns -f postgres-values.yaml --wait --timeout 120s

**Result:** Success (47 seconds)

**Verification:**
- Pod status: Running (1/1 Ready)
- Connection test: psql SELECT 1 returned successfully
- DDL test: CREATE TABLE / DROP TABLE succeeded (schema ownership is correct)

**Notes:** The --wait flag is important. Without it, the next step (application
deployment) may start before PostgreSQL is ready, causing connection refused errors.
The 120s timeout accommodates slow storage provisioning on first boot.

Adapting Sandbox Results to Production#

A sandbox is not production. Several things change between the two, and the runbook must account for these differences explicitly.

What Changes#

Endpoints. The sandbox database is at db-postgresql.myns.svc.cluster.local. Production might use an external RDS endpoint like mydb.abc123.us-east-1.rds.amazonaws.com. The runbook should use variables or placeholders for endpoints, not hardcoded sandbox values.
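
A hedged illustration of that pattern; the variable name, chart path, and value key are placeholders.

# Keep the endpoint as a variable so the sandbox value never leaks into production
DB_HOST="${DB_HOST:-db-postgresql.myns.svc.cluster.local}"   # sandbox default
# Production: export DB_HOST=mydb.abc123.us-east-1.rds.amazonaws.com
helm upgrade --install myapp ./chart --set db.host="$DB_HOST"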

Credentials. The sandbox uses throwaway passwords. Production uses credentials from a secret manager. The runbook should reference the credential source, not embed passwords.

# Sandbox (for testing -- NOT for production use)
auth:
  password: "sandbox-testing-only"

# Production (reference secret manager)
auth:
  existingSecret: "myapp-db-credentials"
  secretKeys:
    userPasswordKey: "password"

Scale. The sandbox runs one replica with minimal resources. Production needs multiple replicas, appropriate resource requests, pod disruption budgets, and autoscaling. The runbook should include a “production adjustments” section.

# Sandbox
replicaCount: 1
resources:
  requests:
    cpu: 100m
    memory: 256Mi

# Production adjustments
replicaCount: 3
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 4Gi
podDisruptionBudget:
  minAvailable: 2

Networking. The sandbox may have permissive network policies or no policies at all. Production likely has deny-all defaults with explicit allow rules. The runbook should include the NetworkPolicy manifests tested in the sandbox, with notes about which ports and protocols are required.
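
A hedged example of the kind of manifest the runbook should carry, assuming a deny-all default and PostgreSQL on its standard port; the labels, namespace, and policy name are illustrative.

# NetworkPolicy tested in the sandbox: allow the application to reach PostgreSQL on 5432
kubectl apply -n myns -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-myapp-to-postgresql
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: postgresql
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: myapp
      ports:
        - protocol: TCP
          port: 5432
EOF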

TLS and certificates. The sandbox may use self-signed certificates or no TLS at all. Production needs valid certificates from a trusted CA. The runbook should include cert-manager ClusterIssuer references or manual certificate installation steps.
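
As a hedged example, the cert-manager path might look like the manifest below, assuming cert-manager is installed and a ClusterIssuer named letsencrypt-prod already exists; the hostname, secret name, and service name are placeholders.

# Ingress with a cert-manager-managed certificate (issuer and hostname are illustrative)
kubectl apply -n myns -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts: ["app.example.com"]
      secretName: myapp-tls        # created and renewed by cert-manager
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
EOF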

What Stays the Same#

The operational steps stay the same. The order of operations stays the same. The verification checks stay the same (with adapted endpoints). The rollback procedures stay the same. The known issues and their resolutions stay the same.

This is the value of sandbox testing: the agent proves that the operational sequence works, and the human adapts the configuration values for their environment. The hard part – getting the sequence right, discovering dependency ordering, hitting and resolving errors – is done in the sandbox.

The Adaptation Checklist#

Every runbook should include an explicit section listing what the human must change for production:

## Production Adaptation Checklist

Before executing this runbook on production, update the following:

- [ ] **Namespace:** Change from `sandbox-test` to your target namespace
- [ ] **Credentials:** Replace sandbox passwords with production credentials
      from your secret manager (Vault, AWS Secrets Manager, etc.)
- [ ] **Database endpoint:** If using managed database (RDS, Cloud SQL),
      update connection strings from in-cluster service to external endpoint
- [ ] **Resource requests:** Increase CPU/memory to production values
      (sandbox: 100m/256Mi, suggested production: 500m/1Gi minimum)
- [ ] **Replicas:** Increase from 1 to your HA requirements (minimum 2
      for stateless services, 3 for stateful)
- [ ] **Storage class:** Verify the StorageClass name matches your cluster
      (sandbox used `standard`, production may use `gp3`, `premium-rwo`, etc.)
- [ ] **Ingress hostname:** Update from sandbox hostname to production domain
- [ ] **TLS:** Configure production certificate (cert-manager issuer or
      manual certificate)
- [ ] **Network policies:** Verify the included NetworkPolicy manifests
      match your cluster's policy model

Handling Sandbox Limitations#

Sandboxes cannot perfectly replicate production. Acknowledging this honestly in the runbook is better than pretending the sandbox test was equivalent to a production test.

No Real Traffic#

A sandbox does not receive production traffic patterns. The runbook can verify that an application starts and serves health checks, but it cannot verify behavior under load. The runbook should note this gap:

## Limitations of Sandbox Testing

**Load testing not included.** This runbook was validated with synthetic
requests only. Before exposing to production traffic, consider:
- Running a load test with your expected traffic profile
- Verifying that HPA (Horizontal Pod Autoscaler) triggers at the
  configured thresholds
- Testing database connection pool exhaustion under concurrent load

Limited Scale#

A sandbox typically runs on a single node with limited resources. Multi-node behaviors – pod anti-affinity, zone-aware scheduling, node failures – cannot be tested. The runbook should flag these:

**Single-node sandbox.** The following were not tested in this sandbox:
- Pod anti-affinity rules (require multiple nodes)
- Zone-aware topology spread constraints
- Node failure recovery and pod rescheduling
- PodDisruptionBudget behavior during rolling updates

No Production Data#

A sandbox uses synthetic or empty data. Schema migrations, data volume effects on query performance, and backup/restore procedures with real data cannot be validated. The runbook should recommend a staged rollout:

**No production data tested.** Recommended validation sequence on production:
1. Deploy to a staging namespace with a copy of production data
2. Run the schema migration and verify data integrity
3. Test the application against real data patterns
4. Only then deploy to the production namespace

Cloud Service Differences#

Even cloud-native sandboxes have differences from production. A sandbox EKS cluster uses a new, clean AWS account. Production EKS sits in an account with years of accumulated IAM policies, VPC configurations, and security group rules. The runbook should note where cloud account state might affect behavior:

**Clean cloud account.** The sandbox used a fresh AWS account with no
pre-existing resources. In your production account, verify:
- VPC CIDR ranges do not conflict with the proposed subnet allocation
- Security groups allow the required port ranges
- IAM role trust policies reference the correct OIDC provider
- Service quotas are sufficient (EKS cluster limit, EC2 instance limits)
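
Hedged examples of what those checks can look like with the standard AWS CLI; the query expressions and quota name are illustrative and should be adapted to the account being reviewed.

# Check existing VPC CIDR ranges for overlap with the proposed allocation
aws ec2 describe-vpcs --query "Vpcs[].CidrBlock" --output text

# Confirm the expected OIDC provider is registered for IAM roles for service accounts
aws iam list-open-id-connect-providers

# Check the EKS cluster quota before creating another cluster
aws service-quotas list-service-quotas --service-code eks \
  --query "Quotas[?QuotaName=='Clusters'].Value" --output text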

The Handoff Point#

The handoff is the most important moment in the workflow. The agent has tested, documented, and packaged the deliverable. Now it transfers responsibility to the human.

The handoff is a clear boundary: the agent produces the deliverable, the human (or their own automation) executes it on their infrastructure. The agent never touches their systems. This boundary exists for practical reasons – liability, security, and trust – but it also produces better outcomes. A human reviewing a runbook before executing it catches environment-specific issues that no sandbox can replicate.

What the Handoff Package Contains#

A complete handoff package includes:

The runbook itself. Markdown for human review, plus structured YAML/JSON for machine consumption.

All configuration files. Helm values, Kubernetes manifests, Terraform files, shell scripts – everything needed to execute the runbook. Not code snippets in the runbook text, but actual files ready to apply.

Sandbox test results. Evidence that the runbook was executed successfully. Which steps passed, what errors were encountered, how they were resolved. For enterprise customers, this evidence supports change advisory board (CAB) review.

The adaptation checklist. An explicit list of what must change for the human’s specific environment.

Version metadata. Exactly which versions of every tool, chart, image, and API were tested. When the human’s environment differs, they know which runbook steps to scrutinize.

handoff-package/
  README.md                    # Human-readable runbook
  runbook.yaml                 # Machine-readable runbook
  config/
    postgres-values.yaml       # Helm values for PostgreSQL
    app-values.yaml            # Helm values for application
    network-policy.yaml        # NetworkPolicy manifest
    ingress.yaml               # Ingress manifest
  scripts/
    deploy.sh                  # Executable deployment script
    rollback.sh                # Executable rollback script
    verify.sh                  # Verification checks
  test-results/
    sandbox-log.txt            # Full sandbox execution log
    test-summary.yaml          # Structured test results
  adaptation-checklist.md      # What to change for production
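
As a hedged illustration, `scripts/verify.sh` could look like the sketch below; in practice it is assembled from the verification commands captured during the sandbox run, and the namespace, labels, and health endpoint here are placeholders.

#!/usr/bin/env bash
# Illustrative verification script -- namespace, labels, and endpoints are placeholders
set -euo pipefail
NS="${1:-myns}"

# PostgreSQL pod is running
kubectl get pods -n "$NS" -l app.kubernetes.io/name=postgresql \
  -o jsonpath='{.items[0].status.phase}' | grep -q '^Running$'

# Database accepts connections
kubectl exec -n "$NS" db-postgresql-0 -- psql -U postgres -c "SELECT 1;" >/dev/null

# Application answers its health endpoint (assumes the image ships wget)
kubectl exec -n "$NS" deploy/myapp -- wget -q -O- http://localhost:8080/healthz >/dev/null

echo "All verification checks passed."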

The Enterprise Handoff#

For enterprise customers, the handoff package serves a specific purpose: it is the evidence artifact for change management. An enterprise with a change advisory board (CAB) needs to show that the proposed change was tested, that the test passed, that rollback procedures exist, and that the risks are documented.

The sandbox-to-runbook model fits this workflow precisely. The agent tests in the sandbox (analogous to a staging environment). The runbook is the change request document. The sandbox test results are the evidence. The CAB reviews the package, approves the change, and the operations team executes the runbook on production.

This is why the boundary matters. “We tested this on a real EKS cluster in an isolated sandbox and here are the results” passes procurement and security review. “We will run this directly on your production cluster” does not.

After the Handoff#

The agent’s job is done when the handoff package is delivered. What happens next is the human’s responsibility:

  1. Human reviews the runbook. They read through each step, check the adaptation checklist, verify that prerequisites match their environment.
  2. Human adapts configuration. They update credentials, endpoints, resource requests, and storage classes for their specific environment.
  3. Human executes on their infrastructure. They follow the runbook step by step, using the verification checks to confirm each step succeeded.
  4. If something fails, the human has the rollback procedures. Every step in the runbook includes rollback instructions. The human can revert to a known-good state.

If the human encounters an issue the runbook did not anticipate, that is a feedback signal. The ideal workflow feeds this information back to improve future runbooks: the gap that was not caught in the sandbox becomes a new item on the pre-flight checklist, and the next runbook for a similar deployment will check for it.

The Complete Workflow in Summary#

1. Human requests: "Deploy X on my infrastructure"

2. Agent gathers requirements
   - Target environment (cloud, k8s version, architecture)
   - Existing infrastructure
   - Constraints and success criteria

3. Agent runs pre-flight checklist
   - Detects knowledge gaps
   - Verifies assumptions
   - Adjusts plan based on findings

4. Agent selects sandbox type
   - Ephemeral for simple validation
   - Session for multi-step deployment
   - Persistent for upgrade testing

5. Agent executes in sandbox
   - Deploys step by step
   - Validates after each step
   - Encounters and resolves errors
   - Documents everything

6. Agent packages deliverable
   - Runbook (human + machine readable)
   - Configuration files
   - Test results
   - Adaptation checklist
   - Version metadata

7. Agent hands off to human
   - Human reviews runbook
   - Human adapts for their environment
   - Human executes on their infrastructure
   - Agent never touches their systems

Each phase builds on the previous one. Requirements inform the plan. The plan is executed in the sandbox. The sandbox execution produces the runbook. The runbook is adapted for production. The human executes the adapted runbook. At no point does the agent act directly on production infrastructure.

This is the sandbox-to-production workflow. It is longer than “just run this command.” It is also the reason the deliverable works when the human actually uses it.