Setting Up Multi-Environment Infrastructure: Dev, Staging, and Production#

Running a single environment is straightforward. Running three that drift apart silently is where teams lose weeks debugging “it works in dev.” This operational sequence walks through setting up dev, staging, and production environments that stay consistent where it matters and diverge only where intended.

Phase 1 – Environment Strategy#

Step 1: Define Environments#

Each environment serves a distinct purpose:

  • Dev: Rapid iteration. Developers deploy frequently, break things, and recover quickly. Data is disposable. Resources are minimal.
  • Staging: Production mirror. Same Kubernetes version, same network policies, same resource quotas. External services point to staging endpoints. Used for integration testing and pre-release validation.
  • Production: Real users, real data. Changes go through approval gates. Monitoring is comprehensive and alerting reaches on-call engineers.

Step 2: Isolation Model#

Decision point: Separate clusters per environment versus namespaces in a shared cluster.

| Approach | Isolation | Cost | Management Overhead |
| --- | --- | --- | --- |
| Separate clusters | Strong – independent control planes, no blast radius | Higher – three sets of control plane costs, node pools | Higher – three clusters to upgrade, monitor, and patch |
| Namespaces in shared cluster | Weaker – noisy neighbor risk, shared control plane | Lower – single cluster cost | Lower – one cluster to manage |
| Hybrid | Strong for production, moderate for dev/staging | Medium | Medium |

Recommendation: Use namespaces for dev and staging in a shared non-production cluster. Use a separate cluster for production. This gives you cost efficiency for non-production work and strong isolation where it matters most.
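
In the shared non-production cluster, each environment then gets its own namespace so quotas, network policies, and RBAC can be scoped per environment. A minimal sketch, with illustrative names and labels:

# Sketch: namespaces for the shared non-production cluster (names and labels are illustrative)
apiVersion: v1
kind: Namespace
metadata:
  name: my-service-dev
  labels:
    environment: dev
---
apiVersion: v1
kind: Namespace
metadata:
  name: my-service-staging
  labels:
    environment: staging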

Step 3: Define Environment Parity#

What must match production:

  • Kubernetes version (within one minor version)
  • Network policy enforcement (same CNI plugin)
  • Resource quotas and limit ranges (staging should mirror production quotas)
  • Pod security standards
  • Ingress configuration pattern (same controller, different hostnames)

What can differ:

  • Replica count (1 in dev, 2 in staging, 3+ in production)
  • Node count and instance types
  • External service endpoints (staging databases, staging payment processors)
  • Secret values (different passwords, different API keys)
  • TLS certificates (self-signed in dev, staging CA in staging, public CA in production)
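
A cheap guard against parity drift is scripting the comparison. A minimal sketch that checks Kubernetes version skew, assuming kubeconfig contexts named dev, staging, and prod:

# Compare server versions across environments (context names are assumptions)
for ctx in dev staging prod; do
  printf '%s: ' "$ctx"
  kubectl --context "$ctx" version -o json | jq -r '.serverVersion.gitVersion'
done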

Step 4: Output#

Document the decisions from steps 1-3 in an environment architecture document. This becomes the reference for every infrastructure change going forward.

Phase 2 – Infrastructure as Code#

Step 5: Terraform Module Structure#

infrastructure/
  modules/
    vpc/
      main.tf
      variables.tf
      outputs.tf
    kubernetes-cluster/
      main.tf
      variables.tf
      outputs.tf
    database/
      main.tf
      variables.tf
      outputs.tf
  environments/
    dev/
      main.tf
      dev.tfvars
      backend.tf
    staging/
      main.tf
      staging.tfvars
      backend.tf
    prod/
      main.tf
      prod.tfvars
      backend.tf

Modules contain reusable logic. Environment directories compose modules with environment-specific variables.
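
As a sketch of that composition, an environment's main.tf might look like the following. The module input names and the vpc output are assumptions about what the modules declare, not a fixed contract:

# environments/dev/main.tf (sketch)
module "vpc" {
  source      = "../../modules/vpc"
  environment = var.environment
}

module "kubernetes_cluster" {
  source       = "../../modules/kubernetes-cluster"
  cluster_name = var.cluster_name
  node_count   = var.node_count
  node_type    = var.node_type
  subnet_ids   = module.vpc.private_subnet_ids  # assumed module output
}

module "database" {
  source      = "../../modules/database"
  db_instance = var.db_instance
  db_multi_az = var.db_multi_az
  subnet_ids  = module.vpc.private_subnet_ids  # assumed module output
}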

Step 6: Environment Variables#

# environments/dev/dev.tfvars
environment     = "dev"
cluster_name    = "myapp-dev"
node_count      = 2
node_type       = "t3.medium"
db_instance     = "db.t3.small"
db_multi_az     = false
enable_waf      = false

# environments/staging/staging.tfvars
environment     = "staging"
cluster_name    = "myapp-staging"
node_count      = 3
node_type       = "t3.large"
db_instance     = "db.t3.medium"
db_multi_az     = false
enable_waf      = true

# environments/prod/prod.tfvars
environment     = "prod"
cluster_name    = "myapp-prod"
node_count      = 5
node_type       = "t3.xlarge"
db_instance     = "db.r6g.large"
db_multi_az     = true
enable_waf      = true
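
For these .tfvars files to be accepted, each environment's root configuration needs matching variable declarations. A sketch of a variables.tf (not shown in the layout above, so treat it as an assumption):

# environments/dev/variables.tf (sketch)
variable "environment" { type = string }
variable "cluster_name" { type = string }
variable "node_count"   { type = number }
variable "node_type"    { type = string }
variable "db_instance"  { type = string }
variable "db_multi_az"  { type = bool }
variable "enable_waf"   { type = bool }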

Step 7: State Management#

Each environment gets its own state file. Never share state between environments. A bad terraform destroy in dev should have zero possibility of touching production resources.

# environments/dev/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

Step 8: Provisioning#

Apply each environment independently:

cd infrastructure/environments/dev
terraform init
terraform plan -var-file=dev.tfvars -out=plan.out
terraform apply plan.out

Provision in order: VPC/network first, then the Kubernetes cluster, then databases and storage, then DNS records. Terraform handles this ordering automatically through resource dependencies when everything lives in a single root configuration. If the layers are split into separate root configurations with their own state files, use terraform_remote_state data sources or Terragrunt dependencies for cross-stack references.
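
If the network layer does live in its own state file, a downstream configuration can read its outputs through a terraform_remote_state data source. A sketch assuming the S3 backend from Step 7, plus a hypothetical network state key and vpc_id output:

# Sketch: consuming another stack's outputs (state key and output name are assumptions)
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "myorg-terraform-state"
    key    = "network/dev/terraform.tfstate"
    region = "us-east-1"
  }
}

# Referenced elsewhere as:
#   vpc_id = data.terraform_remote_state.network.outputs.vpc_id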

Step 9: Verification#

Run terraform plan for each environment and review the output. Confirm that dev creates smaller instances, staging mirrors production structure with fewer replicas, and production includes multi-AZ databases and WAF rules.
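
A loop makes that review systematic. With -detailed-exitcode, terraform plan exits 0 when nothing would change, so drift or errors stand out. A sketch, assuming terraform init has already run in each directory:

# Plan every environment and flag any that would change
for env in dev staging prod; do
  echo "== $env =="
  ( cd "infrastructure/environments/$env" &&
    terraform plan -var-file="$env.tfvars" -detailed-exitcode ) ||
    echo "$env has pending changes or errors"
done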

Phase 3 – Kubernetes Configuration#

Step 10: Kustomize Overlays#

# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: my-service:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20

# overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-service-dev
resources:
  - ../../base
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: my-service
      spec:
        replicas: 1
        template:
          spec:
            containers:
              - name: my-service
                resources:
                  requests:
                    cpu: 100m
                    memory: 128Mi
                  limits:
                    cpu: 250m
                    memory: 256Mi
                env:
                  - name: LOG_LEVEL
                    value: debug
images:
  - name: my-service
    newName: ghcr.io/myorg/my-service
    newTag: dev-latest

# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-service-prod
resources:
  - ../../base
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: my-service
      spec:
        replicas: 3
        template:
          spec:
            containers:
              - name: my-service
                resources:
                  requests:
                    cpu: 500m
                    memory: 512Mi
                  limits:
                    cpu: "1"
                    memory: 1Gi
                env:
                  - name: LOG_LEVEL
                    value: warn
images:
  - name: my-service
    newName: ghcr.io/myorg/my-service
    newTag: v1.2.3

Step 11: What Varies vs. What Stays Constant#

Keep the following identical across environments: container spec (same Dockerfile, same entrypoint), probe paths and ports, label selectors, volume mount paths, and service port definitions. These define what the application is.

Vary per environment: replica count, resource requests and limits, ingress hostnames, image tags, environment variables (log levels, feature flags), and external service URLs. These define how the application runs.

Step 12: Namespace-Level Controls#

# overlays/dev/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "20"

# overlays/prod/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    limits.cpu: "32"
    limits.memory: 64Gi
    pods: "100"
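
Limit ranges, called out in the parity list in Step 3, pair naturally with these quotas so that containers without explicit requests still receive sane defaults. A minimal sketch for the dev namespace, with illustrative values:

# overlays/dev/limit-range.yaml (sketch – values are illustrative)
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-defaults
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 250m
        memory: 256Mi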

Step 13: Verification#

diff <(kustomize build overlays/dev) <(kustomize build overlays/prod)

The diff should show only expected differences: namespace name, replica count, resource amounts, image tag, environment variables, and ingress hostname. If you see unexpected differences in probe configuration, port numbers, or container commands, the overlays are diverging too much from the base.
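
As an extra check, each rendered overlay can be validated against its target cluster without applying anything, using a server-side dry run. A sketch (the staging overlay and context name are assumptions):

# Validate rendered manifests without applying them
kustomize build overlays/staging | kubectl --context staging apply --dry-run=server -f -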

Phase 4 – Secrets Management#

Step 14: Per-Environment Secret Stores#

Use separate paths in your secret store for each environment. In HashiCorp Vault:

secret/dev/my-service/database-url
secret/staging/my-service/database-url
secret/prod/my-service/database-url

With the Kubernetes External Secrets Operator, each environment references its own path:

# overlays/dev/external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-service-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: my-service-secrets
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: secret/dev/my-service
        property: database-url

Step 15: Secret Strategy Per Environment#

  • Dev: Use local or fake secrets. Database passwords can be simple. TLS uses self-signed certificates. External API keys point to sandbox endpoints. Developers can read secrets for debugging.
  • Staging: Production-like secrets but targeting staging endpoints. Real TLS certificates from a staging CA. Access restricted to CI systems and senior engineers.
  • Production: Real secrets with managed rotation. TLS from public CAs. Access restricted to automated systems. Human access requires break-glass procedures with audit logging.

Step 16: Verification#

Deploy the application in each environment and verify it connects to the correct database, uses the correct API endpoint, and presents the correct TLS certificate. Check kubectl get secret my-service-secrets -n my-service-dev -o yaml to confirm the secret exists (do not print the values in production).
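
To inspect the secret without echoing values, and to confirm the External Secrets Operator actually synced it, something like the following works in any environment (a sketch, assuming jq is installed):

# List only the secret's keys – values are never printed
kubectl get secret my-service-secrets -n my-service-dev -o jsonpath='{.data}' | jq 'keys'

# Confirm the ExternalSecret reports Ready
kubectl get externalsecret my-service-secrets -n my-service-dev \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'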

Phase 5 – Pipeline Integration#

Step 17: Dev Deployment#

Automatic on every push to main. The CI pipeline builds the image and updates the dev overlay's image tag. ArgoCD picks up the change on its next sync – within a few minutes at the default polling interval, or almost immediately if a Git webhook is configured – and deploys it.
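
A sketch of what the dev ArgoCD Application might look like, assuming the deploy-configs repository and path layout used by the promotion workflow in Step 18 (repo URL and paths are illustrative):

# Sketch: dev Application with automated sync
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/deploy-configs
    targetRevision: main
    path: apps/my-service/overlays/dev
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service-dev
  syncPolicy:
    automated:
      prune: true
      selfHeal: true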

Step 18: Staging Promotion#

Triggered automatically on merge to main (if you trust your test suite) or manually. The example below uses a manual workflow_dispatch trigger: the promotion workflow updates the staging overlay's image tag in the deploy-configs repository:

# .github/workflows/promote-staging.yaml
name: Promote to Staging
on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: "Image tag to promote"
        required: true

jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: myorg/deploy-configs
          token: ${{ secrets.CONFIG_REPO_TOKEN }}
      - name: Update staging image tag
        run: |
          cd apps/my-service/overlays/staging
          kustomize edit set image my-service=ghcr.io/myorg/my-service:${{ inputs.image_tag }}
      - name: Commit and push
        run: |
          # Commits made on Actions runners need an explicit git identity
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add .
          git commit -m "Promote my-service ${{ inputs.image_tag }} to staging"
          git push

Step 19: Production Deployment#

Requires a manual approval gate. Options include:

  • GitHub Environment protection rules: Configure the production environment to require approval from specific reviewers.
  • PR-based promotion: Production changes go through a pull request to a production branch, requiring code review before merge.
  • ArgoCD manual sync: Production ArgoCD Application has no automated sync policy. An operator clicks “Sync” in the ArgoCD UI or runs argocd app sync my-service-prod.
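
For the ArgoCD manual-sync option, the production Application simply omits the automated sync policy; a sketch mirroring the dev example above:

# Sketch: prod Application with no automated sync – applied only on manual sync
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/deploy-configs
    targetRevision: main
    path: apps/my-service/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service-prod
  # No syncPolicy.automated block: an operator syncs via the UI or
  #   argocd app sync my-service-prod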

Step 20: Verification#

Run the full promotion flow: push a commit, verify it deploys to dev automatically, trigger staging promotion, verify it deploys to staging, trigger production deployment with approval, verify production runs the expected version.

# Check the running image in each environment
kubectl get pods -n my-service-dev -o jsonpath='{.items[0].spec.containers[0].image}'
kubectl get pods -n my-service-staging -o jsonpath='{.items[0].spec.containers[0].image}'
kubectl get pods -n my-service-prod -o jsonpath='{.items[0].spec.containers[0].image}'

Phase 6 – Observability Per Environment#

Step 21: Dev Observability#

Basic Prometheus and Grafana. Alerting goes to a low-priority Slack channel. Retention is short (7 days). The goal is to catch obvious errors, not to track SLOs.
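
Short retention is usually just a server flag. For a Prometheus started directly (or the equivalent value in its Helm chart), a sketch:

# Sketch: 7-day retention for the dev Prometheus instance
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=7d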

Step 22: Staging Observability#

Full monitoring stack matching production. Same dashboards, same alerting rules, same log aggregation. Alerts go to a development team Slack channel. This catches monitoring gaps before they reach production.

Step 23: Production Observability#

Full stack with SLO tracking. Alerts route to PagerDuty for critical issues, Slack for warnings. Retention matches compliance requirements (30-90 days minimum). Include deployment markers on Grafana dashboards so you can correlate metric changes with deployments.

Step 24: Cross-Environment Comparison#

Create a Grafana dashboard that shows the same metrics across all three environments side by side. This makes it easy to spot when staging behavior diverges from production, or when a dev deployment causes unusual resource consumption:

# Prometheus recording rule for cross-environment comparison
groups:
  - name: cross-environment
    rules:
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, environment))

Summary#

The multi-environment setup follows a clear progression: decide on isolation strategy, provision infrastructure with Terraform, configure Kubernetes resources with Kustomize overlays, manage secrets per environment, wire up promotion pipelines with gates between stages, and instrument each environment with appropriate observability. Every phase has a verification step that confirms the environment works correctly before proceeding to the next.