Setting Up Multi-Environment Infrastructure: Dev, Staging, and Production#
Running a single environment is straightforward. Running three that drift apart silently is where teams lose weeks debugging “it works in dev.” This operational sequence walks through setting up dev, staging, and production environments that stay consistent where it matters and diverge only where intended.
Phase 1 – Environment Strategy#
Step 1: Define Environments#
Each environment serves a distinct purpose:
- Dev: Rapid iteration. Developers deploy frequently, break things, and recover quickly. Data is disposable. Resources are minimal.
- Staging: Production mirror. Same Kubernetes version, same network policies, same resource quotas. External services point to staging endpoints. Used for integration testing and pre-release validation.
- Production: Real users, real data. Changes go through approval gates. Monitoring is comprehensive and alerting reaches on-call engineers.
Step 2: Isolation Model#
Decision point: Separate clusters per environment versus namespaces in a shared cluster.
| Approach | Isolation | Cost | Management Overhead |
|---|---|---|---|
| Separate clusters | Strong – independent control planes, no blast radius | Higher – three sets of control plane costs, node pools | Higher – three clusters to upgrade, monitor, and patch |
| Namespaces in shared cluster | Weaker – noisy neighbor risk, shared control plane | Lower – single cluster cost | Lower – one cluster to manage |
| Hybrid | Strong for production, moderate for dev/staging | Medium | Medium |
Recommendation: Use namespaces for dev and staging in a shared non-production cluster. Use a separate cluster for production. This gives you cost efficiency for non-production work and strong isolation where it matters most.
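As a minimal sketch of the namespace layout in the shared non-production cluster (names and labels are illustrative, not prescribed):

```yaml
# Hypothetical per-environment namespaces in the shared non-prod cluster.
# The environment label gives quotas and network policies a selector.
apiVersion: v1
kind: Namespace
metadata:
  name: my-service-dev
  labels:
    environment: dev
---
apiVersion: v1
kind: Namespace
metadata:
  name: my-service-staging
  labels:
    environment: staging
```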
Step 3: Define Environment Parity#
What must match production:
- Kubernetes version (within one minor version; a quick check appears after these lists)
- Network policy enforcement (same CNI plugin)
- Resource quotas and limit ranges (staging should mirror production quotas)
- Pod security standards
- Ingress configuration pattern (same controller, different hostnames)
What can differ:
- Replica count (1 in dev, 2 in staging, 3+ in production)
- Node count and instance types
- External service endpoints (staging databases, staging payment processors)
- Secret values (different passwords, different API keys)
- TLS certificates (self-signed in dev, staging CA in staging, public CA in production)
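A quick spot check for the Kubernetes version rule above (a hedged sketch; the kubeconfig context names are assumptions) compares API server versions across clusters:

```bash
# Compare Kubernetes API server versions across environments.
# Context names are illustrative; substitute your own.
for ctx in myapp-nonprod myapp-prod; do
  echo -n "$ctx: "
  kubectl --context "$ctx" version -o json | jq -r '.serverVersion.gitVersion'
done
```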
Step 4: Output#
Document the decisions from steps 1-3 in an environment architecture document. This becomes the reference for every infrastructure change going forward.
Phase 2 – Infrastructure as Code#
Step 5: Terraform Module Structure#
```
infrastructure/
  modules/
    vpc/
      main.tf
      variables.tf
      outputs.tf
    kubernetes-cluster/
      main.tf
      variables.tf
      outputs.tf
    database/
      main.tf
      variables.tf
      outputs.tf
  environments/
    dev/
      main.tf
      dev.tfvars
      backend.tf
    staging/
      main.tf
      staging.tfvars
      backend.tf
    prod/
      main.tf
      prod.tfvars
      backend.tf
```

Modules contain reusable logic. Environment directories compose modules with environment-specific variables.
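As a hedged sketch of how an environment directory composes the modules (the module input names and the `vpc_id` output are assumptions, not fixed by the tree above):

```hcl
# environments/dev/main.tf (illustrative composition)
# Variable declarations for the values in dev.tfvars would live
# alongside this file.

module "vpc" {
  source      = "../../modules/vpc"
  environment = var.environment
}

module "cluster" {
  source     = "../../modules/kubernetes-cluster"
  name       = var.cluster_name
  node_count = var.node_count
  node_type  = var.node_type
  # Assumes the vpc module exports a vpc_id output.
  vpc_id     = module.vpc.vpc_id
}

module "database" {
  source      = "../../modules/database"
  instance    = var.db_instance
  multi_az    = var.db_multi_az
  environment = var.environment
}
```

The same main.tf shape repeats in staging/ and prod/; only the tfvars differ.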
Step 6: Environment Variables#
```hcl
# environments/dev/dev.tfvars
environment  = "dev"
cluster_name = "myapp-dev"
node_count   = 2
node_type    = "t3.medium"
db_instance  = "db.t3.small"
db_multi_az  = false
enable_waf   = false
```

```hcl
# environments/staging/staging.tfvars
environment  = "staging"
cluster_name = "myapp-staging"
node_count   = 3
node_type    = "t3.large"
db_instance  = "db.t3.medium"
db_multi_az  = false
enable_waf   = true
```

```hcl
# environments/prod/prod.tfvars
environment  = "prod"
cluster_name = "myapp-prod"
node_count   = 5
node_type    = "t3.xlarge"
db_instance  = "db.r6g.large"
db_multi_az  = true
enable_waf   = true
```

Step 7: State Management#
Each environment gets its own state file. Never share state between environments. A bad `terraform destroy` in dev should have zero possibility of touching production resources.
```hcl
# environments/dev/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

```hcl
# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

Step 8: Provisioning#
Apply each environment independently:
```bash
cd infrastructure/environments/dev
terraform init
terraform plan -var-file=dev.tfvars -out=plan.out
terraform apply plan.out
```

Provision in order: VPC/network first, then the Kubernetes cluster, then databases and storage, then DNS records. Terraform handles this ordering through resource dependencies if everything is in the same configuration. If split across modules, use `terraform_remote_state` data sources or Terragrunt for cross-module references.
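If the cluster and database do live in separate configurations, a hedged sketch of the `terraform_remote_state` approach looks like this (the state key and the `cluster_endpoint` output name are assumptions):

```hcl
# Read outputs from the cluster configuration's state.
data "terraform_remote_state" "cluster" {
  backend = "s3"
  config = {
    bucket = "myorg-terraform-state"
    key    = "dev/cluster/terraform.tfstate" # illustrative split-state key
    region = "us-east-1"
  }
}

# Use an output the cluster configuration is assumed to export.
locals {
  cluster_endpoint = data.terraform_remote_state.cluster.outputs.cluster_endpoint
}
```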
Step 9: Verification#
Run terraform plan for each environment and review the output. Confirm that dev creates smaller instances, staging mirrors production structure with fewer replicas, and production includes multi-AZ databases and WAF rules.
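A small loop keeps this check repeatable (paths follow the Step 5 layout):

```bash
# Generate and save a plan for each environment for review.
for env in dev staging prod; do
  ( cd "infrastructure/environments/$env" &&
    terraform init -input=false &&
    terraform plan -var-file="$env.tfvars" -out=plan.out )
done
```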
Phase 3 – Kubernetes Configuration#
Step 10: Kustomize Overlays#
```yaml
# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: my-service:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```

```yaml
# overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-service-dev
resources:
  - ../../base
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: my-service
      spec:
        replicas: 1
        template:
          spec:
            containers:
              - name: my-service
                resources:
                  requests:
                    cpu: 100m
                    memory: 128Mi
                  limits:
                    cpu: 250m
                    memory: 256Mi
                env:
                  - name: LOG_LEVEL
                    value: debug
images:
  - name: my-service
    newName: ghcr.io/myorg/my-service
    newTag: dev-latest
```
```yaml
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-service-prod
resources:
  - ../../base
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: my-service
      spec:
        replicas: 3
        template:
          spec:
            containers:
              - name: my-service
                resources:
                  requests:
                    cpu: 500m
                    memory: 512Mi
                  limits:
                    cpu: "1"
                    memory: 1Gi
                env:
                  - name: LOG_LEVEL
                    value: warn
images:
  - name: my-service
    newName: ghcr.io/myorg/my-service
    newTag: v1.2.3
```

Step 11: What Varies vs. What Stays Constant#
Keep the following identical across environments: container spec (same Dockerfile, same entrypoint), probe paths and ports, label selectors, volume mount paths, and service port definitions. These define what the application is.
Vary per environment: replica count, resource requests and limits, ingress hostnames, image tags, environment variables (log levels, feature flags), and external service URLs. These define how the application runs.
Step 12: Namespace-Level Controls#
```yaml
# overlays/dev/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "20"
```

```yaml
# overlays/prod/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 32Gi
    limits.cpu: "32"
    limits.memory: 64Gi
    pods: "100"
```

Step 13: Verification#
```bash
diff <(kustomize build overlays/dev) <(kustomize build overlays/prod)
```

The diff should show only expected differences: namespace name, replica count, resource amounts, image tag, environment variables, and ingress hostname. If you see unexpected differences in probe configuration, port numbers, or container commands, the overlays are diverging too much from the base.
Phase 4 – Secrets Management#
Step 14: Per-Environment Secret Stores#
Use separate paths in your secret store for each environment. In HashiCorp Vault:
```
secret/dev/my-service/database-url
secret/staging/my-service/database-url
secret/prod/my-service/database-url
```

With the Kubernetes External Secrets Operator, each environment references its own path:
```yaml
# overlays/dev/external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-service-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: my-service-secrets
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: secret/dev/my-service
        property: database-url
```

Step 15: Secret Strategy Per Environment#
- Dev: Use local or fake secrets. Database passwords can be simple. TLS uses self-signed certificates. External API keys point to sandbox endpoints. Developers can read secrets for debugging (a sample Vault policy appears after this list).
- Staging: Production-like secrets but targeting staging endpoints. Real TLS certificates from a staging CA. Access restricted to CI systems and senior engineers.
- Production: Real secrets with managed rotation. TLS from public CAs. Access restricted to automated systems. Human access requires break-glass procedures with audit logging.
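A hedged sketch of a Vault policy enforcing the dev access boundary described above (path syntax assumes KV version 1, matching the paths in Step 14):

```hcl
# Hypothetical developer policy: read-only access to dev secrets,
# with no grants for staging or prod paths.
path "secret/dev/*" {
  capabilities = ["read", "list"]
}
```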
Step 16: Verification#
Deploy the application in each environment and verify it connects to the correct database, uses the correct API endpoint, and presents the correct TLS certificate. Check `kubectl get secret my-service-secrets -n my-service-dev -o yaml` to confirm the secret exists (do not print the values in production).
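To confirm which keys a secret carries without printing values, a safer check for production, a go-template works (a sketch; adjust the namespace as needed):

```bash
# Print secret key names only; the values stay unprinted.
kubectl get secret my-service-secrets -n my-service-prod \
  -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
```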
Phase 5 – Pipeline Integration#
Step 17: Dev Deployment#
Automatic on every push to main. The CI pipeline builds the image and updates the dev overlay's image tag. ArgoCD picks up the change and deploys it: within seconds if a webhook notifies it, otherwise within its polling interval (three minutes by default).
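A minimal sketch of the dev ArgoCD Application with automated sync (the repo URL and path are assumptions, kept consistent with the workflow in Step 18):

```yaml
# Hypothetical ArgoCD Application for dev: automated sync, self-healing.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/deploy-configs # assumed config repo
    targetRevision: main
    path: apps/my-service/overlays/dev
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service-dev
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```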
Step 18: Staging Promotion#
Triggered automatically on merge to main (if you trust your test suite) or manually. A staging promotion workflow updates the staging overlay’s image tag:
```yaml
# .github/workflows/promote-staging.yaml
name: Promote to Staging
on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: "Image tag to promote"
        required: true
jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: myorg/deploy-configs
          token: ${{ secrets.CONFIG_REPO_TOKEN }}
      - name: Update staging image tag
        run: |
          cd apps/my-service/overlays/staging
          kustomize edit set image my-service=ghcr.io/myorg/my-service:${{ inputs.image_tag }}
      - name: Commit and push
        run: |
          # A commit identity is required on CI runners; names are illustrative.
          git config user.name "promotion-bot"
          git config user.email "bot@example.com"
          git add .
          git commit -m "Promote my-service ${{ inputs.image_tag }} to staging"
          git push
```

Step 19: Production Deployment#
Requires a manual approval gate. Options include:
- GitHub Environment protection rules: Configure the `production` environment to require approval from specific reviewers.
- PR-based promotion: Production changes go through a pull request to a `production` branch, requiring code review before merge.
- ArgoCD manual sync: The production ArgoCD Application has no `automated` sync policy. An operator clicks “Sync” in the ArgoCD UI or runs `argocd app sync my-service-prod` (see the sketch after this list).
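For the ArgoCD option, a hedged sketch of the production Application; the only functional difference from the dev sketch in Step 17 is the missing automated sync policy:

```yaml
# Hypothetical production Application: no syncPolicy.automated block,
# so nothing deploys without an explicit manual sync.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/deploy-configs # assumed config repo
    targetRevision: main
    path: apps/my-service/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service-prod
```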
Step 20: Verification#
Run the full promotion flow: push a commit, verify it deploys to dev automatically, trigger staging promotion, verify it deploys to staging, trigger production deployment with approval, verify production runs the expected version.
```bash
# Check the running image in each environment
kubectl get pods -n my-service-dev -o jsonpath='{.items[0].spec.containers[0].image}'
kubectl get pods -n my-service-staging -o jsonpath='{.items[0].spec.containers[0].image}'
kubectl get pods -n my-service-prod -o jsonpath='{.items[0].spec.containers[0].image}'
```

Phase 6 – Observability Per Environment#
Step 21: Dev Observability#
Basic Prometheus and Grafana. Alerting goes to a low-priority Slack channel. Retention is short (7 days). The goal is to catch obvious errors, not to track SLOs.
Step 22: Staging Observability#
Full monitoring stack matching production. Same dashboards, same alerting rules, same log aggregation. Alerts go to a development team Slack channel. This catches monitoring gaps before they reach production.
Step 23: Production Observability#
Full stack with SLO tracking. Alerts route to PagerDuty for critical issues, Slack for warnings. Retention matches compliance requirements (30-90 days minimum). Include deployment markers on Grafana dashboards so you can correlate metric changes with deployments.
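One way to emit deployment markers is to post an annotation from the deploy job via Grafana's annotations API (a sketch assuming the pipeline has a Grafana API token; the variable names are illustrative):

```bash
# Post a deployment marker via Grafana's annotations API.
# GRAFANA_URL, GRAFANA_TOKEN, and IMAGE_TAG are assumed pipeline variables.
curl -s -X POST "$GRAFANA_URL/api/annotations" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"tags\":[\"deployment\",\"my-service\",\"prod\"],\"text\":\"Deployed my-service $IMAGE_TAG\"}"
```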
Step 24: Cross-Environment Comparison#
Create a Grafana dashboard that shows the same metrics across all three environments side by side. This makes it easy to spot when staging behavior diverges from production, or when a dev deployment causes unusual resource consumption:
```yaml
# Prometheus recording rule for cross-environment comparison
groups:
  - name: cross-environment
    rules:
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, environment))
```

Summary#
The multi-environment setup follows a clear progression: decide on isolation strategy, provision infrastructure with Terraform, configure Kubernetes resources with Kustomize overlays, manage secrets per environment, wire up promotion pipelines with gates between stages, and instrument each environment with appropriate observability. Every phase has a verification step that confirms the environment works correctly before proceeding to the next.