From Empty Cluster to Production-Ready#
This is the definitive operational plan for taking a fresh Kubernetes cluster and making it production-ready. Each phase builds on the previous one, with verification steps between phases and rollback notes where applicable. An agent should be able to follow this sequence end-to-end.
Estimated timeline: 5 days for a single operator. Phases 1-2 are blocking prerequisites. Phases 3-6 can partially overlap.
Phase 1 – Foundation (Day 1)#
Everything else depends on a healthy cluster with proper namespacing and storage. Do not proceed until every verification step passes.
Step 1: Verify Cluster Health#
# All nodes should be Ready
kubectl get nodes -o wide
# All system pods should be Running or Completed
kubectl get pods -n kube-system
# Verify cluster version (the --short flag was removed in newer kubectl releases)
kubectl version
# Test DNS resolution from inside the cluster
kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup kubernetes.default.svc.cluster.local
If any node shows NotReady, check kubelet logs on that node before proceeding. A flaky foundation wastes days later.
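A quick way to inspect a NotReady node is to describe it and, if you have host access, check the kubelet unit; a minimal sketch (node name and SSH address are placeholders):
kubectl describe node <node-name> | grep -A5 Conditions
ssh <node-address> 'sudo journalctl -u kubelet --since "30 min ago" --no-pager | tail -n 50'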
Step 2: Set Up RBAC#
# Create a cluster-admin binding for the ops team (use sparingly)
kubectl create clusterrolebinding ops-team-admin \
--clusterrole=cluster-admin \
--group=ops-team
# Create team-specific namespace roles
cat <<'EOF' | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: app-developer
namespace: app-production
rules:
- apiGroups: ["", "apps", "batch"]
resources: ["pods", "deployments", "services", "configmaps", "jobs"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["pods/log", "pods/exec"]
verbs: ["get", "create"]
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list"]
EOF
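# A Role grants nothing until it is bound. A sketch RoleBinding for a hypothetical
# app-developers group (the deploy-bot ServiceAccount created below can be added as a subject the same way):
cat <<'EOF' | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-developer-binding
  namespace: app-production
subjects:
- kind: Group
  name: app-developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-developer
  apiGroup: rbac.authorization.k8s.io
EOF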
# Create service accounts for CI/CD
kubectl create serviceaccount deploy-bot -n app-production
Step 3: Configure Namespace Strategy#
# Create core namespaces
for ns in app-production app-staging monitoring ingress-system cert-manager; do
kubectl create namespace $ns --dry-run=client -o yaml | kubectl apply -f -
done
# Apply ResourceQuotas to production
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: app-production
spec:
hard:
requests.cpu: "20"
requests.memory: 40Gi
limits.cpu: "40"
limits.memory: 80Gi
pods: "100"
services: "20"
persistentvolumeclaims: "30"
EOF
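# To see current usage against this quota at any time:
kubectl describe resourcequota production-quota -n app-production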
# Apply LimitRange defaults
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: app-production
spec:
limits:
- default:
cpu: 500m
memory: 512Mi
defaultRequest:
cpu: 100m
memory: 128Mi
type: Container
EOF
Step 4: Set Up Storage Classes#
# Check existing storage classes
kubectl get storageclass
# Create a fast SSD class (example for AWS EBS)
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "5000"
throughput: "250"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
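# To mark a class as the cluster default (only one class should carry this annotation;
# choosing fast-ssd here is an example, not a requirement):
kubectl patch storageclass fast-ssd -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'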
# Verify default storage class exists (exactly one should be marked default)
kubectl get storageclass -o custom-columns=NAME:.metadata.name,DEFAULT:".metadata.annotations.storageclass\.kubernetes\.io/is-default-class"
Phase 1 Verification#
kubectl get nodes # all Ready
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded # should be empty
kubectl get resourcequota -n app-production # quota applied
kubectl get limitrange -n app-production # limits applied
kubectl get storageclass # default + fast-ssd present
Rollback: Phase 1 resources are foundational. If something is wrong, fix it in place rather than tearing down; ResourceQuota and LimitRange changes can simply be re-applied with kubectl apply, without recreating namespaces.
Phase 2 – Networking (Day 1-2)#
Step 6: Verify CNI and Network Policies#
# Identify CNI plugin
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|weave'
# If CNI does not support network policies (e.g., flannel), install Calico
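# A hedged sketch of the operator-based Calico install (the version tag is an assumption;
# check the current release, and note that replacing an already-running CNI is disruptive):
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/custom-resources.yaml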
# Decision point: if using a managed cluster (EKS, GKE, AKS), the CNI is usually pre-installed
Step 7: Install Ingress Controller#
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-system \
--set controller.replicaCount=2 \
--set controller.resources.requests.cpu=100m \
--set controller.resources.requests.memory=128Mi \
--set controller.resources.limits.cpu=500m \
--set controller.resources.limits.memory=512Mi \
--set controller.metrics.enabled=true \
--set controller.podAntiAffinity.type=hard
Step 8: Configure cert-manager#
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--set crds.enabled=true \
--set resources.requests.cpu=50m \
--set resources.requests.memory=64Mi
# Wait for cert-manager pods
kubectl wait --for=condition=Ready pods -l app.kubernetes.io/instance=cert-manager -n cert-manager --timeout=120s
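# Optional but useful while testing: a staging issuer avoids Let's Encrypt rate limits
# (same shape as the production issuer created below):
cat <<'EOF' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
    - http01:
        ingress:
          class: nginx
EOF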
# Create ClusterIssuer for Let's Encrypt
cat <<'EOF' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: ops@example.com
privateKeySecretRef:
name: letsencrypt-prod-key
solvers:
- http01:
ingress:
class: nginx
EOF
Step 9: Configure external-dns (Optional)#
# Only if you want automatic DNS record management
helm repo add external-dns https://kubernetes-sigs.github.io/external-dns
helm install external-dns external-dns/external-dns \
--namespace ingress-system \
--set provider=aws \
--set domainFilters[0]=example.com \
--set policy=sync \
--set txtOwnerId=my-cluster
Phase 2 Verification#
# Create a test ingress to verify TLS
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: test-ingress
namespace: app-staging
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
ingressClassName: nginx
tls:
- hosts:
- test.example.com
secretName: test-tls
rules:
- host: test.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: test-svc
port:
number: 80
EOF
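# Note: the HTTP-01 challenge only succeeds if test.example.com resolves to the
# ingress controller's load balancer. To block until issuance (or a timeout):
kubectl wait --for=condition=Ready certificate/test-tls -n app-staging --timeout=5m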
# Check certificate was issued
kubectl get certificate -n app-staging
kubectl describe certificate test-tls -n app-staging
# Clean up test resources
kubectl delete ingress test-ingress -n app-staging
Rollback: helm uninstall ingress-nginx -n ingress-system and helm uninstall cert-manager -n cert-manager. Delete CRDs manually with kubectl delete -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.crds.yaml.
Phase 3 – Observability (Day 2-3)#
Step 11: Install metrics-server#
# Required for kubectl top and HPA
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Wait and verify
kubectl wait --for=condition=Ready pods -l k8s-app=metrics-server -n kube-system --timeout=120s
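# If the pod never becomes Ready because it cannot verify kubelet serving certificates
# (common on kubeadm and other self-managed clusters), one workaround is:
kubectl patch deployment metrics-server -n kube-system --type='json' \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'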
kubectl top nodes
Step 12: Deploy Prometheus Stack#
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.retention=15d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=fast-ssd \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
--set prometheus.prometheusSpec.resources.requests.cpu=500m \
--set prometheus.prometheusSpec.resources.requests.memory=2Gi \
--set prometheus.prometheusSpec.resources.limits.memory=4Gi \
--set grafana.persistence.enabled=true \
--set grafana.persistence.size=10Gi \
--set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=5Gi
Step 13: Deploy Loki#
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki \
--namespace monitoring \
--set loki.auth_enabled=false \
--set singleBinary.replicas=1 \
--set singleBinary.resources.requests.cpu=200m \
--set singleBinary.resources.requests.memory=512Mi
# Install Promtail to ship logs to Loki
helm install promtail grafana/promtail \
--namespace monitoring \
--set config.clients[0].url=http://loki:3100/loki/api/v1/push
Step 14-15: Configure Alerting and Dashboards#
# Alertmanager is included in kube-prometheus-stack
# Configure receivers by editing the Helm values or applying a secret
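# A sketch of a Slack receiver via Helm values (the webhook URL and channel are placeholders;
# alertmanager.config is the kube-prometheus-stack values key that templates the Alertmanager config):
cat <<'EOF' > alertmanager-values.yaml
alertmanager:
  config:
    route:
      receiver: slack-notifications
    receivers:
      - name: slack-notifications
        slack_configs:
          - api_url: https://hooks.slack.com/services/PLACEHOLDER
            channel: '#alerts'
EOF
helm upgrade kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --reuse-values -f alertmanager-values.yaml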
# Verify Grafana is accessible
kubectl get svc -n monitoring | grep grafana
kubectl port-forward svc/kube-prometheus-grafana 3000:80 -n monitoring &
# Default credentials: admin / prom-operator
# Add Loki as data source in Grafana: URL = http://loki:3100
Phase 3 Verification#
kubectl top nodes # metrics-server working
kubectl top pods -n monitoring # pod metrics available
kubectl get servicemonitor -n monitoring # ServiceMonitors discovered
kubectl port-forward svc/kube-prometheus-grafana 3000:80 -n monitoring # Grafana accessible
Rollback: helm uninstall kube-prometheus -n monitoring && helm uninstall loki -n monitoring && helm uninstall promtail -n monitoring. PVCs will remain; delete them manually if needed.
Phase 4 – Security Hardening (Day 3-4)#
Step 17: Apply Pod Security Standards#
# Enforce baseline on application namespaces, warn on restricted
kubectl label namespace app-production \
pod-security.kubernetes.io/enforce=baseline \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
kubectl label namespace app-staging \
pod-security.kubernetes.io/enforce=baseline \
pod-security.kubernetes.io/warn=restricted
Step 18: Install Policy Engine#
# Using Kyverno as an example (lighter than Gatekeeper for most use cases)
helm repo add kyverno https://kyverno.github.io/kyverno
helm install kyverno kyverno/kyverno \
--namespace kyverno \
--create-namespace \
--set replicaCount=2
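# Wait for the Kyverno admission webhooks before applying policies (the label assumes the
# standard Helm instance label for this release name):
kubectl wait --for=condition=Ready pods -l app.kubernetes.io/instance=kyverno -n kyverno --timeout=120s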
# Apply baseline policies
helm install kyverno-policies kyverno/kyverno-policies \
--namespace kyverno \
--set podSecurityStandard=baseline \
--set validationFailureAction=Enforce
Step 19-21: Audit, Network Policies, Secrets Encryption#
# Default deny in production namespace
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: app-production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
EOF
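# With default-deny in place, explicitly allow traffic from the ingress controller
# (a sketch; tighten podSelector to your application labels as needed):
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: app-production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-system
EOF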
# Allow DNS egress (essential -- without this, nothing resolves)
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
namespace: app-production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to: []
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
EOF
Phase 4 Verification#
# Test PSA enforcement: this should be rejected or warned
kubectl run priv-test --image=busybox:1.36 --restart=Never -n app-production \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"test","image":"busybox:1.36","securityContext":{"privileged":true}}]}}'
# Expected: rejected by Pod Security Standards
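# Clean up in case the namespace only warned rather than rejected the pod
kubectl delete pod priv-test -n app-production --ignore-not-found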
# Verify network policies
kubectl get networkpolicy -n app-production
Rollback: Remove namespace labels with kubectl label namespace app-production pod-security.kubernetes.io/enforce-. Uninstall Kyverno with helm uninstall kyverno -n kyverno.
Phase 5 – GitOps and Deployment (Day 4-5)#
Step 23: Install ArgoCD#
kubectl create namespace argocd
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd \
--namespace argocd \
--set server.service.type=ClusterIP \
--set server.resources.requests.cpu=100m \
--set server.resources.requests.memory=128Mi \
--set configs.params."server\.insecure"=true
# Get initial admin password
kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath='{.data.password}' | base64 -d
Step 24-26: Connect Repos and Configure#
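# The argocd CLI commands below need an authenticated session first; a sketch via
# port-forward (the service name assumes the chart defaults; with server.insecure=true
# the API is plain HTTP, hence --plaintext):
kubectl port-forward svc/argocd-server -n argocd 8080:80 &
argocd login localhost:8080 --username admin --plaintext \
  --password "$(kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath='{.data.password}' | base64 -d)"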
# Add a git repository
argocd repo add https://github.com/org/k8s-manifests.git \
--username git --password $GIT_TOKEN
# Create an Application
cat <<'EOF' | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: production-app
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/org/k8s-manifests.git
targetRevision: main
path: production
destination:
server: https://kubernetes.default.svc
namespace: app-production
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=false
EOF
Phase 5 Verification#
kubectl get applications -n argocd
argocd app get production-app
# Push a change to the git repo and verify ArgoCD syncs it within 3 minutes
Rollback: helm uninstall argocd -n argocd. Applications will stop syncing but existing resources remain in the cluster.
Phase 6 – Reliability (Day 5+)#
Step 28: PodDisruptionBudgets#
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-api-pdb
namespace: app-production
spec:
minAvailable: 2
selector:
matchLabels:
app: web-api
EOF
Step 29-30: Auto-scaling#
# HPA for application pods
kubectl autoscale deployment web-api -n app-production \
--min=3 --max=10 --cpu-percent=70
# Cluster Autoscaler (cloud-specific) or Karpenter
# For EKS with Karpenter:
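# Note: charts.karpenter.sh hosts only older Karpenter releases; newer versions are
# published as OCI charts (oci://public.ecr.aws/karpenter/karpenter). Karpenter also
# needs controller/node IAM roles and a NodePool/EC2NodeClass before it can provision nodes.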
helm repo add karpenter https://charts.karpenter.sh
helm install karpenter karpenter/karpenter \
--namespace kube-system \
--set settings.clusterName=my-cluster \
--set settings.clusterEndpoint=$(aws eks describe-cluster --name my-cluster --query "cluster.endpoint" --output text)
Step 31-32: Backup and Disaster Recovery#
# Install Velero
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
--namespace velero \
--create-namespace \
--set configuration.backupStorageLocation[0].provider=aws \
--set configuration.backupStorageLocation[0].bucket=my-cluster-backups \
--set configuration.backupStorageLocation[0].config.region=us-east-1 \
--set snapshotsEnabled=true
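# Note: the AWS object-store plugin (initContainers) and credentials (for example an IRSA role)
# also need to be configured for this chart. Once installed, confirm the location is Available:
velero backup-location get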
# Create a scheduled backup
velero schedule create daily-backup --schedule="0 2 * * *" --ttl 720h
# Test disaster recovery
velero backup create test-backup --include-namespaces=app-staging
# Delete a deployment in staging, then restore
kubectl delete deployment test-app -n app-staging
velero restore create --from-backup test-backup --include-namespaces=app-staging
Phase 6 Verification#
kubectl get pdb -n app-production # PDBs in place
kubectl get hpa -n app-production # HPA configured
velero backup get # backups listed
velero restore get # restore succeeded
Rollback: PDBs and HPAs can be deleted directly. Velero uninstall: helm uninstall velero -n velero.
Sequence Summary#
| Phase | Day | Critical Gate |
|---|---|---|
| Foundation | 1 | All nodes Ready, DNS works, namespaces exist |
| Networking | 1-2 | Ingress serves traffic, TLS certificates issued |
| Observability | 2-3 | Metrics, logs, and alerts flowing |
| Security | 3-4 | Privileged pods blocked, network policies enforced |
| GitOps | 4-5 | Git push triggers deployment |
| Reliability | 5+ | Backup/restore tested, auto-scaling active |
Do not skip phases. Each phase’s verification gates must pass before moving forward. If a verification step fails, fix it before continuing – problems compound across phases.