From Empty Cluster to Production-Ready#

This is the definitive operational plan for taking a fresh Kubernetes cluster and making it production-ready. Each phase builds on the previous one, with verification steps between phases and rollback notes where applicable. An agent should be able to follow this sequence end-to-end.

Estimated timeline: 5 days for a single operator. Phases 1-2 are blocking prerequisites. Phases 3-6 can partially overlap.


Phase 1 – Foundation (Day 1)#

Everything else depends on a healthy cluster with proper namespacing and storage. Do not proceed until every verification step passes.

Step 1: Verify Cluster Health#

# All nodes should be Ready
kubectl get nodes -o wide

# All system pods should be Running or Completed
kubectl get pods -n kube-system

# Verify cluster version (recent kubectl prints the short form by default; --short was removed)
kubectl version

# Test DNS resolution from inside the cluster
kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup kubernetes.default.svc.cluster.local

If any node shows NotReady, check kubelet logs on that node before proceeding. A flaky foundation wastes days later.
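
One way to dig into a NotReady node, assuming SSH access and a systemd-based node (the node name is illustrative):

# Check node conditions and recent events
kubectl describe node <node-name>

# On the node itself, inspect the kubelet service
journalctl -u kubelet --since "1 hour ago" | tail -n 100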

Step 2: Set Up RBAC#

# Create a cluster-admin binding for the ops team (use sparingly)
kubectl create clusterrolebinding ops-team-admin \
  --clusterrole=cluster-admin \
  --group=ops-team

# Create team-specific namespace roles
cat <<'EOF' | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-developer
  namespace: app-production
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["pods", "deployments", "services", "configmaps", "jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["pods/log", "pods/exec"]
  verbs: ["get", "create"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list"]
EOF

# Create service accounts for CI/CD
kubectl create serviceaccount deploy-bot -n app-production
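
The deploy-bot account has no permissions until it is bound to a role. A minimal sketch binding it to the app-developer Role defined above (the binding name is illustrative):

cat <<'EOF' | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy-bot-app-developer
  namespace: app-production
subjects:
- kind: ServiceAccount
  name: deploy-bot
  namespace: app-production
roleRef:
  kind: Role
  name: app-developer
  apiGroup: rbac.authorization.k8s.io
EOF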

Step 3: Configure Namespace Strategy#

# Create core namespaces
for ns in app-production app-staging monitoring ingress-system cert-manager; do
  kubectl create namespace $ns --dry-run=client -o yaml | kubectl apply -f -
done

# Apply ResourceQuotas to production
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: app-production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services: "20"
    persistentvolumeclaims: "30"
EOF

# Apply LimitRange defaults
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: app-production
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    type: Container
EOF

Step 4: Set Up Storage Classes#

# Check existing storage classes
kubectl get storageclass

# Create a fast SSD class (example for AWS EBS)
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

# Verify default storage class exists (exactly one should be marked default)
kubectl get storageclass -o custom-columns=NAME:.metadata.name,DEFAULT:".metadata.annotations.storageclass\.kubernetes\.io/is-default-class"
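
If the wrong class (or no class) is marked default, flip the annotation directly; gp2 here stands in for whatever class is currently default:

kubectl patch storageclass gp2 -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
kubectl patch storageclass fast-ssd -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'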

Phase 1 Verification#

kubectl get nodes                              # all Ready
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded  # should be empty
kubectl get resourcequota -n app-production    # quota applied
kubectl get limitrange -n app-production       # limits applied
kubectl get storageclass                       # default + fast-ssd present

Rollback: Phase 1 resources are foundational. If something is wrong, fix it in place rather than tearing down. ResourceQuotas and LimitRanges can be changed by re-applying the manifests; there is no need to delete namespaces for quota or limit changes.


Phase 2 – Networking (Day 1-2)#

Step 6: Verify CNI and Network Policies#

# Identify CNI plugin
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|weave'

# If CNI does not support network policies (e.g., flannel), install Calico
# Decision point: if using a managed cluster (EKS, GKE, AKS), the CNI is usually pre-installed
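
A minimal sketch of a Calico install via its operator Helm chart, assuming no managed CNI is already handling policies (check the chart version against your Kubernetes version in the Calico docs first):

helm repo add projectcalico https://docs.tigera.io/calico/charts
helm repo update
helm install calico projectcalico/tigera-operator \
  --namespace tigera-operator \
  --create-namespace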

Step 7: Install Ingress Controller#

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-system \
  --set controller.replicaCount=2 \
  --set controller.resources.requests.cpu=100m \
  --set controller.resources.requests.memory=128Mi \
  --set controller.resources.limits.cpu=500m \
  --set controller.resources.limits.memory=512Mi \
  --set controller.metrics.enabled=true
# Spread the two replicas across nodes with pod anti-affinity via controller.affinity
# (or controller.topologySpreadConstraints) in a values file; those structured values
# are awkward to pass with --set.
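
Before moving on, confirm the controller pods are running and the Service has an external address (the Service name follows the chart's <release>-controller convention):

kubectl get pods -n ingress-system -l app.kubernetes.io/name=ingress-nginx
kubectl get svc -n ingress-system ingress-nginx-controller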

Step 8: Configure cert-manager#

helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --set crds.enabled=true \
  --set resources.requests.cpu=50m \
  --set resources.requests.memory=64Mi

# Wait for cert-manager pods
kubectl wait --for=condition=Ready pods -l app.kubernetes.io/instance=cert-manager -n cert-manager --timeout=120s

# Create ClusterIssuer for Let's Encrypt
cat <<'EOF' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - http01:
        ingress:
          class: nginx
EOF

Step 9: Configure external-dns (Optional)#

# Only if you want automatic DNS record management
helm repo add external-dns https://kubernetes-sigs.github.io/external-dns
helm install external-dns external-dns/external-dns \
  --namespace ingress-system \
  --set provider=aws \
  --set domainFilters[0]=example.com \
  --set policy=sync \
  --set txtOwnerId=my-cluster

Phase 2 Verification#

# Create a test ingress to verify TLS
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-ingress
  namespace: app-staging
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - test.example.com
    secretName: test-tls
  rules:
  - host: test.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: test-svc
            port:
              number: 80
EOF

# Check certificate was issued
kubectl get certificate -n app-staging
kubectl describe certificate test-tls -n app-staging

# Clean up test resources (the issued TLS secret is not removed with the ingress)
kubectl delete ingress test-ingress -n app-staging
kubectl delete secret test-tls -n app-staging

Rollback: helm uninstall ingress-nginx -n ingress-system and helm uninstall cert-manager -n cert-manager. If CRDs are left behind, delete them with kubectl delete -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.crds.yaml, substituting the tag of the release you actually installed.


Phase 3 – Observability (Day 2-3)#

Step 11: Install metrics-server#

# Required for kubectl top and HPA
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Wait and verify
kubectl wait --for=condition=Ready pods -l k8s-app=metrics-server -n kube-system --timeout=120s
kubectl top nodes
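
On clusters where the kubelet serves a self-signed certificate (common with kubeadm), metrics-server may never become Ready. A commonly documented workaround, shown here only as a sketch, is to add the --kubelet-insecure-tls flag:

kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'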

Step 12: Deploy Prometheus Stack#

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=fast-ssd \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set prometheus.prometheusSpec.resources.requests.cpu=500m \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set prometheus.prometheusSpec.resources.limits.memory=4Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=5Gi

Step 13: Deploy Loki#

helm repo add grafana https://grafana.github.io/helm-charts

helm install loki grafana/loki \
  --namespace monitoring \
  --set loki.auth_enabled=false \
  --set singleBinary.replicas=1 \
  --set singleBinary.resources.requests.cpu=200m \
  --set singleBinary.resources.requests.memory=512Mi

# Install Promtail to ship logs to Loki
helm install promtail grafana/promtail \
  --namespace monitoring \
  --set config.clients[0].url=http://loki:3100/loki/api/v1/push

Step 14-15: Configure Alerting and Dashboards#

# Alertmanager is included in kube-prometheus-stack
# Configure receivers by editing the Helm values or applying a secret

# Verify Grafana is accessible
kubectl get svc -n monitoring | grep grafana
kubectl port-forward svc/kube-prometheus-grafana 3000:80 -n monitoring &

# Default credentials: admin / prom-operator
# Add Loki as data source in Grafana: URL = http://loki:3100
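
A minimal sketch of wiring a receiver through the chart's alertmanager.config value; the webhook URL is a placeholder for whatever gateway you use, and this replaces the default Alertmanager config wholesale:

cat <<'EOF' > alertmanager-values.yaml
alertmanager:
  config:
    route:
      receiver: ops-webhook
      group_by: ['alertname', 'namespace']
    receivers:
    - name: ops-webhook
      webhook_configs:
      - url: http://alert-gateway.monitoring.svc:8080/alerts
EOF

helm upgrade kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --reuse-values -f alertmanager-values.yaml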

Phase 3 Verification#

kubectl top nodes                                                  # metrics-server working
kubectl top pods -n monitoring                                     # pod metrics available
kubectl get servicemonitor -n monitoring                           # ServiceMonitors discovered
kubectl port-forward svc/kube-prometheus-grafana 3000:80 -n monitoring  # Grafana accessible

Rollback: helm uninstall kube-prometheus -n monitoring && helm uninstall loki -n monitoring && helm uninstall promtail -n monitoring. PVCs will remain; delete them manually if needed.


Phase 4 – Security Hardening (Day 3-4)#

Step 17: Apply Pod Security Standards#

# Enforce baseline on application namespaces, warn on restricted
kubectl label namespace app-production \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

kubectl label namespace app-staging \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted
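
Before tightening enforce to restricted later, a server-side dry run shows which existing workloads would be flagged without persisting any change:

kubectl label --dry-run=server --overwrite namespace app-production \
  pod-security.kubernetes.io/enforce=restricted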

Step 18: Install Policy Engine#

# Using Kyverno as example (lighter than Gatekeeper for most use cases)
helm repo add kyverno https://kyverno.github.io/kyverno
helm install kyverno kyverno/kyverno \
  --namespace kyverno \
  --create-namespace \
  --set replicaCount=2

# Apply baseline policies
helm install kyverno-policies kyverno/kyverno-policies \
  --namespace kyverno \
  --set podSecurityStandard=baseline \
  --set validationFailureAction=Enforce

Step 19-21: Audit, Network Policies, Secrets Encryption#

# Default deny in production namespace
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: app-production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF

# Allow DNS egress (essential -- without this, nothing resolves)
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: app-production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to: []
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
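
With default-deny in place, traffic from the ingress controller to application pods is also blocked. A sketch of an allow rule, assuming the controller runs in the ingress-system namespace created in Phase 1:

cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: app-production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-system
EOF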

Phase 4 Verification#

# Test PSA enforcement: this should be rejected or warned
kubectl run priv-test --image=busybox:1.36 --restart=Never -n app-production \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"test","image":"busybox:1.36","securityContext":{"privileged":true}}]}}'
# Expected: rejected by Pod Security Standards

# Verify network policies
kubectl get networkpolicy -n app-production
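
A quick connectivity check that egress is actually blocked (DNS should still resolve because of the allow-dns policy, but the download should time out):

kubectl run egress-test --image=busybox:1.36 --restart=Never --rm -it -n app-production \
  -- wget -qO- -T 5 http://example.com
# Expected: wget fails with a timeout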

Rollback: Remove namespace labels with kubectl label namespace app-production pod-security.kubernetes.io/enforce-. Uninstall Kyverno with helm uninstall kyverno -n kyverno.


Phase 5 – GitOps and Deployment (Day 4-5)#

Step 23: Install ArgoCD#

kubectl create namespace argocd
helm repo add argo https://argoproj.github.io/argo-helm

helm install argocd argo/argo-cd \
  --namespace argocd \
  --set server.service.type=ClusterIP \
  --set server.resources.requests.cpu=100m \
  --set server.resources.requests.memory=128Mi \
  --set configs.params."server\.insecure"=true

# Get initial admin password
kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath='{.data.password}' | base64 -d
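
The argocd CLI used in the next step needs an authenticated session. A sketch assuming the CLI is installed locally; --plaintext matches the server.insecure=true setting above (use --insecure instead against a TLS-enabled server):

kubectl port-forward svc/argocd-server 8080:443 -n argocd &
argocd login localhost:8080 --username admin --plaintext \
  --password "$(kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath='{.data.password}' | base64 -d)"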

Step 24-26: Connect Repos and Configure#

# Add a git repository
argocd repo add https://github.com/org/k8s-manifests.git \
  --username git --password $GIT_TOKEN

# Create an Application
cat <<'EOF' | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests.git
    targetRevision: main
    path: production
  destination:
    server: https://kubernetes.default.svc
    namespace: app-production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=false
EOF

Phase 5 Verification#

kubectl get applications -n argocd
argocd app get production-app
# Push a change to the git repo and verify ArgoCD syncs it within 3 minutes

Rollback: helm uninstall argocd -n argocd. Applications will stop syncing but existing resources remain in the cluster.


Phase 6 – Reliability (Day 5+)#

Step 28: PodDisruptionBudgets#

cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
  namespace: app-production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-api
EOF

Step 29-30: Auto-scaling#

# HPA for application pods
kubectl autoscale deployment web-api -n app-production \
  --min=3 --max=10 --cpu-percent=70
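
The same HPA expressed declaratively (autoscaling/v2), which is what you would commit to the GitOps repo; the deployment name matches the PDB example above:

cat <<'EOF' | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
  namespace: app-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
EOF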

# Cluster Autoscaler (cloud-specific) or Karpenter
# For EKS with Karpenter (the chart is now published as an OCI artifact; the old
# charts.karpenter.sh repo is deprecated). Karpenter also needs IAM roles, an
# interruption queue, and NodePool/EC2NodeClass resources -- see its docs.
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system \
  --set settings.clusterName=my-cluster \
  --set settings.clusterEndpoint=$(aws eks describe-cluster --name my-cluster --query "cluster.endpoint" --output text)

Step 31-32: Backup and Disaster Recovery#

# Install Velero
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set configuration.backupStorageLocation[0].provider=aws \
  --set configuration.backupStorageLocation[0].bucket=my-cluster-backups \
  --set configuration.backupStorageLocation[0].config.region=us-east-1 \
  --set snapshotsEnabled=true
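
As written, the release has no object-store plugin or credentials, so backups will fail until those are added. A sketch of the extra values for AWS (plugin image tag and secret name are illustrative; IRSA can replace the secret):

# Append to the helm install above:
#   --set initContainers[0].name=velero-plugin-for-aws \
#   --set initContainers[0].image=velero/velero-plugin-for-aws:v1.9.0 \
#   --set initContainers[0].volumeMounts[0].mountPath=/target \
#   --set initContainers[0].volumeMounts[0].name=plugins \
#   --set credentials.existingSecret=velero-aws-credentials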

# Create a scheduled backup
velero schedule create daily-backup --schedule="0 2 * * *" --ttl 720h

# Test disaster recovery
velero backup create test-backup --include-namespaces=app-staging
# Delete a deployment in staging, then restore
kubectl delete deployment test-app -n app-staging
velero restore create --from-backup test-backup --include-namespaces=app-staging

Phase 6 Verification#

kubectl get pdb -n app-production           # PDBs in place
kubectl get hpa -n app-production           # HPA configured
velero backup get                           # backups listed
velero restore get                          # restore succeeded

Rollback: PDBs and HPAs can be deleted directly. Velero uninstall: helm uninstall velero -n velero.


Sequence Summary#

Phase           Day   Critical Gate
Foundation      1     All nodes Ready, DNS works, namespaces exist
Networking      1-2   Ingress serves traffic, TLS certificates issued
Observability   2-3   Metrics, logs, and alerts flowing
Security        3-4   Privileged pods blocked, network policies enforced
GitOps          4-5   Git push triggers deployment
Reliability     5+    Backup/restore tested, auto-scaling active

Do not skip phases. Each phase’s verification gates must pass before moving forward. If a verification step fails, fix it before continuing – problems compound across phases.