Kubernetes Production Readiness Checklist#
This checklist is designed for agents to audit a Kubernetes cluster before production workloads run on it. Every item includes the verification command and what a passing result looks like. Work through each category sequentially. A failing item in Cluster Health should be fixed before checking Workload Configuration.
Cluster Health#
These are non-negotiable. If any of these fail, stop and fix them before evaluating anything else.
All nodes in Ready state#
kubectl get nodes -o wide
Pass: Every node shows STATUS: Ready. No NotReady, SchedulingDisabled, or Unknown.
If failing: Check kubelet logs on the affected node (journalctl -u kubelet -n 50). Common causes: expired certificates, disk pressure, memory pressure.
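A quick triage sequence on the affected node might look like this (a sketch assuming a kubeadm-provisioned node running containerd; the kubeadm CLI and these paths are not present on all managed node images):
# Certificate expiry (kubeadm clusters)
sudo kubeadm certs check-expiration
# Disk pressure: kubelet evicts pods when these volumes fill
df -h /var/lib/kubelet /var/lib/containerd
# Memory pressure and recent kubelet errors
free -m
journalctl -u kubelet -n 50 --no-pager | grep -iE 'error|evict|pressure'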
System pods healthy#
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
Pass: Empty output. Every system pod is Running or Succeeded (for completed Jobs).
If failing: kubectl describe pod <pod> -n kube-system to check events. CoreDNS and kube-proxy failures are critical blockers.
DNS resolution working#
# Pod-to-service resolution
kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup kubernetes.default.svc.cluster.local
# Pod-to-external resolution
kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup google.com
Pass: Both resolve with valid IP addresses. Internal resolution returns the cluster service IP. External resolution returns a public IP.
If failing: Check CoreDNS pods and ConfigMap. See the DNS debugging knowledge article for detailed troubleshooting.
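A minimal CoreDNS check before reaching for the full article (the k8s-app=kube-dns label is the default in standard installs; adjust if your cluster labels CoreDNS differently):
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
kubectl get configmap coredns -n kube-system -o yaml  # look for Corefile typos or broken forwarders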
Cluster version matches target#
kubectl version  # the --short flag was removed in kubectl 1.28; the default output is now the short form
Pass: Server version matches your target release (e.g., v1.29.x). Not running an alpha, beta, or end-of-life version. The version is within the supported window (N-2 minor releases from latest stable).
etcd health verified#
# On managed clusters (EKS, GKE, AKS), etcd is managed by the provider -- skip this
# On self-managed clusters:
kubectl exec -n kube-system etcd-<node-name> -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
Pass: 127.0.0.1:2379 is healthy: successfully committed proposal.
Workload Configuration#
Check every Deployment, StatefulSet, and DaemonSet that will run production traffic.
All containers have resource requests AND limits#
# Find containers without requests or limits
kubectl get pods -A -o json | jq -r '
.items[] |
.metadata.namespace as $ns |
.metadata.name as $pod |
.spec.containers[] |
select(.resources.requests == null or .resources.limits == null) |
"\($ns)/\($pod)/\(.name): missing requests or limits"'Pass: Empty output. Every container has both requests.cpu, requests.memory, limits.cpu, and limits.memory set.
Why it matters: Without requests, the scheduler cannot make placement decisions. Without limits, a single pod can consume all node resources.
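As a stopgap, requests and limits can be set in place (names and values below are placeholders; the durable fix belongs in your manifests or Helm values in source control):
kubectl set resources deployment/<name> -n <namespace> \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=512Mi
Note this triggers a rolling restart of the deployment, so apply it during a safe window.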
Liveness and readiness probes configured#
kubectl get pods -A -o json | jq -r '
.items[] |
.metadata.namespace as $ns |
.metadata.name as $pod |
.spec.containers[] |
select(.readinessProbe == null or .livenessProbe == null) |
"\($ns)/\($pod)/\(.name): missing probes"'Pass: No application containers listed. System containers (kube-proxy, CNI agents) may legitimately lack probes.
Common mistake: Setting livenessProbe and readinessProbe to the same endpoint and timing. The liveness probe should be more lenient (higher failureThreshold) because a liveness failure restarts the container.
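A sketch of divergent timings via a strategic merge patch (deployment, container name, and endpoint are placeholders; kubectl patch accepts YAML as well as JSON):
kubectl patch deployment <name> -n <namespace> --type=strategic -p '
spec:
  template:
    spec:
      containers:
      - name: <container>
        readinessProbe:
          httpGet: {path: /healthz, port: 8080}
          periodSeconds: 5
          failureThreshold: 3   # out of rotation after ~15s
        livenessProbe:
          httpGet: {path: /healthz, port: 8080}
          periodSeconds: 10
          failureThreshold: 6'  # restarted only after ~60s of sustained failure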
Pod anti-affinity for multi-replica deployments#
kubectl get deployments -A -o json | jq -r '
.items[] |
select(.spec.replicas > 1) |
select(.spec.template.spec.affinity.podAntiAffinity == null) |
"\(.metadata.namespace)/\(.metadata.name): replicas=\(.spec.replicas) but no pod anti-affinity"'Pass: No multi-replica deployments without anti-affinity. All replicas should spread across nodes.
PodDisruptionBudgets for critical services#
# List deployments with no corresponding PDB
kubectl get pdb -A -o json | jq -r '.items[].spec.selector.matchLabels' > /tmp/pdb-selectors.json
kubectl get deployments -A -o json | jq -r '
.items[] |
select(.spec.replicas > 1) |
"\(.metadata.namespace)/\(.metadata.name)"'
# Cross-reference manually: every multi-replica deployment should have a PDB
Pass: Every critical multi-replica deployment has a PDB with minAvailable or maxUnavailable set.
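A PDB for an existing deployment can be created imperatively (name and selector are placeholders; for a 2-replica deployment, minAvailable=1 lets node drains proceed one pod at a time):
kubectl create poddisruptionbudget <name>-pdb -n <namespace> \
  --selector=app=<app-label> --min-available=1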
Graceful shutdown handling#
kubectl get deployments -A -o json | jq -r '
.items[] |
select(.spec.template.spec.terminationGracePeriodSeconds == null or
.spec.template.spec.terminationGracePeriodSeconds == 30) |
"\(.metadata.namespace)/\(.metadata.name): using default terminationGracePeriodSeconds (30s)"'Pass: Critical services have a terminationGracePeriodSeconds appropriate for their shutdown behavior. Services with long-running connections or background jobs need longer than the 30s default.
Image tags are pinned (no :latest)#
kubectl get pods -A -o json | jq -r '
.items[] |
.metadata.namespace as $ns |
.metadata.name as $pod |
.spec.containers[] |
select(.image | test(":latest$") or (test(":") | not)) |
"\($ns)/\($pod): \(.image)"'Pass: Empty output. Every image uses a specific tag or digest (e.g., nginx:1.25.3 or nginx@sha256:...). Never :latest in production.
Images from trusted registries only#
kubectl get pods -A -o json | jq -r '
.items[] |
.spec.containers[] |
.image' | sort -u | grep -v -E '^(registry\.company\.com|gcr\.io/my-project|[0-9]+\.dkr\.ecr)'
Pass: All images come from your organization's approved registries. No images from Docker Hub public repositories in production.
Security#
RBAC configured (no unnecessary cluster-admin)#
kubectl get clusterrolebindings -o json | jq -r '
.items[] |
select(.roleRef.name == "cluster-admin") |
"\(.metadata.name): \(.subjects // [] | map(.name) | join(", "))"'Pass: Only system accounts and a single ops-team binding have cluster-admin. No individual user accounts, no CI/CD service accounts with cluster-admin.
Pod Security Standards enforced#
kubectl get namespaces -o json | jq -r '
.items[] |
select(.metadata.labels["pod-security.kubernetes.io/enforce"] == null) |
.metadata.name'
Pass: Only system namespaces (e.g., kube-system) are listed. Every application namespace has at least baseline enforcement, with the restricted level applied as warn or audit on production namespaces.
Network policies in place (default deny)#
kubectl get networkpolicy -A -o json | jq -r '
.items[] |
select(.spec.podSelector == {} or .spec.podSelector.matchLabels == null) |
"\(.metadata.namespace)/\(.metadata.name)"'Pass: Every application namespace has a default-deny NetworkPolicy with an empty podSelector.
Service accounts not using default SA#
kubectl get pods -A -o json | jq -r '
.items[] |
select(.metadata.namespace != "kube-system") |
select(.spec.serviceAccountName == "default" or .spec.serviceAccountName == null) |
"\(.metadata.namespace)/\(.metadata.name): using default ServiceAccount"'Pass: No application pods use the default ServiceAccount. Each workload has its own SA with minimal permissions.
Containers run as non-root#
kubectl get pods -A -o json | jq -r '
.items[] |
.metadata.namespace as $ns |
.metadata.name as $pod |
.spec.containers[] |
select(.securityContext.runAsNonRoot != true) |
select(.securityContext.runAsUser == null or .securityContext.runAsUser == 0) |
"\($ns)/\($pod)/\(.name): may run as root"'Pass: All application containers have runAsNonRoot: true or an explicit non-zero runAsUser.
Read-only root filesystem where possible#
kubectl get pods -A -o json | jq -r '
.items[] |
.metadata.namespace as $ns |
.metadata.name as $pod |
.spec.containers[] |
select(.securityContext.readOnlyRootFilesystem != true) |
"\($ns)/\($pod)/\(.name): writable root filesystem"'Pass: Most application containers use readOnlyRootFilesystem: true with emptyDir mounts for any paths that need writes (e.g., /tmp).
Networking#
Ingress TLS configured#
kubectl get ingress -A -o json | jq -r '
.items[] |
select(.spec.tls == null or (.spec.tls | length) == 0) |
"\(.metadata.namespace)/\(.metadata.name): no TLS configured"'Pass: Every Ingress resource has a tls section with valid secret references.
cert-manager auto-renewal working#
kubectl get certificates -A -o json | jq -r '
.items[] |
"\(.metadata.namespace)/\(.metadata.name): ready=\(.status.conditions[] | select(.type=="Ready") | .status) renewal=\(.status.renewalTime)"'Pass: All certificates show Ready=True and renewalTime is in the future.
Load balancer health checks configured#
kubectl get svc -A -o json | jq -r '
.items[] |
select(.spec.type == "LoadBalancer") |
"\(.metadata.namespace)/\(.metadata.name): externalTrafficPolicy=\(.spec.externalTrafficPolicy)"'Pass: Load balancer services exist and have appropriate externalTrafficPolicy (usually Local for preserving source IP).
Observability#
Metrics collection working#
kubectl top nodes
kubectl top pods -n app-production
Pass: Both commands return current CPU and memory usage data. If metrics-server is not installed, both commands will fail.
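If metrics-server is missing, the upstream manifest is the usual install path (this URL is the one documented in the metrics-server README at the time of writing; verify it against your cluster version and any provider-specific variant):
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl get deployment metrics-server -n kube-system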
Logging pipeline functional#
# Check that log collector DaemonSet is running on all nodes
kubectl get daemonset -n monitoring -l app.kubernetes.io/name=promtail
# Or for fluent-bit:
kubectl get daemonset -n monitoring -l app.kubernetes.io/name=fluent-bit
# Verify logs are queryable in Loki/your logging backend
# Through Grafana Explore: query {namespace="app-production"} and confirm results appear
Pass: DaemonSet has desired=available on all nodes. Recent logs are queryable.
Alerting configured and tested#
# Check Alertmanager has receivers configured
kubectl get secret alertmanager-kube-prometheus-alertmanager -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d | head -20
# Check for active alerts
kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring &
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head
Pass: Alertmanager config shows at least one non-default receiver (Slack, PagerDuty, etc.). Test alert was received by the team.
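To exercise the full path end to end, a synthetic alert can be posted to Alertmanager's v2 API (the alertname and severity labels below are arbitrary values for the drill; assumes the port-forward above is still running):
curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"ProductionReadinessTest","severity":"info"}}]'
Then confirm the alert reached the team's Slack channel or pager.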
Reliability#
Backup strategy implemented and tested#
# Check Velero status
velero backup get
velero schedule get
Pass: At least one scheduled backup exists. The most recent backup status is Completed. A restore test has been performed in the last 30 days.
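A restore drill might look like this sketch (backup name and namespace are placeholders; Velero skips resources that already exist, so restore into a scratch namespace or cluster for a meaningful test):
velero backup create readiness-drill --include-namespaces app-production --wait
velero restore create --from-backup readiness-drill
velero restore get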
Horizontal auto-scaling configured#
kubectl get hpa -A
Pass: Critical workloads have HPA configured with appropriate min/max replicas and target metrics.
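An HPA can be created imperatively as a starting point (placeholders; requires metrics-server from the Observability section, and thresholds should come from load testing rather than defaults):
kubectl autoscale deployment/<name> -n <namespace> --min=2 --max=10 --cpu-percent=70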
Node auto-scaling configured#
# Check Cluster Autoscaler or Karpenter
kubectl get pods -A | grep -E 'cluster-autoscaler|karpenter'
Pass: Node auto-scaler is running and configured for the node groups that host application workloads. (Karpenter typically runs in its own namespace, so search all namespaces.)
Operations#
GitOps or deployment pipeline working#
kubectl get applications -n argocd # ArgoCD
# or
flux get kustomizations # Flux
Pass: Applications are synced and healthy. Last sync was recent (within expected interval).
Rollback procedure tested#
# Verify rollback capability
kubectl rollout history deployment/<name> -n app-production
Pass: Deployment revision history exists with at least 2 entries. Team has documented and tested the rollback procedure. revisionHistoryLimit is set to a reasonable number (10 is the default).
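The rollback itself, for reference during a drill (the revision number is illustrative; pick one from the history output above):
kubectl rollout undo deployment/<name> -n app-production --to-revision=2
kubectl rollout status deployment/<name> -n app-production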
Scoring#
Count passing items from the 29 checks above:
| Score | Assessment |
|---|---|
| 28-29 items pass | Production ready |
| 24-27 items pass | Near ready – address gaps before launch |
| 20-23 items pass | Significant gaps – schedule a hardening sprint |
| Below 20 | Not production ready – major work required |
Generate a report listing every failing item, its risk level (critical/high/medium/low), and the specific remediation step. Critical items (cluster health, RBAC, network policies) must be fixed before go-live. High items should be fixed within the first week. Medium and low items can be tracked in a backlog.