Kubernetes Production Readiness Checklist#

This checklist is designed for agents to audit a Kubernetes cluster before production workloads run on it. Every item includes the verification command and what a passing result looks like. Work through each category sequentially. A failing item in Cluster Health should be fixed before checking Workload Configuration.


Cluster Health#

These are non-negotiable. If any of these fail, stop and fix them before evaluating anything else.

All nodes in Ready state#

kubectl get nodes -o wide

Pass: Every node shows STATUS: Ready. No NotReady, SchedulingDisabled, or Unknown.

If failing: Check kubelet logs on the affected node (journalctl -u kubelet -n 50). Common causes: expired certificates, disk pressure, memory pressure.
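
If the node status alone is not conclusive, a quick follow-up (a sketch using only the standard node conditions) lists any pressure condition currently firing:

kubectl get nodes -o json | jq -r '
  .items[] |
  .metadata.name as $node |
  .status.conditions[] |
  select(.type != "Ready" and .status == "True") |
  "\($node): \(.type) -- \(.message)"'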

System pods healthy#

kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

Pass: Empty output. Every system pod is Running or Succeeded (for completed Jobs).

If failing: kubectl describe pod <pod> -n kube-system to check events. CoreDNS and kube-proxy failures are critical blockers.

DNS resolution working#

# Pod-to-service resolution
kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup kubernetes.default.svc.cluster.local

# Pod-to-external resolution
kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup google.com

Pass: Both resolve with valid IP addresses. Internal resolution returns the cluster service IP. External resolution returns a public IP.

If failing: Check CoreDNS pods and ConfigMap. See the DNS debugging knowledge article for detailed troubleshooting.
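
A quick starting point, assuming the standard CoreDNS deployment (pods labeled k8s-app=kube-dns and a coredns ConfigMap in kube-system):

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get configmap coredns -n kube-system -o yaml | head -30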

Cluster version matches target#

kubectl version
# Note: older guides use "kubectl version --short"; the flag was removed in recent kubectl releases.

Pass: Server version matches your target release (e.g., v1.29.x). Not running an alpha, beta, or end-of-life version. The version is within the supported window (N-2 minor releases from latest stable).

etcd health verified#

# On managed clusters (EKS, GKE, AKS), etcd is managed by the provider -- skip this
# On self-managed clusters:
kubectl exec -n kube-system etcd-<node-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

Pass: 127.0.0.1:2379 is healthy: successfully committed proposal.


Workload Configuration#

Check every Deployment, StatefulSet, and DaemonSet that will run production traffic.

All containers have resource requests AND limits#

# Find containers without requests or limits
kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.resources.requests == null or .resources.limits == null) |
  "\($ns)/\($pod)/\(.name): missing requests or limits"'

Pass: Empty output. Every container has requests.cpu, requests.memory, limits.cpu, and limits.memory set.

Why it matters: Without requests, the scheduler cannot make placement decisions. Without limits, a single pod can consume all node resources.
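
Remediation sketch, assuming a hypothetical my-app Deployment in the app-production namespace; the CPU and memory values are placeholders to tune per workload:

kubectl set resources deployment my-app -n app-production \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=500m,memory=1Gi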

Liveness and readiness probes configured#

kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.readinessProbe == null or .livenessProbe == null) |
  "\($ns)/\($pod)/\(.name): missing probes"'

Pass: No application containers listed. System containers (kube-proxy, CNI agents) may legitimately lack probes.

Common mistake: Setting livenessProbe and readinessProbe to the same endpoint and timing. The liveness probe should be more lenient (higher failureThreshold) because a liveness failure restarts the container.
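
A sketch of that split, applied as a strategic-merge patch; the deployment name, container name, port, and /healthz path are placeholders for your own workload:

kubectl patch deployment my-app -n app-production --patch '
spec:
  template:
    spec:
      containers:
      - name: my-app
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          failureThreshold: 6
'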

Pod anti-affinity for multi-replica deployments#

kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.replicas > 1) |
  select(.spec.template.spec.affinity.podAntiAffinity == null) |
  "\(.metadata.namespace)/\(.metadata.name): replicas=\(.spec.replicas) but no pod anti-affinity"'

Pass: No multi-replica deployments without anti-affinity. All replicas should spread across nodes.
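
One way to add spreading, sketched as a patch (the deployment name and app label are placeholders); topologySpreadConstraints are a reasonable alternative if you prefer them:

kubectl patch deployment my-app -n app-production --patch '
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: my-app
'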

PodDisruptionBudgets for critical services#

# List deployments with no corresponding PDB
kubectl get pdb -A -o json | jq -r '.items[].spec.selector.matchLabels' > /tmp/pdb-selectors.json
kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.replicas > 1) |
  "\(.metadata.namespace)/\(.metadata.name)"'
# Cross-reference manually: every critical multi-replica deployment listed above should be covered
# by one of the selectors saved in /tmp/pdb-selectors.json

Pass: Every critical multi-replica deployment has a PDB with minAvailable or maxUnavailable set.
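
A minimal PDB sketch for a hypothetical my-app Deployment; choose minAvailable or maxUnavailable based on how many replicas the service can afford to lose:

kubectl apply -n app-production -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
EOF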

Graceful shutdown handling#

kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.template.spec.terminationGracePeriodSeconds == null or
         .spec.template.spec.terminationGracePeriodSeconds == 30) |
  "\(.metadata.namespace)/\(.metadata.name): using default terminationGracePeriodSeconds (30s)"'

Pass: Critical services have a terminationGracePeriodSeconds appropriate for their shutdown behavior. Services with long-running connections or background jobs need longer than the 30s default.
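
A sketch of a longer drain window plus a short preStop pause (names and values are placeholders, and the preStop exec assumes the image ships a sleep binary):

kubectl patch deployment my-app -n app-production --patch '
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: my-app
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "5"]
'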

Image tags are pinned (no :latest)#

kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.image | test(":latest$") or (test(":") | not)) |
  "\($ns)/\($pod): \(.image)"'

Pass: Empty output. Every image uses a specific tag or digest (e.g., nginx:1.25.3 or nginx@sha256:...). Never :latest in production.

Images from trusted registries only#

kubectl get pods -A -o json | jq -r '
  .items[] |
  .spec.containers[] |
  .image' | sort -u | grep -v -E '^(registry\.company\.com|gcr\.io/my-project|[0-9]+\.dkr\.ecr)'

Pass: All images come from your organization’s approved registries. No images from Docker Hub public repositories in production.


Security#

RBAC configured (no unnecessary cluster-admin)#

kubectl get clusterrolebindings -o json | jq -r '
  .items[] |
  select(.roleRef.name == "cluster-admin") |
  "\(.metadata.name): \(.subjects // [] | map(.name) | join(", "))"'

Pass: Only system accounts and a single ops-team binding have cluster-admin. No individual user accounts, no CI/CD service accounts with cluster-admin.
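
A useful spot check, with a hypothetical CI service account name substituted for your own; the expected answer is "no":

kubectl auth can-i '*' '*' --as=system:serviceaccount:ci:deployer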

Pod Security Standards enforced#

kubectl get namespaces -o json | jq -r '
  .items[] |
  select(.metadata.labels["pod-security.kubernetes.io/enforce"] != null) |
  "\(.metadata.name): enforce=\(.metadata.labels["pod-security.kubernetes.io/enforce"])"'

Pass: All application namespaces have at least baseline enforcement. The restricted level is applied as warn or audit on production namespaces.
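
Example of applying that combination to a production namespace (the namespace name is a placeholder):

kubectl label namespace app-production \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted \
  --overwrite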

Network policies in place (default deny)#

kubectl get networkpolicy -A -o json | jq -r '
  .items[] |
  select(.spec.podSelector == {} or .spec.podSelector.matchLabels == null) |
  "\(.metadata.namespace)/\(.metadata.name)"'

Pass: Every application namespace appears in the output above, meaning it has a default-deny NetworkPolicy with an empty podSelector.
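
A minimal default-deny sketch to apply per application namespace (the namespace and policy name are placeholders):

kubectl apply -n app-production -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF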

Service accounts not using default SA#

kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.metadata.namespace != "kube-system") |
  select(.spec.serviceAccountName == "default" or .spec.serviceAccountName == null) |
  "\(.metadata.namespace)/\(.metadata.name): using default ServiceAccount"'

Pass: No application pods use the default ServiceAccount. Each workload has its own SA with minimal permissions.
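
Remediation sketch with hypothetical names: create a dedicated ServiceAccount and point the workload at it.

kubectl create serviceaccount my-app -n app-production
kubectl patch deployment my-app -n app-production --patch '
spec:
  template:
    spec:
      serviceAccountName: my-app
'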

Containers run as non-root#

kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.securityContext.runAsNonRoot != true) |
  select(.securityContext.runAsUser == null or .securityContext.runAsUser == 0) |
  "\($ns)/\($pod)/\(.name): may run as root"'

Pass: All application containers have runAsNonRoot: true or an explicit non-zero runAsUser.
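
Note that the query above only inspects container-level securityContext, so workloads that set runAsNonRoot at the pod level show up as false positives. A pod-level sketch (names, UID, and GID are placeholders; the UID must be valid for the image):

kubectl patch deployment my-app -n app-production --patch '
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
'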

Read-only root filesystem where possible#

kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.securityContext.readOnlyRootFilesystem != true) |
  "\($ns)/\($pod)/\(.name): writable root filesystem"'

Pass: Most application containers use readOnlyRootFilesystem: true with emptyDir mounts for any paths that need writes (e.g., /tmp).
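
A sketch combining a read-only root filesystem with a writable /tmp (deployment, container, and volume names are placeholders):

kubectl patch deployment my-app -n app-production --patch '
spec:
  template:
    spec:
      volumes:
      - name: tmp
        emptyDir: {}
      containers:
      - name: my-app
        securityContext:
          readOnlyRootFilesystem: true
        volumeMounts:
        - name: tmp
          mountPath: /tmp
'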


Networking#

Ingress TLS configured#

kubectl get ingress -A -o json | jq -r '
  .items[] |
  select(.spec.tls == null or (.spec.tls | length) == 0) |
  "\(.metadata.namespace)/\(.metadata.name): no TLS configured"'

Pass: Every Ingress resource has a tls section with valid secret references.

cert-manager auto-renewal working#

kubectl get certificates -A -o json | jq -r '
  .items[] |
  "\(.metadata.namespace)/\(.metadata.name): ready=\(.status.conditions[] | select(.type=="Ready") | .status) renewal=\(.status.renewalTime)"'

Pass: All certificates show Ready=True and renewalTime is in the future.

Load balancer health checks configured#

kubectl get svc -A -o json | jq -r '
  .items[] |
  select(.spec.type == "LoadBalancer") |
  "\(.metadata.namespace)/\(.metadata.name): externalTrafficPolicy=\(.spec.externalTrafficPolicy)"'

Pass: Load balancer services exist and have appropriate externalTrafficPolicy (usually Local for preserving source IP).


Observability#

Metrics collection working#

kubectl top nodes
kubectl top pods -n app-production

Pass: Both commands return current CPU and memory usage data. If metrics-server is not installed, both commands will fail.
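
If either command fails, confirm that metrics-server (the usual provider of these metrics) is installed and its APIService is available:

kubectl get deployment metrics-server -n kube-system
kubectl get apiservice v1beta1.metrics.k8s.io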

Logging pipeline functional#

# Check that log collector DaemonSet is running on all nodes
kubectl get daemonset -n monitoring -l app.kubernetes.io/name=promtail
# Or for fluent-bit:
kubectl get daemonset -n monitoring -l app.kubernetes.io/name=fluent-bit

# Verify logs are queryable in Loki/your logging backend
# Through Grafana Explore: query {namespace="app-production"} and confirm results appear

Pass: DaemonSet has desired=available on all nodes. Recent logs are queryable.

Alerting configured and tested#

# Check Alertmanager has receivers configured
kubectl get secret alertmanager-kube-prometheus-alertmanager -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d | head -20

# Check for active alerts
kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring &
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head

Pass: Alertmanager config shows at least one non-default receiver (Slack, PagerDuty, etc.). Test alert was received by the team.
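
Optional smoke test, assuming the port-forward above is still running: fire a synthetic alert and confirm it reaches the configured receiver.

curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "ProductionReadinessTest", "severity": "info"}}]'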


Reliability#

Backup strategy implemented and tested#

# Check Velero status
velero backup get
velero schedule get

Pass: At least one scheduled backup exists. The most recent backup status is Completed. A restore test has been performed in the last 30 days.
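
A simple restore drill sketch (the backup name is a placeholder; run the restore against a non-production namespace or cluster). Velero skips resources that already exist, but verify the restored objects explicitly.

velero backup create readiness-drill --wait
velero restore create --from-backup readiness-drill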

Horizontal auto-scaling configured#

kubectl get hpa -A

Pass: Critical workloads have HPA configured with appropriate min/max replicas and target metrics.
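
Example of creating a CPU-based HPA for a hypothetical deployment (min, max, and target utilization are placeholders):

kubectl autoscale deployment my-app -n app-production --min=2 --max=10 --cpu-percent=70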

Node auto-scaling configured#

# Check Cluster Autoscaler or Karpenter
kubectl get pods -n kube-system | grep -E 'cluster-autoscaler|karpenter'

Pass: Node auto-scaler is running and configured for the node groups that host application workloads.


Operations#

GitOps or deployment pipeline working#

kubectl get applications -n argocd    # ArgoCD
# or
flux get kustomizations               # Flux

Pass: Applications are synced and healthy. Last sync was recent (within expected interval).

Rollback procedure tested#

# Verify rollback capability
kubectl rollout history deployment/<name> -n app-production

Pass: Deployment revision history exists with at least 2 entries. Team has documented and tested the rollback procedure. revisionHistoryLimit is set to a reasonable number (10 is the default).
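
Rollback sketch: revert to the previous revision (or pin a specific one with --to-revision) and watch it roll out.

kubectl rollout undo deployment/<name> -n app-production
kubectl rollout status deployment/<name> -n app-production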


Scoring#

Count passing items from the checklist above:

Score              Assessment
30+ items pass     Production ready
25-29 items pass   Near ready – address gaps before launch
20-24 items pass   Significant gaps – schedule a hardening sprint
Below 20           Not production ready – major work required

Generate a report listing every failing item, its risk level (critical/high/medium/low), and the specific remediation step. Critical items (cluster health, RBAC, network policies) must be fixed before go-live. High items should be fixed within the first week. Medium and low items can be tracked in a backlog.