Kubernetes Troubleshooting Decision Trees#

Troubleshooting Kubernetes in production is about eliminating possibilities in the right order. Every symptom maps to a finite set of causes, and each cause has a specific diagnostic command. The decision trees below encode that mapping. Start at the symptom, follow the branches, run the commands, and the output tells you which branch to take next.

These trees are designed to be followed mechanically. No intuition required – just execute the commands and interpret the results.

Decision Tree 1 – Pod Won’t Start#

The pod exists but never reaches the Running state. The first thing to check is what phase the pod is in.

kubectl get pod <pod> -n <ns> -o jsonpath='{.status.phase}'
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[*].state}'

Branch: Pending#

The pod has been accepted by the API server but not scheduled to a node.

kubectl describe pod <pod> -n <ns> | grep -A 20 "Events:"

Event says “Insufficient cpu” or “Insufficient memory”: No node has enough allocatable resources. Either scale up the cluster, reduce the pod’s resource requests, or evict lower-priority workloads.

kubectl get nodes -o custom-columns="NAME:.metadata.name,CPU_ALLOC:.status.allocatable.cpu,MEM_ALLOC:.status.allocatable.memory"
kubectl describe node <node> | grep -A 10 "Allocated resources"
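
If the right fix is to shrink the requests, they can be adjusted on the owning workload from the CLI. A minimal sketch, assuming the pod is managed by a Deployment and using illustrative request values:

kubectl set resources deployment <deployment> -n <ns> \
  --requests=cpu=250m,memory=256Mi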

Event says “didn’t match Pod’s node affinity/selector”: The pod has a nodeSelector or nodeAffinity that no available node satisfies.

kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.nodeSelector}'
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.affinity.nodeAffinity}'
kubectl get nodes --show-labels

Event says “didn’t tolerate taint”: The matching nodes have taints that the pod does not tolerate.

kubectl get nodes -o custom-columns="NAME:.metadata.name,TAINTS:.spec.taints[*].key"
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.tolerations}'
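
If the pod is meant to run on the tainted nodes, add a matching toleration to the owning workload. A minimal sketch, assuming a Deployment and a NoSchedule taint (note this patch overwrites any existing tolerations):

kubectl patch deployment <deployment> -n <ns> --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/tolerations", "value": [
    {"key": "<taint-key>", "operator": "Exists", "effect": "NoSchedule"}
  ]}
]'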

Event says “persistentvolumeclaim not found” or “unbound”: The PVC the pod references does not exist or is not yet bound. See the PVC decision tree below.

Branch: ContainerCreating#

The pod is scheduled to a node but the container has not started yet.

kubectl describe pod <pod> -n <ns> | grep -A 5 "State:"
kubectl describe pod <pod> -n <ns> | grep -A 20 "Events:"

Event says “ErrImagePull” or “ImagePullBackOff”: The container image cannot be pulled.

# Check the exact image reference
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].image}'

Causes: the image tag does not exist in the registry, the registry URL is incorrect, imagePullSecrets is missing for a private registry, or the network cannot reach the registry.

kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret <secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
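
For a private registry, the usual fix is to create a docker-registry secret and reference it from the pod or the namespace's default service account. A minimal sketch with placeholder credentials:

kubectl create secret docker-registry <secret> -n <ns> \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password>

# Attach it to the default service account so new pods pick it up automatically
kubectl patch serviceaccount default -n <ns> \
  -p '{"imagePullSecrets": [{"name": "<secret>"}]}'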

Event says “MountVolume.SetUp failed”: A volume mount is failing. Common causes: Secret or ConfigMap referenced does not exist, PVC is not bound, NFS server unreachable, CSI driver not installed.

kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.volumes}' | jq .
kubectl get configmaps -n <ns>
kubectl get secrets -n <ns>
kubectl get pvc -n <ns>

Branch: CrashLoopBackOff#

The container starts, exits, and Kubernetes restarts it with exponential backoff (10s, 20s, 40s, up to 5m).

# Check the previous container's logs -- the current one may have just restarted
kubectl logs <pod> -n <ns> --previous

# Check exit code
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

Exit code 1: Application error. Read the logs. Common causes: missing environment variable, cannot connect to database, configuration file syntax error.

Exit code 137: The container received SIGKILL (137 = 128 + 9), most often because it was OOMKilled by the kernel.

kubectl describe pod <pod> -n <ns> | grep -i oom
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

Fix: increase memory limits or fix the application memory leak.
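
One way to raise the limit without editing the manifest is kubectl set resources. A minimal sketch, assuming the pod belongs to a Deployment and using an illustrative 1Gi limit:

kubectl set resources deployment <deployment> -n <ns> \
  -c <container> --limits=memory=1Gi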

Exit code 139: Segmentation fault (139 = 128 + 11, SIGSEGV). The application crashed. This is a code bug, not a Kubernetes issue.

Exit code 0 with CrashLoopBackOff: The container exited successfully, but Kubernetes expects it to keep running. The container’s main process is exiting immediately. Common causes: the entrypoint script finishes and exits, or the application runs as a background daemon instead of staying in the foreground.
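
To see what the container is actually told to run, compare the command and args in the pod spec against how the image is documented to run in the foreground (a quick check, not a fix):

kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.spec.containers[0].command}{"\n"}{.spec.containers[0].args}{"\n"}'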

Branch: Init:Error or Init:CrashLoopBackOff#

An init container is failing. Init containers run sequentially before the main containers start.

# List init containers and their states
kubectl get pod <pod> -n <ns> -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# Get logs from the failing init container
kubectl logs <pod> -n <ns> -c <init-container-name>

Common causes: init container waiting for a dependency that is not ready (database, config service), migration script failing, permission errors on shared volumes.
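
To test the dependency the init container is waiting for, a throwaway pod in the same namespace can reproduce the lookup. A minimal sketch using a busybox image (the image and tag are an assumption; use whatever utility image your cluster allows):

kubectl run dep-check -n <ns> --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup <dependency-service>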

Decision Tree 2 – Pod Is Running but Not Working#

The pod shows Running, and may even report Ready, but the application is not behaving correctly.

Check 1: Is the pod actually Ready?#

kubectl get pod <pod> -n <ns> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Ready")'

If Ready is False, the readiness probe is failing. Check what the probe is testing:

kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].readinessProbe}' | jq .
kubectl logs <pod> -n <ns> --tail=50

Check 2: Application logs#

kubectl logs <pod> -n <ns> --tail=100 | grep -i -E "error|exception|fatal|panic|warn"

Check 3: Environment variables and mounted config#

kubectl exec <pod> -n <ns> -- env | sort
kubectl exec <pod> -n <ns> -- cat /path/to/config/file
kubectl exec <pod> -n <ns> -- ls -la /path/to/mounted/secrets/

Check 4: Internal connectivity from inside the pod#

kubectl exec <pod> -n <ns> -- wget -qO- http://localhost:<port>/health
kubectl exec <pod> -n <ns> -- nslookup <dependency-service>.<dependency-ns>.svc.cluster.local
kubectl exec <pod> -n <ns> -- wget -qO- --timeout=5 http://<dependency-service>.<dependency-ns>.svc.cluster.local:<port>/health

Decision Tree 3 – Service Returns 502/503#

A client (or ingress controller) gets a 502 Bad Gateway or 503 Service Unavailable.

Step 1: Do endpoints exist?#

kubectl get endpoints <service> -n <ns>

If the endpoints list is empty, no pods match the Service selector.

kubectl get svc <service> -n <ns> -o jsonpath='{.spec.selector}'
kubectl get pods -n <ns> -l <key>=<value>

Check that the selector labels match the pod labels exactly. Label mismatch is the most common cause of empty endpoints.
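
After correcting the labels or the selector, watching the endpoints confirms that pods are being picked up again:

kubectl get endpoints <service> -n <ns> -w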

Step 2: Are the pods Ready?#

Only Ready pods appear in the endpoints list. If pods exist but are not Ready:

kubectl get pods -n <ns> -l <key>=<value> -o wide
kubectl describe pod <pod> -n <ns> | grep -A 5 "Readiness"

Step 3: Are pods overloaded?#

kubectl top pods -n <ns> -l <key>=<value>
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].resources}'

If CPU usage is at or near the limit, pods are being throttled. If memory is near the limit, they may be about to get OOMKilled. Increase limits or add replicas.
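
Both remedies can be applied directly from the CLI. A minimal sketch, assuming a Deployment and illustrative values for the replica count and limits:

kubectl scale deployment <deployment> -n <ns> --replicas=5
kubectl set resources deployment <deployment> -n <ns> \
  --limits=cpu=1,memory=1Gi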

Step 4: Check ingress controller logs#

kubectl logs -n <ingress-ns> -l app.kubernetes.io/name=ingress-nginx --tail=100

Look for upstream connection errors, timeout messages, or “no live upstreams.”

Decision Tree 4 – Can’t Connect to Service#

The application reports connection refused, timeout, or DNS resolution failure.

Step 1: DNS resolution#

kubectl exec <source-pod> -n <ns> -- nslookup <service>.<target-ns>.svc.cluster.local

DNS fails: Check CoreDNS pods are running and healthy.

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

Step 2: Endpoints#

kubectl get endpoints <service> -n <target-ns>

If empty, see the selector mismatch check in Decision Tree 3.

Step 3: Network policies#

kubectl get networkpolicies -n <target-ns>
kubectl get networkpolicies -n <source-ns>

If network policies exist, check that they allow traffic from the source namespace/pod to the target on the required port.

kubectl describe networkpolicy <policy> -n <target-ns>
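
If traffic from the source namespace is being dropped, an explicit allow rule is one fix. A minimal sketch, assuming the target pods carry an app label, the cluster sets the automatic kubernetes.io/metadata.name namespace label, and an illustrative port of 8080:

kubectl apply -n <target-ns> -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-source-ns
spec:
  podSelector:
    matchLabels:
      app: <target-app>
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: <source-ns>
      ports:
        - protocol: TCP
          port: 8080
EOF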

Step 4: Port mapping#

# Service port
kubectl get svc <service> -n <target-ns> -o jsonpath='{.spec.ports}'

# Container port
kubectl get pod <target-pod> -n <target-ns> -o jsonpath='{.spec.containers[0].ports}'

The Service targetPort must match the port the container is actually listening on. A common mistake is setting the Service targetPort to a named port that the pod spec never defines.
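
To compare the two directly (the targetPort on the Service side and the named ports on the container side):

kubectl get svc <service> -n <target-ns> -o jsonpath='{.spec.ports[*].targetPort}'
kubectl get pod <target-pod> -n <target-ns> -o jsonpath='{.spec.containers[*].ports[*].name}'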

Step 5: Is the pod listening?#

kubectl exec <target-pod> -n <target-ns> -- ss -tlnp
kubectl exec <target-pod> -n <target-ns> -- netstat -tlnp

Verify the process is listening on the expected port and on 0.0.0.0 (not 127.0.0.1). Binding to localhost inside a container means nothing outside the container can reach it.

Decision Tree 5 – Node Is NotReady#

kubectl get nodes
kubectl describe node <node>

Check node conditions#

kubectl get node <node> -o jsonpath='{.status.conditions}' | jq .

MemoryPressure=True: The node is running low on memory. Kubelet will start evicting BestEffort and Burstable pods. Check what is consuming memory and whether resource limits are set correctly.

DiskPressure=True: The node filesystem is full. Check container images, logs, and emptyDir volumes consuming disk space.
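
On the node itself, the usual suspects can be checked and pruned. A sketch assuming SSH access and a containerd runtime with crictl installed:

# On the node (if accessible via SSH)
df -h /var/lib/kubelet /var/lib/containerd
journalctl --disk-usage
crictl rmi --prune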

PIDPressure=True: Too many processes on the node. A pod is likely fork-bombing or spawning excessive threads.

NetworkUnavailable=True: The CNI plugin has not configured networking. Check the CNI pod (calico-node, cilium, aws-node) on that node.

Check kubelet#

# On the node (if accessible via SSH)
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" --no-pager

Check cloud provider#

If running on a cloud provider, check whether the underlying instance is healthy. The node may have been terminated by the cloud provider’s autoscaler or preempted (spot/preemptible instance).

Decision Tree 6 – Deployment Stuck During Rollout#

kubectl rollout status deployment/<name> -n <ns>

If the rollout is stuck, check the new ReplicaSet:

kubectl get rs -n <ns> -l app=<name>
kubectl describe rs <new-replicaset> -n <ns>

New pods not starting#

Check if the new pods are Pending or CrashLoopBackOff, then follow Decision Tree 1.

PodDisruptionBudget blocking#

kubectl get pdb -n <ns>
kubectl describe pdb <pdb> -n <ns>

PodDisruptionBudgets block evictions rather than ordinary rolling updates, but if the rollout coincides with node drains or an autoscaler scale-down, a minAvailable set too high relative to the current replica count prevents old pods from being evicted and the rollout stalls.
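
The quickest signal is how many disruptions the PDB currently allows; zero means evictions are blocked:

kubectl get pdb <pdb> -n <ns> -o jsonpath='{.status.disruptionsAllowed}{"\n"}'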

Resource unavailability#

The new pods need resources that are not available. Check the deployment’s maxSurge – if set to 0, new pods cannot be created until old pods are removed, so a rollout whose new pods then fail to start stalls with reduced capacity.

kubectl get deployment <name> -n <ns> -o jsonpath='{.spec.strategy.rollingUpdate}'

Decision Tree 7 – PVC Won’t Bind#

kubectl get pvc -n <ns>
kubectl describe pvc <pvc> -n <ns>

StorageClass exists?#

kubectl get sc
kubectl get pvc <pvc> -n <ns> -o jsonpath='{.spec.storageClassName}'

If the PVC references a StorageClass that does not exist, it will never bind.

Provisioner running?#

kubectl get pods -n kube-system | grep -i provisioner
kubectl get pods -n kube-system | grep -i csi

Access mode supported?#

Some storage backends do not support ReadWriteMany. If the PVC requests RWX but the StorageClass only supports RWO, it will not bind.
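
Compare the access modes the PVC requests with what existing PVs actually offer:

kubectl get pvc <pvc> -n <ns> -o jsonpath='{.spec.accessModes}{"\n"}'
kubectl get pv -o custom-columns="NAME:.metadata.name,MODES:.spec.accessModes,CLAIM:.spec.claimRef.name"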

Zone affinity#

In multi-zone clusters, a PV provisioned in zone-a cannot be used by a pod scheduled in zone-b. Check the node’s zone label and the PV’s zone constraint:

kubectl get pv <pv> -o jsonpath='{.spec.nodeAffinity}'
kubectl get node <node> -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'

Decision Tree 8 – HPA Not Scaling#

kubectl describe hpa <name> -n <ns>

Metrics server running?#

kubectl get pods -n kube-system | grep metrics-server
kubectl top pods -n <ns>

If kubectl top returns “metrics not available,” the metrics server is not working. The HPA cannot function without it.
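
The metrics API registration can also be checked directly; it should report Available=True:

kubectl get apiservice v1beta1.metrics.k8s.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'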

Check HPA events#

kubectl describe hpa <name> -n <ns> | grep -A 20 "Events:"

Look for messages like “unable to get metrics,” “failed to get resource metric,” or “invalid metrics.”

Target metric exists?#

If using custom metrics, verify the custom metrics API is registered:

kubectl get apiservices | grep custom.metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

Already at max replicas?#

kubectl get hpa <name> -n <ns> -o jsonpath='{.spec.maxReplicas}'
kubectl get hpa <name> -n <ns> -o jsonpath='{.status.currentReplicas}'

If currentReplicas equals maxReplicas, the HPA has already scaled to its limit.

Cooldown period active?#

The HPA has stabilization windows to prevent flapping. By default, scale-down uses a 5-minute stabilization window: within that window the HPA acts on the highest recent recommendation, so it will not scale down immediately after a spike.

kubectl get hpa <name> -n <ns> -o jsonpath='{.spec.behavior}'
kubectl get hpa <name> -n <ns> -o jsonpath='{.status.conditions}' | jq .
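
If the default window is too conservative for the workload, it can be shortened through the behavior field. A minimal sketch, assuming an autoscaling/v2 HPA and an illustrative 60-second scale-down window:

kubectl patch hpa <name> -n <ns> --type=merge \
  -p '{"spec": {"behavior": {"scaleDown": {"stabilizationWindowSeconds": 60}}}}'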

Using These Trees Effectively#

Each decision tree follows the same pattern: observe the current state, check events for clues, drill into the specific subsystem, identify the root cause, and apply the fix. When you encounter a Kubernetes issue in production, identify which tree matches your symptom and work through it from top to bottom. Resist the urge to skip steps – the most common troubleshooting mistake is jumping to an assumed cause without confirming the basics first.