Advanced Kubernetes Debugging#

Every Kubernetes failure follows a pattern, and every pattern has a diagnostic sequence. This guide covers the most common failure modes you will encounter in production, with the exact commands and thought process to move from symptom to resolution.

Systematic Debugging Methodology#

Before diving into specific scenarios, internalize this sequence. It applies to nearly every pod issue:

# Step 1: What state is the pod in?
kubectl get pod <pod> -n <ns> -o wide

# Step 2: What does the full pod spec and event history show?
kubectl describe pod <pod> -n <ns>

# Step 3: What did the application log before it failed?
kubectl logs <pod> -n <ns> --previous --all-containers

# Step 4: Can you get inside the container?
kubectl exec -it <pod> -n <ns> -- /bin/sh

# Step 5: Is the node healthy?
kubectl describe node <node-name>
kubectl top node <node-name>

Each failure mode below follows this pattern, with specific things to look for at each step.

CrashLoopBackOff#

CrashLoopBackOff means the container starts, exits, and Kubernetes restarts it with exponential backoff: 10s, 20s, 40s, 80s, capping at 5 minutes. The pod is not stuck – it is crashing repeatedly.
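
One way to confirm you are in backoff (and see the current delay before the next restart) is the pod's event stream; a quick check, assuming the pod name from the methodology above:

# "Back-off restarting failed container" events show the current delay
kubectl get events -n <ns> --sort-by=.lastTimestamp \
  --field-selector involvedObject.name=<pod>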

Step 1: Read the crash logs#

kubectl logs <pod> -n <ns> --previous

The --previous flag is essential. The current container may have just started and produced no output yet. The previous container’s logs show what happened right before the crash.

Step 2: Check the exit code#

kubectl describe pod <pod> -n <ns>

Find the Last State section under Containers:

Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Mon, 17 Feb 2026 14:22:01 +0000
  Finished:     Mon, 17 Feb 2026 14:22:01 +0000

Exit code interpretation:

Exit Code  Meaning                              Common Cause
0          Success (but pod restarted anyway)   Container completed when it should run forever; check restartPolicy
1          Application error                    Unhandled exception, missing config, failed assertion
137        SIGKILL (128 + 9)                    OOMKilled or external kill signal
139        SIGSEGV (128 + 11)                   Segmentation fault, corrupted binary, architecture mismatch
143        SIGTERM (128 + 15)                   Graceful shutdown requested but treated as failure
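
For the exit code 0 case in particular, confirm the restart policy and which controller owns the pod; a run-to-completion workload belongs in a Job, not a Deployment:

# A restartPolicy of Always on a run-to-completion workload causes endless restarts
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.restartPolicy}'

# Which controller owns the pod (ReplicaSet/Deployment vs. Job)?
kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.ownerReferences[0].kind}'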

Step 3: Common causes and fixes#

Missing config or secrets: The application tries to read an environment variable or mounted file that does not exist. Logs usually show this clearly.

# Verify the secret exists and has the expected keys
kubectl get secret app-config -n <ns> -o jsonpath='{.data}' | jq 'keys'

# Verify the configmap exists
kubectl get configmap app-config -n <ns> -o yaml

Wrong entrypoint or command: The container image’s entrypoint does not exist or the args are wrong.

# Check what command the pod is actually running
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].command}'
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].args}'

Application crashes immediately: When the container exits before you can exec in, either attach a debug container or temporarily override the command so you can investigate:

# Option 1: attach an ephemeral debug container that shares the
# crashing container's process namespace
kubectl debug -it <pod> -n <ns> --image=busybox --target=<container> -- sh

# Option 2: Temporarily change the deployment command
kubectl edit deployment <name> -n <ns>
# Change command to: ["sleep", "infinity"]
# Then exec in and try running the real entrypoint manually

ImagePullBackOff#

The container runtime cannot pull the image. This is almost always a registry, image naming, or authentication problem.

Diagnosis#

kubectl describe pod <pod> -n <ns>

The Events section will contain one of:

Failed to pull image "myregistry.io/app:v1.2.3": rpc error: code = NotFound
Failed to pull image "myregistry.io/app:v1.2.3": unauthorized

Common causes#

Image or tag does not exist: Verify the image is actually in the registry. A typo in the tag is the most common cause.

# With Docker installed and logged in to the registry
docker manifest inspect myregistry.io/app:v1.2.3

# Or use crane or skopeo, which do not need a Docker daemon
crane manifest myregistry.io/app:v1.2.3

Private registry without credentials: Check that imagePullSecrets is configured and the secret contains valid credentials.

# Check if imagePullSecrets is set on the pod
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.imagePullSecrets}'

# Decode and verify the credentials in the secret
kubectl get secret <pull-secret> -n <ns> \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .

If the secret exists but authentication still fails, the token may be expired. Recreate it:

kubectl create secret docker-registry <pull-secret> \
  --docker-server=myregistry.io \
  --docker-username=<user> \
  --docker-password=<token> \
  -n <ns> --dry-run=client -o yaml | kubectl apply -f -

Architecture mismatch: An AMD64 image scheduled onto an ARM64 node (or the reverse) will either fail to pull (no matching manifest for the node's platform) or start and crash immediately, typically with an exec format error or exit code 139. Check the node architecture:

kubectl get node <node> -o jsonpath='{.status.nodeInfo.architecture}'

Then verify the image supports that architecture:

crane manifest myregistry.io/app:v1.2.3 | jq '.manifests[].platform'

OOMKilled (Exit Code 137)#

The container exceeded its memory limit and the kernel killed it. This is the most common cause of exit code 137.

Diagnosis#

kubectl describe pod <pod> -n <ns>

Look for:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

Understanding memory limits#

# Check current memory usage vs. limits
kubectl top pod <pod> -n <ns> --containers

# Check the configured limits
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].resources}'

The output of kubectl top shows actual usage. If it is close to the limit, the container is at risk of being OOMKilled under load.

Fixes#

Increase the memory limit if the application genuinely needs more memory:

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"

Fix the memory leak if usage grows unbounded (a quick way to confirm steady growth is sketched after this list). Common patterns:

  • Unbounded caches or in-memory queues
  • Connection pools that grow but never shrink
  • Goroutine leaks in Go, thread leaks in Java
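
To tell a genuine leak from a one-off spike, sample usage over time and look for monotonic growth. A quick sketch, assuming the metrics server is installed:

# Sample container memory usage every 60 seconds; a leak shows as steady growth
while true; do
  date
  kubectl top pod <pod> -n <ns> --containers
  sleep 60
done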

Node-level OOM: When the node itself is under memory pressure, the kubelet can evict pods even if they have not hit their own limits. Check node conditions:

kubectl describe node <node-name> | grep -A5 Conditions

If MemoryPressure is True, the node is evicting pods to survive. Solutions: increase node size, reduce pod density, or set proper resource requests so the scheduler does not over-commit the node.
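
To find what is driving the pressure, list the pods scheduled on that node and check which ones use the most memory (kubectl top needs the metrics server):

# Pods currently running on the affected node
kubectl get pods -A --field-selector spec.nodeName=<node-name>

# Heaviest memory consumers across the cluster
kubectl top pod -A --sort-by=memory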

Pending Pods#

A Pending pod has been accepted by the API server but cannot be scheduled to a node.

kubectl describe pod <pod> -n <ns>

FailedScheduling events#

Insufficient resources:

0/5 nodes are available: 5 Insufficient cpu, 3 Insufficient memory.

The scheduler cannot find a node with enough unreserved resources. Fix: scale up the cluster, reduce resource requests on other workloads, or right-size the requests on this pod.
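
To see how much of each node is already reserved by requests (which is what the scheduler counts, not actual usage):

# Requested vs. allocatable CPU and memory on a node
kubectl describe node <node-name> | grep -A 8 "Allocated resources"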

PVC not bound:

0/5 nodes are available: 5 pod has unbound immediate PersistentVolumeClaims.

The PVC is waiting for a PV. Check the PVC status and the storage class:

kubectl get pvc -n <ns>
kubectl get storageclass

Common causes: storage class does not exist, no PVs available in the pool, or the PVC requests more storage than any PV offers.
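
The PVC's own events usually name the exact reason, for example a missing storage class or a provisioner error:

# Provisioning errors and binding status show up in the PVC events
kubectl describe pvc <claim-name> -n <ns>

# For statically provisioned storage, check which PVs exist and their phase
kubectl get pv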

Taints and tolerations / node affinity:

0/5 nodes are available: 5 node(s) had taints that the pod didn't tolerate.

Check what taints exist on the nodes:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Verify the pod has matching tolerations, or the node affinity rules match at least one available node.
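
The pod side of that comparison can be pulled out with jsonpath, for example:

# Tolerations declared on the pod
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.tolerations}'

# Node affinity and node selector constraints
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.affinity}'
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.nodeSelector}'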

Max pods per node: Each node has a --max-pods limit (default 110, but cloud providers often set it lower). If all nodes are at capacity, new pods stay Pending.

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.pods}{"\n"}{end}'

Evicted Pods#

Evicted pods were running but the kubelet killed them because the node was under resource pressure.

# Find evicted pods
kubectl get pods -n <ns> --field-selector=status.phase=Failed | grep Evicted

# Check why
kubectl describe pod <evicted-pod> -n <ns>

The eviction message indicates the resource pressure: DiskPressure, MemoryPressure, or PIDPressure.

# Check node conditions
kubectl describe node <node-name> | grep -A10 Conditions

# Clean up evicted pods
kubectl delete pods -n <ns> --field-selector=status.phase=Failed

Prevention: set resource requests accurately so the scheduler does not over-commit nodes. Pods with no requests are the first to be evicted (BestEffort QoS class).
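
To see which QoS class a pod was assigned (and therefore how early it is evicted):

# Guaranteed, Burstable, or BestEffort
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.qosClass}'

# Or across the whole namespace
kubectl get pods -n <ns> -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass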

Stuck Terminating Pods#

A pod that stays in Terminating state is usually blocked by a finalizer or a hung preStop hook.

# Check for finalizers
kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.finalizers}'

# Check if the preStop hook is hanging
kubectl describe pod <pod> -n <ns>
# Look at the "Terminating" timestamp vs. current time

If the finalizer controller is not running (for example, a CRD controller was deleted), remove the finalizer:

kubectl patch pod <pod> -n <ns> -p '{"metadata":{"finalizers":null}}' --type=merge

Force delete as a last resort (this does not guarantee the container stops on the node):

kubectl delete pod <pod> -n <ns> --grace-period=0 --force

Stuck Terminating Namespaces#

A namespace stuck in Terminating usually has resources with finalizers that cannot be cleaned up – often because the controller that handles those finalizers has already been deleted.

# Find all resources still in the namespace
kubectl api-resources --verbs=list --namespaced -o name | \
  xargs -n1 -I{} kubectl get {} -n <ns> --no-headers 2>/dev/null

# For each stuck resource, remove its finalizers
kubectl patch <resource-type> <name> -n <ns> \
  -p '{"metadata":{"finalizers":null}}' --type=merge

The nuclear option is patching the namespace itself to remove its finalizer via the API server directly. This should be a last resort after all resources inside have been cleaned up.
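
For reference, that call goes through the namespace's finalize subresource; a sketch, to run only after everything inside the namespace has been dealt with:

# Strip the namespace finalizer and submit the object to the finalize subresource
kubectl get namespace <ns> -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/<ns>/finalize" -f -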

Network Debugging#

When a pod cannot reach a service or external endpoint, exec into the pod (or use an ephemeral debug container) and test systematically:

# DNS resolution
kubectl exec -it <pod> -n <ns> -- nslookup <service>.<target-ns>.svc.cluster.local

# HTTP connectivity
kubectl exec -it <pod> -n <ns> -- wget -qO- http://<service>.<target-ns>.svc.cluster.local:<port>/health

# Check if endpoints exist for the target service
kubectl get endpointslices -n <target-ns> -l kubernetes.io/service-name=<service>

If DNS works but connections time out, check NetworkPolicies:

kubectl get networkpolicy -n <target-ns>
kubectl describe networkpolicy <policy-name> -n <target-ns>

A NetworkPolicy with an ingress rule that does not include the source pod’s namespace or labels will silently drop traffic. This is the most common cause of “it works from one namespace but not another.”
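
For illustration, an ingress rule that admits traffic from another namespace needs an explicit namespaceSelector. A minimal sketch (the labels, namespace names, and port here are placeholders to adapt):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend-ns
  namespace: <target-ns>
spec:
  podSelector:
    matchLabels:
      app: backend                # the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend   # source namespace
      ports:
        - protocol: TCP
          port: 8080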