EKS Troubleshooting#

EKS failure modes combine Kubernetes problems with AWS-specific issues. Most fall into a handful of categories: IAM permissions, networking/security groups, missing tags, and add-on misconfiguration.

Nodes Not Joining the Cluster#

Symptoms: kubectl get nodes shows fewer nodes than expected. The ASG shows instances running, but they never register with the cluster.

aws-auth ConfigMap Missing Node Role#

This is the most common cause. Worker nodes authenticate to the API server via the aws-auth ConfigMap; if the node IAM role is not mapped there, nodes are rejected silently.

kubectl get configmap aws-auth -n kube-system -o yaml
# Verify the node role ARN appears under mapRoles with groups:
#   system:bootstrappers and system:nodes

# If missing, add it:
eksctl create iamidentitymapping --cluster my-cluster \
  --arn arn:aws:iam::123456789012:role/eks-node-group-role \
  --group system:bootstrappers \
  --group system:nodes \
  --username system:node:{{EC2PrivateDNSName}}

Security Group Rules#

Nodes need outbound port 443 to the control plane (API server), and the control plane needs port 10250 to the nodes (kubelet). Check that traffic is allowed in both directions in the cluster security group:

aws eks describe-cluster --name my-cluster \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId"

AMI and Bootstrap Issues#

Managed node groups handle AMIs automatically. Self-managed nodes must run /etc/eks/bootstrap.sh my-cluster in their user data – if the script never runs or the cluster name is wrong, the node never joins.
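
For self-managed nodes on the EKS-optimized AMI, the bootstrap call normally lives in the instance user data. A minimal sketch (the extra kubelet args are optional and purely illustrative):

#!/bin/bash
set -o errexit
# Join the cluster "my-cluster"; the name must match the EKS cluster exactly
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--node-labels=nodegroup=self-managed'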

If nodes appear but show NotReady, check kubelet logs via SSM (journalctl -u kubelet -f). Common causes: VPC CNI crashing (check aws-node DaemonSet), disk pressure, or memory pressure.
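
Assuming the SSM agent is running on the node, a typical check looks like this (the instance ID is a placeholder):

# Open a shell on the node via SSM
aws ssm start-session --target i-0123456789abcdef0

# On the node: tail the kubelet logs
journalctl -u kubelet --no-pager -n 100

# Back on your workstation: check the VPC CNI DaemonSet and node conditions
kubectl get daemonset aws-node -n kube-system
kubectl describe node <node-name> | grep -A 8 "Conditions:"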

Pods Stuck in Pending#

Run kubectl describe pod <pod-name> and check Events. Common causes:

Insufficient resources: Events show “0/3 nodes are available: 3 Insufficient cpu”. Add nodes or configure Karpenter.

Fargate profile mismatch: Fargate pods only schedule if namespace AND labels match a profile selector exactly. Check with aws eks describe-fargate-profile. Mismatches produce no useful error – just “no nodes available.”
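
A quick way to compare the profile's selectors against your pod (the profile name is a placeholder):

# Inspect the profile's namespace/label selectors
aws eks describe-fargate-profile --cluster-name my-cluster \
  --fargate-profile-name my-profile \
  --query "fargateProfile.selectors"

# Compare against the pod's namespace and labels
kubectl get pod <pod-name> -n <namespace> --show-labels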

VPC CNI IP exhaustion: Nodes have capacity but pods stay Pending. Check aws-node logs for “ipamd: no available IP addresses”:

kubectl logs -n kube-system -l k8s-app=aws-node --tail=50

# Check remaining IPs
aws ec2 describe-subnets --subnet-ids subnet-xxx \
  --query "Subnets[].{ID:SubnetId,Available:AvailableIpAddressCount}"

If subnets are nearly full, enable prefix delegation or add subnets.
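
Prefix delegation is enabled via an environment variable on the aws-node DaemonSet; a minimal sketch (it requires Nitro-based instance types and only affects newly launched nodes):

# Enable prefix delegation on the VPC CNI
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true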

ALB/NLB Not Routing Traffic#

First verify the controller is running: kubectl get deployment -n kube-system aws-load-balancer-controller. Check its logs for IAM errors.
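
For example:

kubectl get deployment -n kube-system aws-load-balancer-controller

# Look for AccessDenied or subnet-discovery errors
kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=50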

Subnet Tagging#

Missing tags are the single most common reason ALBs/NLBs fail to create. Required tags:

  • Public subnets (internet-facing LBs): kubernetes.io/role/elb = 1
  • Private subnets (internal LBs): kubernetes.io/role/internal-elb = 1
  • All subnets: kubernetes.io/cluster/<cluster-name> = shared

aws ec2 describe-subnets --subnet-ids subnet-xxx \
  --query "Subnets[].Tags[?Key=='kubernetes.io/role/elb']"

Target Group Health Checks Failing#

If all targets are unhealthy, no traffic routes. Check with aws elbv2 describe-target-health --target-group-arn <arn>. Common causes: health check path returns non-200, security group blocks ALB-to-pod traffic, or pod listens on wrong port.
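
A query that surfaces the state and reason for each target (the target group ARN is a placeholder):

aws elbv2 describe-target-health --target-group-arn <arn> \
  --query "TargetHealthDescriptions[].{Target:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}"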

Ingress Shows No ADDRESS#

Run kubectl describe ingress my-app -n production and check Events for “Failed to create load balancer” with subnet or IAM errors.
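
For reference, a minimal internet-facing ALB Ingress looks roughly like this (the service name and port are assumptions):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: production
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80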

EBS Volumes Not Attaching#

EBS CSI Driver Not Installed#

EKS 1.23+ requires the EBS CSI driver add-on for EBS-backed PersistentVolumes; the in-tree provisioner no longer handles them. If PVCs stay in Pending:

# Check if the driver is installed
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver

# Install it
aws eks create-addon --cluster-name my-cluster --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::123456789012:role/ebs-csi-role

The driver needs an IAM role with ec2:CreateVolume, ec2:AttachVolume, ec2:DetachVolume, ec2:DeleteVolume, and related permissions.
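
One way to create such a role is with eksctl and the AWS managed policy for the driver; a sketch, assuming an OIDC provider is already associated with the cluster (the role name is a placeholder):

eksctl create iamserviceaccount --cluster my-cluster --namespace kube-system \
  --name ebs-csi-controller-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --role-name ebs-csi-role --role-only --approve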

Availability Zone Mismatch#

EBS volumes are AZ-specific. If a pod is scheduled to us-east-1a but the PersistentVolume was created in us-east-1b, the volume cannot attach.

# Check the PV's AZ
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}'

# Check the pod's node AZ
kubectl get node <node-name> -L topology.kubernetes.io/zone

Fix: use a StorageClass with volumeBindingMode: WaitForFirstConsumer so the volume is created in the same AZ as the pod:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3

DNS Resolution Failures#

CoreDNS Not Running#

kubectl get pods -n kube-system -l k8s-app=kube-dns

If CoreDNS pods are crashing, check logs:

kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

Common cause: CoreDNS tries to reach upstream DNS (the VPC DNS resolver at the VPC CIDR base +2, e.g., 10.0.0.2). If the node security group blocks outbound UDP/TCP port 53 to this address, DNS fails for the entire cluster.
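
To distinguish a blocked VPC resolver from a CoreDNS problem, query the resolver directly from a throwaway pod (the resolver IP below assumes a 10.0.0.0/16 VPC):

# Query the VPC DNS resolver directly, bypassing CoreDNS
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup amazon.com 10.0.0.2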

Pod DNS Not Working But Node DNS Works#

If pods cannot resolve external names but nodes can, check that the kube-dns Service in kube-system has endpoints and that pods have the correct /etc/resolv.conf:

kubectl exec <pod> -- cat /etc/resolv.conf
# Should show: nameserver 172.20.0.10 (the kube-dns ClusterIP)

kubectl get endpoints kube-dns -n kube-system
# Should show CoreDNS pod IPs

CloudWatch Container Insights#

Enable Container Insights for cluster, node, and pod metrics:

aws eks create-addon --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability

This sends metrics to CloudWatch under the ContainerInsights namespace – node CPU, pod memory, request vs capacity, and restart counts. Query container logs with CloudWatch Logs Insights when debugging pod restarts.
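
A sketch of such a query against the Container Insights application log group (the log group name and pod filter are assumptions; the date flags are GNU-style):

aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, log
    | filter kubernetes.pod_name like /my-app/
    | sort @timestamp desc
    | limit 20'

# Fetch results with the returned queryId
aws logs get-query-results --query-id <query-id>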