EKS Troubleshooting#
EKS failure modes combine Kubernetes problems with AWS-specific issues. Most fall into a handful of categories: IAM permissions, networking/security groups, missing tags, and add-on misconfiguration.
Nodes Not Joining the Cluster#
Symptoms: kubectl get nodes shows fewer nodes than expected. ASG shows instances running, but they never register.
aws-auth ConfigMap Missing Node Role#
The most common cause. Worker nodes authenticate via aws-auth. If the node IAM role is not mapped, nodes are rejected silently.
kubectl get configmap aws-auth -n kube-system -o yaml
# Verify the node role ARN appears under mapRoles with groups:
# system:bootstrappers and system:nodes
# If missing, add it:
eksctl create iamidentitymapping --cluster my-cluster \
--arn arn:aws:iam::123456789012:role/eks-node-group-role \
--group system:bootstrappers \
--group system:nodes \
--username system:node:{{EC2PrivateDNSName}}
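To confirm the mapping took effect, you can list the cluster's identity mappings (cluster and role names below are the same placeholders as above):
eksctl get iamidentitymapping --cluster my-cluster
# The node role should appear with groups system:bootstrappers and system:nodes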
Security Group Rules#
Nodes need port 443 open to the control plane (API server), and the control plane needs port 10250 open to the kubelet on each node. Check that traffic is allowed in both directions in the cluster security group:
aws eks describe-cluster --name my-cluster \
--query "cluster.resourcesVpcConfig.clusterSecurityGroupId"AMI and Bootstrap Issues#
AMI and Bootstrap Issues#
Managed node groups handle AMIs automatically. Self-managed nodes must run /etc/eks/bootstrap.sh my-cluster – if the script is missing or the cluster name is wrong, the node never joins.
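A minimal user-data sketch for a self-managed node, assuming an Amazon Linux 2 EKS-optimized AMI (cluster name and kubelet args are placeholders):
#!/bin/bash
# Runs at boot; joins the node to the named cluster
set -o xtrace
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--node-labels=nodegroup=self-managed'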
If nodes appear but show NotReady, check kubelet logs via SSM (journalctl -u kubelet -f). Common causes: VPC CNI crashing (check aws-node DaemonSet), disk pressure, or memory pressure.
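A typical sequence for digging into a NotReady node, assuming the SSM agent is running and the instance ID is known (placeholders below):
# Open a shell on the node
aws ssm start-session --target i-0abc123def4567890
# On the node: inspect recent kubelet logs
journalctl -u kubelet --no-pager -n 100
# Back on your workstation: check the VPC CNI and node conditions
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
kubectl describe node <node-name> | grep -A 10 Conditions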
Pods Stuck in Pending#
Run kubectl describe pod <pod-name> and check Events. Common causes:
Insufficient resources: Events show “0/3 nodes are available: 3 Insufficient cpu”. Add nodes or configure Karpenter.
Fargate profile mismatch: Fargate pods only schedule if namespace AND labels match a profile selector exactly. Check with aws eks describe-fargate-profile. Mismatches produce no useful error – just “no nodes available.”
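One way to compare the profile's selectors against the pod, assuming a profile named my-profile (a placeholder):
aws eks describe-fargate-profile --cluster-name my-cluster \
  --fargate-profile-name my-profile \
  --query "fargateProfile.selectors"
# The pod's namespace and labels must match one selector exactly
kubectl get pod <pod-name> -n <namespace> --show-labels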
VPC CNI IP exhaustion: Nodes have capacity but pods stay Pending. Check aws-node logs for “ipamd: no available IP addresses”:
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50
# Check remaining IPs
aws ec2 describe-subnets --subnet-ids subnet-xxx \
--query "Subnets[].{ID:SubnetId,Available:AvailableIpAddressCount}"If subnets are nearly full, enable prefix delegation or add subnets.
ALB/NLB Not Routing Traffic#
First verify the controller is running: kubectl get deployment -n kube-system aws-load-balancer-controller. Check its logs for IAM errors.
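For example, to pull recent controller logs:
kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=50
# Look for AccessDenied / UnauthorizedOperation errors from the AWS APIs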
Subnet Tagging#
Missing tags are the single most common reason ALBs/NLBs fail to create. Required tags:
- Public subnets (internet-facing LBs): kubernetes.io/role/elb = 1
- Private subnets (internal LBs): kubernetes.io/role/internal-elb = 1
- All subnets: kubernetes.io/cluster/<cluster-name> = shared
aws ec2 describe-subnets --subnet-ids subnet-xxx \
--query "Subnets[].Tags[?Key=='kubernetes.io/role/elb']"Target Group Health Checks Failing#
Target Group Health Checks Failing#
If all targets are unhealthy, no traffic routes. Check with aws elbv2 describe-target-health --target-group-arn <arn>. Common causes: health check path returns non-200, security group blocks ALB-to-pod traffic, or pod listens on wrong port.
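As a starting point, a query that narrows the output to unhealthy targets and the reason the load balancer reports:
aws elbv2 describe-target-health --target-group-arn <arn> \
  --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.Reason,TargetHealth.Description]" \
  --output table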
Ingress Shows No ADDRESS#
Run kubectl describe ingress my-app -n production and check Events for “Failed to create load balancer” with subnet or IAM errors.
EBS Volumes Not Attaching#
EBS CSI Driver Not Installed#
EKS 1.23+ requires the EBS CSI driver add-on for EBS-backed PersistentVolumes; the in-tree provisioner no longer handles them. If PVCs stay in Pending:
# Check if the driver is installed
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
# Install it
aws eks create-addon --cluster-name my-cluster --addon-name aws-ebs-csi-driver \
--service-account-role-arn arn:aws:iam::123456789012:role/ebs-csi-role
The driver needs an IAM role with ec2:CreateVolume, ec2:AttachVolume, ec2:DetachVolume, ec2:DeleteVolume, and related permissions.
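One way to create that role, assuming IRSA (an IAM OIDC provider) is already configured for the cluster; the role name matches the placeholder above:
eksctl create iamserviceaccount --cluster my-cluster \
  --namespace kube-system --name ebs-csi-controller-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --role-name ebs-csi-role --role-only --approve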
Availability Zone Mismatch#
EBS volumes are AZ-specific. If a pod is scheduled to us-east-1a but the PersistentVolume was created in us-east-1b, the volume cannot attach.
# Check the PV's AZ
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}'
# Check the pod's node AZ
kubectl get node <node-name> -L topology.kubernetes.io/zone
Fix: use a StorageClass with volumeBindingMode: WaitForFirstConsumer so the volume is created in the same AZ as the pod:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
type: gp3
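If you want PVCs without an explicit storageClassName to use this class, one option is to mark it as the cluster default (make sure no other StorageClass carries the annotation):
kubectl patch storageclass ebs-sc -p \
  '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'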
DNS Resolution Failures#
CoreDNS Not Running#
kubectl get pods -n kube-system -l k8s-app=kube-dns
If CoreDNS pods are crashing, check logs:
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
Common cause: CoreDNS tries to reach upstream DNS (the VPC DNS resolver at the VPC CIDR base +2, e.g., 10.0.0.2). If the node security group blocks outbound UDP/TCP port 53 to this address, DNS fails for the entire cluster.
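To separate a CoreDNS problem from a blocked path to the VPC resolver, you can query the resolver directly from a throwaway pod (10.0.0.2 assumes a 10.0.0.0/16 VPC; substitute your CIDR base +2):
kubectl run dns-probe --rm -it --image=busybox --restart=Never -- \
  nslookup amazon.com 10.0.0.2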
Pod DNS Not Working But Node DNS Works#
If pods cannot resolve external names but nodes can, check that the kube-dns Service in kube-system has endpoints and that pods have the correct /etc/resolv.conf:
kubectl exec <pod> -- cat /etc/resolv.conf
# Should show: nameserver 172.20.0.10 (the kube-dns ClusterIP)
kubectl get endpoints kube-dns -n kube-system
# Should show CoreDNS pod IPs
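As a quick end-to-end smoke test through CoreDNS from inside the cluster:
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
# A successful answer means the pod-to-CoreDNS path and CoreDNS itself are working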
CloudWatch Container Insights#
Enable Container Insights for cluster, node, and pod metrics:
aws eks create-addon --cluster-name my-cluster \
--addon-name amazon-cloudwatch-observability
This sends metrics to CloudWatch under the ContainerInsights namespace – node CPU, pod memory, requests vs. capacity, and restart counts. Query container logs with CloudWatch Logs Insights when debugging pod restarts.
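A sketch of such a query against the application log group that Container Insights log collection creates, assuming the add-on's Fluent Bit shipping is enabled (log group name and filters are placeholders; date -d is GNU date):
aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp, kubernetes.pod_name, log
    | filter kubernetes.namespace_name = "production"
    | sort @timestamp desc | limit 50'
# start-query returns a queryId; fetch the results with:
aws logs get-query-results --query-id <query-id>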