AKS Troubleshooting#
AKS problems fall into categories: node pool operations stuck or failed, pods not scheduling, storage not provisioning, authentication broken, and ingress not working. Each has Azure-specific causes that generic Kubernetes debugging will not surface.
Node Pool Stuck in Updating or Failed#
Node pool operations (scaling, upgrading, changing settings) can get stuck. The AKS API reports the pool as “Updating” indefinitely or transitions to “Failed.”
# Check node pool provisioning state
az aks nodepool show \
--resource-group myapp-rg \
--cluster-name myapp-aks \
--name workload \
--query provisioningState
# Check the activity log for errors
az monitor activity-log list \
--resource-group myapp-rg \
--query "[?contains(operationName.value, 'Microsoft.ContainerService')].{op:operationName.value, status:status.value, msg:properties.statusMessage}" \
--output table
Common causes and fixes:
VM quota exceeded: The operation fails because your subscription hit the vCPU quota for that VM family in the region. Check with az vm list-usage --location eastus2 -o table. Request a quota increase through the Azure portal.
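To see how close you are to the limit, it helps to filter the usage output down to the family of your node VM size. A minimal sketch, assuming a DSv5-family node pool ('DSv5' is a placeholder for whatever family you actually use):
# Filter vCPU usage to one VM family (replace 'DSv5' with your node pool's family)
az vm list-usage --location eastus2 \
--query "[?contains(name.value, 'DSv5')].{quota:name.localizedValue, used:currentValue, limit:limit}" \
-o table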
Subnet IP exhaustion (Azure CNI classic): The subnet has no IPs left for new nodes. Each node reserves IPs for itself plus its max pod count (default 30). A /24 subnet (251 usable IPs) supports only about 8 nodes with default settings. Solutions: use Azure CNI Overlay instead, or expand the subnet.
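To confirm exhaustion, compare the subnet's address space with the node pool's per-node pod allocation. A sketch, assuming placeholder resource group, VNet, and subnet names:
# How big is the subnet?
az network vnet subnet show \
--resource-group network-rg \
--vnet-name myapp-vnet \
--name aks-subnet \
--query addressPrefix
# How many IPs does each node reserve for pods?
az aks nodepool show \
--resource-group myapp-rg \
--cluster-name myapp-aks \
--name workload \
--query maxPods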
VMSS operation failure: Errors on the underlying VMSS (Virtual Machine Scale Set) block node pool operations. Check the VMSS instance view:
# Get the VMSS name (in the MC_ resource group)
az vmss list --resource-group MC_myapp-rg_myapp-aks_eastus2 --query "[].name" -o tsv
# Check VMSS instances for errors
az vmss get-instance-view --resource-group MC_myapp-rg_myapp-aks_eastus2 --name <vmss-name> \
--query "statuses[?code!='ProvisioningState/succeeded']"Reconcile a failed node pool:
az aks nodepool update \
--resource-group myapp-rg \
--cluster-name myapp-aks \
--name workload
Running an update with no changes forces AKS to reconcile the node pool state. If the pool is truly stuck, contact Azure Support; they can reset the provisioning state on the backend.
Pods Stuck in Pending#
Pending pods mean the scheduler cannot find a node to place them on.
kubectl describe pod <pod-name> -n <namespace>
# Look at the Events section for scheduling failure reasons
Insufficient resources with no autoscaler: The cluster has no nodes with enough CPU/memory. Either add nodes to the node pool or enable the cluster autoscaler:
az aks nodepool update \
--resource-group myapp-rg \
--cluster-name myapp-aks \
--name workload \
--enable-cluster-autoscaler \
--min-count 3 \
--max-count 20
Taint mismatch: System node pools are commonly tainted with CriticalAddonsOnly=true:NoSchedule, and spot node pools automatically carry kubernetes.azure.com/scalesetpriority=spot:NoSchedule. If your pods lack matching tolerations, they cannot schedule there. Check taints with kubectl describe node <node-name> | grep Taints; a toleration sketch follows below.
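A minimal sketch of a pod that tolerates the default AKS spot taint; the pod name and image are placeholders, not anything required by AKS:
# Apply a throwaway pod that tolerates the spot taint (name and image are placeholders)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: spot-toleration-demo
spec:
  containers:
  - name: app
    image: mcr.microsoft.com/azuredocs/aks-helloworld:v1
  tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
EOF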
Topology constraints or affinity rules: Pod topology spread constraints or node affinity can prevent scheduling if the cluster does not have nodes in the required zones or with the required labels.
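To see whether the zones or labels your affinity rules and spread constraints reference actually exist in the cluster, list the nodes with those labels; a quick check along these lines:
# List nodes with their zone and node pool labels
kubectl get nodes -L topology.kubernetes.io/zone -L agentpool
# Dump all labels if the affinity rule uses a custom label
kubectl get nodes --show-labels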
Subnet exhaustion (Azure CNI classic): Even if you have nodes, new pods fail to get IPs. The kubelet logs show errors about IP allocation. Check with kubectl describe node <node-name> and look at Allocated resources – compare pods running vs max pods for that node.
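A rough way to check whether a node is at its pod limit is to compare its pod capacity with what is actually running on it; the node name below is a placeholder:
# Pod capacity for the node (matches the node pool's max pods setting)
kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}{"\n"}'
# Pods currently placed on that node
kubectl get pods -A --field-selector spec.nodeName=<node-name> --no-headers | wc -l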
PVC Stuck in Pending#
Persistent Volume Claims stuck in Pending usually indicate a StorageClass or infrastructure problem.
kubectl describe pvc <pvc-name> -n <namespace>
# Look for events like "waiting for first consumer" or provisioner errors
WaitForFirstConsumer binding mode: The default AKS StorageClasses use WaitForFirstConsumer, meaning the PV is not provisioned until a pod actually consumes the PVC. If the pod is also Pending (due to scheduling issues), the PVC stays Pending too. Fix the pod scheduling problem first.
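You can confirm the binding mode on the StorageClass the PVC references; with WaitForFirstConsumer, a Pending PVC that has no consuming pod is expected behavior, not an error:
# Show provisioner and binding mode for each StorageClass
kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,BINDINGMODE:.volumeBindingMode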
Disk SKU not available: Requesting a Premium_LRS disk on a node VM size that does not support premium storage (like Standard_D2_v3, which lacks the premium-capable "s" designation). The provisioner silently fails. Check VM size capabilities:
az vm list-skus --location eastus2 --size Standard_D4s --query "[].capabilities[?name=='PremiumIO']" -o table
Zone mismatch: AKS provisions Azure Disks in a specific availability zone. If the pod gets scheduled to a node in zone 2 but the disk was created in zone 1, the pod cannot mount it. Disks are zonal resources. Use volumeBindingMode: WaitForFirstConsumer (the default) to avoid this, as it delays disk creation until the pod is scheduled.
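To confirm a zone mismatch, compare the zone pinned on the PersistentVolume's node affinity with the zone of the node the pod landed on; the PV and node names are placeholders:
# Zone constraint recorded on the PV
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity.required}{"\n"}'
# Zone of the node the pod was scheduled to
kubectl get node <node-name> -L topology.kubernetes.io/zone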
StorageClass does not exist: If a PVC references a non-existent StorageClass, it stays Pending with no helpful error. Check kubectl get storageclass and verify the name matches.
# AKS default storage classes
kubectl get storageclass
# managed-csi (Azure Disk, StandardSSD_LRS)
# managed-csi-premium (Azure Disk, Premium_LRS)
# azurefile-csi (Azure Files, Standard_LRS)
# azurefile-csi-premium (Azure Files, Premium_LRS)
Azure AD Authentication Failures#
Symptoms: kubectl commands return Unauthorized or hang waiting for a browser login that never completes.
kubelogin not installed or not configured:
# Install kubelogin
az aks install-cli
# Convert kubeconfig to use kubelogin
kubelogin convert-kubeconfig -l azurecli
# For interactive login, clear cached tokens
kubelogin remove-tokens
Wrong login mode: kubelogin supports multiple modes: azurecli (uses the az login session), devicecode (for headless environments), spn (service principal), and msi (managed identity on Azure VMs). On an Azure VM, use msi. In CI/CD, use spn. At your terminal with the az CLI, use azurecli.
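As a sketch of the non-interactive modes: spn reads the service principal credentials from environment variables, and msi needs no credentials on a VM with a managed identity. The app ID and secret below are placeholders:
# CI/CD: service principal mode
export AAD_SERVICE_PRINCIPAL_CLIENT_ID=<app-id>
export AAD_SERVICE_PRINCIPAL_CLIENT_SECRET=<client-secret>
kubelogin convert-kubeconfig -l spn
# Azure VM with a managed identity
kubelogin convert-kubeconfig -l msi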
Stale kubeconfig: If the cluster was recreated or AAD integration was changed, old kubeconfigs have stale endpoints. Re-fetch with az aks get-credentials --overwrite-existing.
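A typical sequence after the cluster or its AAD integration has changed:
# Pull fresh credentials and re-apply the kubelogin conversion
az aks get-credentials --resource-group myapp-rg --name myapp-aks --overwrite-existing
kubelogin convert-kubeconfig -l azurecli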
User not in admin group: With --disable-local-accounts there is no local admin kubeconfig to fall back on; users must be in the AAD admin group or have explicit Azure RBAC role assignments on the cluster. Check group membership in the Azure portal.
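You can also check role assignments from the CLI; the object ID below is a placeholder for the user or group being debugged:
# List Azure RBAC role assignments scoped to the cluster
CLUSTER_ID=$(az aks show --resource-group myapp-rg --name myapp-aks --query id -o tsv)
az role assignment list --assignee <user-or-group-object-id> --scope "$CLUSTER_ID" -o table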
AGIC Sync Issues#
The Application Gateway Ingress Controller pod watches Ingress resources and configures Application Gateway. When changes do not appear:
# Check AGIC pod logs
kubectl logs -n kube-system -l app=ingress-appgw
# Common errors:
# "IngressClass not matched" -- the Ingress is missing the annotation or ingressClassName
# "failed to update Application Gateway" -- AGIC identity lacks permissions
# "x] Error applying config" -- conflicting configurationsAGIC takes 30-60 seconds to reconcile. If after two minutes changes are not reflected, check: (1) the Ingress has the correct kubernetes.io/ingress.class: azure/application-gateway annotation, (2) the AGIC managed identity has Contributor on the Application Gateway resource, (3) the Application Gateway subnet has no NSG rules blocking AGIC communication.
Node NotReady#
Nodes showing NotReady indicate the kubelet is not communicating with the API server.
kubectl describe node <node-name>
# Check Conditions section: MemoryPressure, DiskPressure, PIDPressure, Ready
# Check kubelet logs via node debug
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
# Inside the debug pod:
chroot /host
journalctl -u kubelet --no-pager -n 100
Common causes: VM extension provisioning failures (the CSE, or Custom Script Extension, that bootstraps the node failed), kubelet certificate expiry, or the node running out of disk space. For extension failures, check the VMSS instance view. For persistent issues, cordon the node, drain it, and delete it; the cluster autoscaler or VMSS will provision a replacement:
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
az vmss delete-instances --resource-group MC_myapp-rg_myapp-aks_eastus2 \
--name <vmss-name> --instance-ids <instance-id>
Quick Reference: az aks Diagnostic Commands#
# Run cluster diagnostics
az aks kollect --resource-group myapp-rg --name myapp-aks --storage-account <storage-account>
# Run kubectl commands on private clusters without network access
az aks command invoke --resource-group myapp-rg --name myapp-aks \
--command "kubectl get pods -A | grep -v Running"
# Show cluster and node pool details
az aks show --resource-group myapp-rg --name myapp-aks -o table
az aks nodepool list --resource-group myapp-rg --cluster-name myapp-aks -o table