GKE Troubleshooting#

GKE adds a layer of Google Cloud infrastructure on top of Kubernetes, which means some problems are pure Kubernetes issues and others are GKE-specific. This guide covers the GKE-specific problems that trip people up.

Autopilot Resource Adjustment#

Autopilot automatically mutates pod resource requests to fit its scheduling model. If you request cpu: 100m and memory: 128Mi, Autopilot may bump the request to cpu: 250m and memory: 512Mi. This affects your billing (Autopilot charges per pod resource request) and can cause unexpected OOMKills, because Autopilot also sets limits equal to the adjusted requests: a generous memory limit in your manifest can silently become a much lower one.

Check what Autopilot actually set:

kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources}'

Compare with your Deployment spec. If they differ, Autopilot adjusted them. The autopilot.gke.io/resource-adjustment annotation on the pod shows why.
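
One quick way to read the before/after values from that annotation (my-pod is a placeholder):

kubectl describe pod my-pod | grep resource-adjustment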

For workloads needing specific compute profiles, request a compute class with a node selector rather than an annotation. GPU pods additionally need a cloud.google.com/gke-accelerator selector, and the GPU count goes in limits (accelerator type and resource sizes here are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/compute-class: "Accelerator"
        cloud.google.com/gke-accelerator: "nvidia-l4"
      containers:
        - name: model
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
            limits:
              nvidia.com/gpu: "1"

Available compute classes include the general-purpose default, Balanced, Scale-Out, Performance, and Accelerator (for GPU workloads). Each maps to different machine families and has different minimum resource requirements.

Node Auto-Repair and Auto-Upgrade Disruptions#

GKE automatically repairs unhealthy nodes and upgrades node versions. Both operations drain pods from the node, which can cause downtime if you are not prepared.

Auto-repair triggers when a node fails health checks for about 10 minutes. GKE drains the node, deletes the VM, and creates a new one. Pods without PodDisruptionBudgets can all be evicted simultaneously.
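
A PodDisruptionBudget caps how many replicas a drain can evict at once, so repairs and upgrades roll through pods gradually. A minimal sketch (the selector is assumed to match your workload's labels):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-frontend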

Auto-upgrade rolls through nodes one at a time (configurable with surge upgrade settings). The default is 1 surge node and 0 max unavailable, meaning GKE creates a new node, drains the old one, then deletes it.

# Control upgrade aggressiveness
gcloud container node-pools update general-pool \
  --cluster my-cluster --region us-central1 \
  --max-surge-upgrade 3 \
  --max-unavailable-upgrade 1

Maintenance windows control when upgrades happen:

gcloud container clusters update my-cluster \
  --region us-central1 \
  --maintenance-window-start 2026-01-01T02:00:00Z \
  --maintenance-window-end 2026-01-01T06:00:00Z \
  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU"

If you are seeing unexpected pod restarts, check if they coincide with node upgrades:

kubectl get events --field-selector reason=Drain --sort-by='.lastTimestamp'
gcloud container operations list --filter="operationType=UPGRADE_NODES" \
  --region us-central1

GKE Ingress Not Working#

GKE Ingress creates a Google Cloud HTTP(S) Load Balancer. When it fails, the symptoms are 502 errors, the Ingress ADDRESS never populating, or health checks failing.

Step 1: Check Ingress status and events.

kubectl describe ingress my-ingress
# Look for events like:
# "Error syncing to GCP: error running load balancer syncing routine"
# "Error during sync: error getting BackendConfig"

Step 2: Check backend health in the Load Balancer.

# Find the backend service
gcloud compute backend-services list --filter="name~k8s"

# Check health status
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global

If all backends are UNHEALTHY, the health check is failing. GKE Ingress health checks use the readiness probe path from your pod spec by default. If you have no readiness probe, the health check hits / on the serving port. If your app does not respond with 200 on that path, everything is unhealthy.
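
The simplest fix is usually to add a readiness probe that points at a real health endpoint, since the Ingress controller derives its health check from it. A sketch of the relevant container fragment (path and port are assumptions for this example):

    spec:
      containers:
        - name: web
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10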

Alternatively, set the health check explicitly with a BackendConfig:

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: app-backend-config
spec:
  healthCheck:
    requestPath: /healthz
    port: 8080
    type: HTTP
---
apiVersion: v1
kind: Service
metadata:
  name: web-frontend
  annotations:
    cloud.google.com/backend-config: '{"default": "app-backend-config"}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: web-frontend

Step 3: Check firewall rules. GKE creates firewall rules to allow health check traffic from Google’s health check ranges (130.211.0.0/22 and 35.191.0.0/16). If these rules are missing or blocked, health checks fail silently.

gcloud compute firewall-rules list --filter="name~gke" --format="table(name,direction,sourceRanges,allowed)"
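
If the rule is missing, recreate it; the rule name, network, and target tag below are placeholders for your own values:

gcloud compute firewall-rules create allow-gclb-health-checks \
  --network my-vpc \
  --direction INGRESS \
  --source-ranges 130.211.0.0/22,35.191.0.0/16 \
  --target-tags gke-my-cluster-node \
  --allow tcp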

NEG sync issues. When using container-native load balancing (NEGs), the NEG controller must sync pod endpoints. If pods are crashing or not becoming Ready, the NEG has no healthy endpoints. Check NEG status:

gcloud compute network-endpoint-groups list
gcloud compute network-endpoint-groups list-network-endpoints NEG_NAME --zone ZONE
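
The cloud.google.com/neg-status annotation on the Service lists the NEG names and zones the controller created (service name taken from the earlier example):

kubectl get service web-frontend -o yaml | grep neg-status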

PVC Issues#

GKE provisions Persistent Volumes through the Compute Engine Persistent Disk CSI driver (pd.csi.storage.gke.io), backed by Compute Engine persistent disks.

Common problem: PVC stuck in Pending.

kubectl describe pvc my-pvc
# Look for events like:
# "waiting for first consumer to be created before binding"
# "could not attach disk: RESOURCE_NOT_FOUND"

WaitForFirstConsumer binding mode means the PV is not created until a pod claims it. This is correct behavior – it ensures the disk is created in the same zone as the pod. If no pod references the PVC, it stays Pending forever.
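
To confirm which binding mode a StorageClass uses (the column names here are just labels):

kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,BINDING:.volumeBindingMode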

Regional persistent disks for HA:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-pd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.gke.io/zone
        values:
          - us-central1-a
          - us-central1-b

Regional PDs replicate across two zones. They cost twice as much but survive a zone failure. The pod can reschedule to the other zone and the PV follows.
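
A claim against that StorageClass looks like any other PVC; only the storageClassName changes (name and size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: regional-pd
  resources:
    requests:
      storage: 100Gi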

IP Exhaustion in VPC-Native Clusters#

Each GKE node consumes a /24 from the pod secondary range (256 IPs). If your pod range is /20 (4096 IPs), you can only have 16 nodes before the range is exhausted, regardless of how many pods are actually running.

Symptoms: nodes fail to join the cluster, node autoscaler refuses to add nodes, errors referencing IP_SPACE_EXHAUSTED.

# Find which secondary range the cluster uses for pod IPs
gcloud container clusters describe my-cluster --region us-central1 \
  --format="value(ipAllocationPolicy.clusterSecondaryRangeName)"

# View the subnet's secondary ranges and their CIDR sizes
gcloud compute networks subnets describe gke-subnet \
  --region us-central1 \
  --format="table(secondaryIpRanges.rangeName,secondaryIpRanges.ipCidrRange)"

Remediation: you cannot resize a secondary range in place. Options are (1) create a new cluster with a larger pod range, (2) add an additional Pod IPv4 range to the cluster (discontiguous multi-Pod CIDR), or (3) reduce per-node pod density with --max-pods-per-node, which shrinks the per-node allocation (for example, a maximum of 32 pods gives each node a /26 instead of a /24); an example of option (3) follows.
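
Because the per-node pod limit is fixed at node-pool creation, this usually means adding a new pool and migrating workloads onto it (the pool name is illustrative):

gcloud container node-pools create low-density-pool \
  --cluster my-cluster --region us-central1 \
  --max-pods-per-node 32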

GKE Logging and Monitoring#

GKE integrates with Cloud Operations (formerly Stackdriver) by default. System and workload logs flow to Cloud Logging, and metrics go to Cloud Monitoring.

# Query container logs
gcloud logging read 'resource.type="k8s_container" AND resource.labels.cluster_name="my-cluster" AND resource.labels.namespace_name="production"' \
  --limit 50 --format json

# Filter by severity
gcloud logging read 'resource.type="k8s_container" AND severity>=ERROR AND resource.labels.cluster_name="my-cluster"' \
  --limit 20

For cost control, you can disable workload logging and only keep system logs:

gcloud container clusters update my-cluster \
  --region us-central1 \
  --logging=SYSTEM

The options are SYSTEM, WORKLOAD, API_SERVER, SCHEDULER, and CONTROLLER_MANAGER. Most production clusters use SYSTEM,WORKLOAD. Adding API_SERVER logs produces high volume and cost but is essential for security auditing.
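
For example, to add API server logs on top of the usual pair:

gcloud container clusters update my-cluster \
  --region us-central1 \
  --logging=SYSTEM,WORKLOAD,API_SERVER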

Cloud Error Reporting automatically groups and deduplicates application errors from container logs. If your application writes structured JSON logs with a severity field and stack traces, Error Reporting picks them up without additional configuration.
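
A minimal example of a structured entry Error Reporting will group, written to stdout as a single line (service name, version, and the stack trace are illustrative):

{"severity": "ERROR", "message": "Unhandled exception\nTraceback (most recent call last):\n  File \"app.py\", line 42, in handle\n    result = 1 / 0\nZeroDivisionError: division by zero", "serviceContext": {"service": "web-frontend", "version": "1.4.2"}}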