# GKE Setup and Configuration
GKE is Google’s managed Kubernetes service. The two major decisions when creating a cluster are the mode (Standard vs Autopilot) and the networking model (VPC-native is the default for new clusters and the recommended choice). Everything else – node pools, release channels, Workload Identity – layers on top of those choices.
## Standard vs Autopilot
Standard mode gives you full control over node pools, machine types, and node configuration. You manage capacity, pay per node (whether pods are using the resources or not), and can run DaemonSets, privileged containers, and host-network pods.
Autopilot mode manages nodes entirely. Google provisions and scales nodes automatically based on your pod specs, and you pay per pod resource request, not per node. Autopilot enforces security best practices: no privileged containers and no host-network pods. DaemonSets are supported, but each DaemonSet pod is billed like any other pod. Autopilot also mutates your pod specs – if you request 100m CPU, Autopilot may bump it to 250m to fit its scheduling model.
Use Autopilot when you want minimal operational overhead and your workloads are standard stateless services. Use Standard when you need privileged or host-network workloads, custom kernel parameters, specific machine types or node images, fine-grained control over GPU node pools, or Windows nodes.
## Creating a Cluster with gcloud
Standard mode:

```bash
gcloud container clusters create my-cluster \
  --region us-central1 \
  --num-nodes 2 \
  --machine-type e2-standard-4 \
  --release-channel regular \
  --enable-ip-alias \
  --workload-pool=my-project.svc.id.goog \
  --enable-autorepair \
  --enable-autoupgrade \
  --enable-autoscaling --min-nodes 1 --max-nodes 5 \
  --project my-project
```

Autopilot mode:
```bash
gcloud container clusters create-auto my-autopilot-cluster \
  --region us-central1 \
  --release-channel regular \
  --project my-project
```

Autopilot enables VPC-native networking, Workload Identity, Shielded Nodes, and Secure Boot by default. With Standard, you opt into these individually.
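In either mode, the next step is to point kubectl at the new cluster by fetching credentials (cluster name and region here match the Standard example above):

```bash
gcloud container clusters get-credentials my-cluster \
  --region us-central1 \
  --project my-project
```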
## Node Pools
Node pools let you run different machine types in the same cluster. Common pattern: a general-purpose pool for most workloads and a specialized pool for GPU or high-memory jobs.
```bash
gcloud container node-pools create high-mem-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n2-highmem-8 \
  --num-nodes 1 \
  --enable-autoscaling --min-nodes 0 --max-nodes 3 \
  --spot \
  --node-taints workload-type=high-mem:NoSchedule
```

The --spot flag uses Spot VMs (formerly preemptible), which cost 60-91% less but can be reclaimed with 30 seconds' notice. Always taint specialized pools so general workloads do not accidentally land there. Pods targeting the pool need a matching toleration plus a node selector or affinity rule, as shown in the sketch below.
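A minimal scheduling sketch for such a pod. The pod name, image, and resource sizes are illustrative; cloud.google.com/gke-nodepool is the label GKE applies to every node with its pool name:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: high-mem-job            # illustrative name
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: high-mem-pool   # GKE-applied node pool label
  tolerations:
    - key: workload-type        # matches the taint set on the pool
      operator: Equal
      value: high-mem
      effect: NoSchedule
  containers:
    - name: worker
      image: busybox            # placeholder image
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "4"              # illustrative sizing for an n2-highmem-8 node
          memory: "32Gi"
EOF
```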
## Release Channels
GKE release channels control the Kubernetes version and how quickly upgrades happen:
- Rapid: newest Kubernetes version, earliest access to features, least stable
- Regular: 2-3 months behind Rapid, good balance for most production workloads
- Stable: 4-5 months behind Rapid, maximum stability, fewest surprises
```bash
# Check available versions per channel
gcloud container get-server-config --region us-central1 --format="yaml(channels)"
```

You cannot pin to a specific minor version long-term. GKE auto-upgrades within your channel. If you need to delay an upgrade, use maintenance windows and exclusions.
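A hedged sketch of both; the timestamps, recurrence rule, and exclusion name are placeholders:

```bash
# Recurring maintenance window: weekends, 03:00-07:00 UTC (RFC 5545 recurrence)
gcloud container clusters update my-cluster \
  --region us-central1 \
  --maintenance-window-start 2024-01-06T03:00:00Z \
  --maintenance-window-end 2024-01-06T07:00:00Z \
  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU"

# Exclusion: block upgrades during a change freeze (placeholder dates)
gcloud container clusters update my-cluster \
  --region us-central1 \
  --add-maintenance-exclusion-name change-freeze \
  --add-maintenance-exclusion-start 2024-11-20T00:00:00Z \
  --add-maintenance-exclusion-end 2024-12-05T00:00:00Z
```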
## Private Clusters
Private clusters give nodes only internal IP addresses, preventing direct internet access:
```bash
gcloud container clusters create private-cluster \
  --region us-central1 \
  --enable-private-nodes \
  --enable-private-endpoint \
  --master-ipv4-cidr 172.16.0.0/28 \
  --enable-master-authorized-networks \
  --master-authorized-networks 10.0.0.0/8 \
  --enable-ip-alias \
  --workload-pool=my-project.svc.id.goog
```

--enable-private-endpoint makes the control plane accessible only via internal IP (no public endpoint). --master-authorized-networks restricts which CIDRs can reach the API server. If you enable the private endpoint, you must run kubectl from within the VPC (or via a bastion host, VPN, or Cloud Interconnect). For private nodes to pull images from the internet, you need Cloud NAT on the subnet.
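A minimal Cloud NAT sketch for that last point, assuming the cluster's nodes live on the default VPC network; the router and NAT names are placeholders:

```bash
# Cloud Router in the same region and VPC as the cluster's subnet
gcloud compute routers create nat-router \
  --network default \
  --region us-central1

# NAT gateway so private nodes can pull images and reach external APIs
gcloud compute routers nats create nat-config \
  --router nat-router \
  --region us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges
```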
## Workload Identity
Workload Identity is the recommended way to give pods access to Google Cloud APIs. It binds a Kubernetes service account to a Google Cloud service account, eliminating the need for exported service account keys.
```bash
# Create the Google Cloud service account (GSA)
gcloud iam service-accounts create my-app-sa \
  --display-name "My App Service Account"

# Grant the Kubernetes service account (KSA) permission to impersonate the GSA
gcloud iam service-accounts add-iam-policy-binding \
  my-app-sa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[my-namespace/my-ksa]"

# Annotate the Kubernetes service account
kubectl annotate serviceaccount my-ksa \
  --namespace my-namespace \
  iam.gke.io/gcp-service-account=my-app-sa@my-project.iam.gserviceaccount.com
```

Pods using my-ksa now automatically get credentials for my-app-sa@my-project.iam.gserviceaccount.com. No key files, no mounted secrets.
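One way to verify the binding is to run a throwaway pod under the KSA and check which identity the metadata server hands out; the pod name and image here are illustrative:

```bash
kubectl run wi-test \
  --image=google/cloud-sdk:slim \
  --namespace my-namespace \
  --overrides='{"spec": {"serviceAccountName": "my-ksa"}}' \
  -it --rm --restart=Never \
  -- gcloud auth list
# The active account should be my-app-sa@my-project.iam.gserviceaccount.com
```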
## Terraform Configuration
For infrastructure-as-code, the google_container_cluster and google_container_node_pool resources are the standard approach:
resource "google_container_cluster" "primary" {
name = "my-cluster"
location = "us-central1"
release_channel {
channel = "REGULAR"
}
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = "172.16.0.0/28"
}
# Remove default node pool and manage separately
remove_default_node_pool = true
initial_node_count = 1
}
resource "google_container_node_pool" "general" {
name = "general"
cluster = google_container_cluster.primary.id
location = "us-central1"
autoscaling {
min_node_count = 1
max_node_count = 5
}
node_config {
machine_type = "e2-standard-4"
oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
workload_metadata_config {
mode = "GKE_METADATA"
}
}
}Always remove the default node pool and create explicit ones. The default pool cannot be fully customized after creation and leads to Terraform drift.
## GKE Add-ons
Config Connector lets you manage Google Cloud resources (Cloud SQL, Pub/Sub, GCS buckets) as Kubernetes custom resources. Enable it on the cluster and apply YAML to create cloud resources.
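As a hedged sketch of what that looks like, assuming the add-on has already been enabled (commands below) and Config Connector has been configured with an appropriately privileged service account, here is a Pub/Sub topic declared as a Kubernetes resource; the topic name and project ID are placeholders:

```bash
kubectl apply -f - <<'EOF'
apiVersion: pubsub.cnrm.cloud.google.com/v1beta1
kind: PubSubTopic
metadata:
  name: my-topic                                  # placeholder topic name
  annotations:
    cnrm.cloud.google.com/project-id: my-project  # project that owns the topic
EOF
```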
Backup for GKE provides cluster-level backup and restore. Enable the add-on, create a BackupPlan, and schedule backups. It captures both cluster state and persistent volume data.
Enable add-ons at cluster creation or after:
```bash
gcloud container clusters update my-cluster \
  --region us-central1 \
  --update-addons ConfigConnector=ENABLED

gcloud container clusters update my-cluster \
  --region us-central1 \
  --update-addons BackupRestore=ENABLED
```
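With the Backup for GKE agent enabled, a scheduled BackupPlan can be created from the CLI. A sketch of a nightly plan, where the plan name, schedule, and retention are placeholders:

```bash
gcloud container backup-restore backup-plans create nightly-backup \
  --project my-project \
  --location us-central1 \
  --cluster projects/my-project/locations/us-central1/clusters/my-cluster \
  --all-namespaces \
  --include-secrets \
  --include-volume-data \
  --cron-schedule "0 3 * * *" \
  --backup-retain-days 30
```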