GCP Terraform Patterns

GCP’s Terraform providers (google and google-beta) follow patterns distinct from both AWS and Azure. The biggest differences: APIs must be explicitly enabled per project, IAM is managed through role bindings (not inline policies), and VPC-native GKE clusters need secondary IP ranges on the subnet. GCP resources also tend to take longer to create and show more eventual consistency.

Projects and API Enablement

Before creating any resource in GCP, the corresponding API must be enabled in the project. This is the most common source of first-time failures.

variable "project_id" {
  type        = string
  description = "GCP project ID (not the project number)"
}

# Enable required APIs
resource "google_project_service" "apis" {
  for_each = toset([
    "compute.googleapis.com",
    "container.googleapis.com",
    "sqladmin.googleapis.com",
    "servicenetworking.googleapis.com",
    "iam.googleapis.com",
    "cloudresourcemanager.googleapis.com",
  ])

  project = var.project_id
  service = each.value

  disable_on_destroy = false  # do not disable API when Terraform destroys
}

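Gotcha: API enablement is eventually consistent. An API can report as enabled before it is fully ready. Either add a short time_sleep (from the hashicorp/time provider) after enablement and make resources depend on it, or at minimum give each resource a depends_on that points at the API enablement: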

resource "time_sleep" "api_warmup" {
  depends_on      = [google_project_service.apis]
  create_duration = "30s"
}

resource "google_container_cluster" "main" {
  depends_on = [time_sleep.api_warmup]
  # ...
}
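
The time_sleep resource is not part of the google provider, so the configuration has to declare the time provider as well. A minimal sketch of the provider setup, with no version constraints because the source does not state any (pin them to whatever your team has tested):

```hcl
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
    # Needed only for the time_sleep warm-up resource
    time = {
      source = "hashicorp/time"
    }
  }
}

provider "google" {
  project = var.project_id
}
```

Because google_project_service.apis uses for_each, the depends_on in time_sleep waits for every API in the set, not just one.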

Gotcha: set disable_on_destroy = false explicitly. When it is true, terraform destroy disables the API, which can break every resource that depends on that API, including resources managed by other Terraform configurations.

IAM Binding Patterns

The provider offers three resource types for project-level IAM, and they differ in how much of the policy each one owns. Using the wrong one silently overwrites permissions.

# google_project_iam_member — ADDITIVE, always safe
# Adds one member to one role. Does not affect other members in that role.
resource "google_project_iam_member" "gke_logging" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}

# google_project_iam_binding — AUTHORITATIVE for the role
# Sets the COMPLETE list of members for a role. Removes anyone not listed.
# DANGEROUS: can silently remove permissions granted by other Terraform configs or manually.
resource "google_project_iam_binding" "editors" {
  project = var.project_id
  role    = "roles/editor"
  members = [
    "user:admin@example.com",
    "serviceAccount:ci@project.iam.gserviceaccount.com",
  ]
  # Anyone else who had roles/editor? Gone.
}

# google_project_iam_policy — AUTHORITATIVE for the ENTIRE project
# Sets ALL IAM bindings for the project. Removes everything not listed.
# EXTREMELY DANGEROUS: can lock you out of the project.
# Almost never use this.

Rule for agents: Always use google_project_iam_member. Never use google_project_iam_binding unless you are certain you control all members of that role. Never use google_project_iam_policy.
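
The examples in this guide reference google_service_account.gke_nodes without defining it. A sketch of one possible definition, together with the additive pattern applied to several roles at once through for_each. The account_id and the role list are illustrative assumptions, not requirements taken from the source:

```hcl
# Hypothetical definition of the node service account used elsewhere in this guide
resource "google_service_account" "gke_nodes" {
  account_id   = "gke-nodes" # assumed ID
  display_name = "GKE node service account"
  project      = var.project_id
}

# One additive grant per role; other members of these roles are left untouched
resource "google_project_iam_member" "gke_node_roles" {
  for_each = toset([
    "roles/monitoring.metricWriter",
    "roles/monitoring.viewer",
    "roles/artifactregistry.reader",
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}
```

Removing a role from the list destroys only that one grant, which is the property that makes google_project_iam_member safe to use alongside other configurations.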

Service Accounts

resource "google_service_account" "app" {
  account_id   = "my-app-sa"
  display_name = "My Application Service Account"
  project      = var.project_id
}

# Grant specific permissions
resource "google_project_iam_member" "app_storage" {
  project = var.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.app.email}"
}

resource "google_project_iam_member" "app_sql" {
  project = var.project_id
  role    = "roles/cloudsql.client"
  member  = "serviceAccount:${google_service_account.app.email}"
}

Gotcha: GCP IAM changes are eventually consistent. Google documents that policy changes generally take effect within about 2 minutes but can take 7 minutes or more. If a resource fails with PERMISSION_DENIED immediately after a role is granted, suspect a propagation delay before assuming the permission is missing.
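
When a later resource needs the new grants to be live at creation time, the same time_sleep idea used for API warm-up can be applied to IAM. A sketch, assuming the hashicorp/time provider is configured. The 120s duration is a guess to tune, not a guarantee:

```hcl
resource "time_sleep" "iam_propagation" {
  # Wait only after both grants have been created
  depends_on = [
    google_project_iam_member.app_storage,
    google_project_iam_member.app_sql,
  ]

  create_duration = "120s" # assumed value; propagation can occasionally take longer
}

# Any resource that runs as the service account can then declare:
#   depends_on = [time_sleep.iam_propagation]
```

This only delays the first apply. If the failure persists across retries, the role or the member string is more likely wrong than slow.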

VPC Networking with Secondary Ranges

GKE requires VPC-native networking with secondary IP ranges for pods and services:

resource "google_compute_network" "main" {
  name                    = "production-vpc"
  auto_create_subnetworks = false
  project                 = var.project_id
}

resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"
  project       = var.project_id
  region        = var.region
  network       = google_compute_network.main.id
  ip_cidr_range = "10.0.0.0/24"    # node IPs

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.1.0.0/16"   # 65K pod IPs
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.2.0.0/20"   # 4K service IPs
  }

  private_ip_google_access = true   # nodes can reach Google APIs without external IP
}

Gotcha: auto_create_subnetworks = false is essential. The default (true) creates a subnet in every region with /20 CIDRs — almost never what you want.

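Gotcha: Secondary range sizing matters. GKE does not hand out pod IPs one at a time; it reserves a fixed block per node, sized at roughly twice max_pods_per_node rounded up to a power of two. With the default 110 pods per node that block is a /24 (256 addresses), so a /16 pods range supports at most 256 nodes, not the ~595 that dividing 65,536 by 110 would suggest.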
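
The capacity arithmetic can be kept next to the subnet definition so that range changes are checked at plan time. A sketch using only built-in functions. The local names are hypothetical, and per_node_pod_prefix = 24 assumes the default of 110 pods per node:

```hcl
locals {
  pods_cidr  = "10.1.0.0/16" # secondary range for pods
  nodes_cidr = "10.0.0.0/24" # primary range for node IPs

  pods_prefix  = tonumber(split("/", local.pods_cidr)[1])  # 16
  nodes_prefix = tonumber(split("/", local.nodes_cidr)[1]) # 24

  # Assumption: GKE reserves a /24 per node at the default 110 max pods
  per_node_pod_prefix = 24

  # Number of /24 blocks that fit in the pods range: 2^(24-16) = 256
  max_nodes_by_pod_range = pow(2, local.per_node_pod_prefix - local.pods_prefix)

  # Addresses in the primary range minus the 4 that GCP reserves per subnet: 252
  max_nodes_by_node_range = pow(2, 32 - local.nodes_prefix) - 4
}

output "max_nodes" {
  description = "Upper bound on cluster size given both ranges"
  value       = min(local.max_nodes_by_pod_range, local.max_nodes_by_node_range)
}
```

With the ranges from the example subnet, the primary /24 (252 usable addresses) is the tighter limit, slightly below the 256 nodes the pods range allows.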

Gotcha: private_ip_google_access = true is required for private GKE nodes to reach Google Container Registry, Cloud APIs, and other Google services without NAT.

GKE Configuration

resource "google_container_cluster" "main" {
  name     = "production"
  project  = var.project_id
  location = var.region  # regional cluster (HA across zones)

  network    = google_compute_network.main.id
  subnetwork = google_compute_subnetwork.gke.id

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  # Remove default node pool and manage separately
  remove_default_node_pool = true
  initial_node_count       = 1

  # Workload Identity
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Private cluster
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false  # allow kubectl from internet (or true for fully private)
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  # Release channel for auto-upgrades
  release_channel {
    channel = "REGULAR"  # RAPID, REGULAR, or STABLE
  }

  # Network policy enforcement
  network_policy {
    enabled  = true
    provider = "CALICO"
  }

  depends_on = [google_project_service.apis]
}

resource "google_container_node_pool" "main" {
  name     = "production-nodes"
  project  = var.project_id
  location = var.region
  cluster  = google_container_cluster.main.name

  initial_node_count = 3

  autoscaling {
    min_node_count = 2
    max_node_count = 10
  }

  node_config {
    machine_type    = "e2-standard-4"
    service_account = google_service_account.gke_nodes.email

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]

    workload_metadata_config {
      mode = "GKE_METADATA"  # required for Workload Identity
    }

    shielded_instance_config {
      enable_secure_boot = true
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}

Gotcha: remove_default_node_pool = true still needs initial_node_count set to at least 1. GKE creates the default pool and then immediately deletes it, so without initial_node_count the cluster cannot be created and Terraform fails.

Gotcha: master_ipv4_cidr_block must be a /28 that does not overlap with any subnet in the VPC. Forgetting this produces a confusing error about CIDR range conflicts.
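
The /28 requirement can be caught before any API call by validating the input. A sketch, assuming the CIDR is passed in as a variable (the variable name is hypothetical). It checks only the syntax and prefix length; overlap with existing subnets still has to be verified separately:

```hcl
variable "master_ipv4_cidr_block" {
  type        = string
  default     = "172.16.0.0/28"
  description = "CIDR for the GKE control plane; must be a /28"

  validation {
    # cidrhost fails on malformed CIDRs; the split checks the prefix length
    condition = (
      can(cidrhost(var.master_ipv4_cidr_block, 0)) &&
      try(split("/", var.master_ipv4_cidr_block)[1] == "28", false)
    )
    error_message = "The master CIDR must be a valid /28 block, for example 172.16.0.0/28."
  }
}
```

The cluster would then use master_ipv4_cidr_block = var.master_ipv4_cidr_block inside private_cluster_config.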

GKE Workload Identity

# GCP service account for the workload
resource "google_service_account" "workload" {
  account_id   = "my-app-workload"
  display_name = "My App Workload Identity"
  project      = var.project_id
}

# Allow the K8s service account to impersonate the GCP service account
resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.workload.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[default/my-app]"
}

# Grant the GCP SA permissions it needs
resource "google_project_iam_member" "workload_storage" {
  project = var.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.workload.email}"
}

# K8s service account annotated with GCP SA
resource "kubernetes_service_account" "app" {
  metadata {
    name      = "my-app"
    namespace = "default"
    annotations = {
      "iam.gke.io/gcp-service-account" = google_service_account.workload.email
    }
  }
}

Gotcha: The member format for Workload Identity binding is serviceAccount:{project}.svc.id.goog[{namespace}/{sa-name}]. The brackets are literal — they are part of the member string, not formatting.
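
Because the namespace and service account name appear both in the member string and in the Kubernetes manifest, a mismatch between them is easy to introduce. One way to keep them in sync is to build the string once from locals. A sketch in which the local names are hypothetical:

```hcl
locals {
  wi_namespace = "default"
  wi_ksa_name  = "my-app"

  # The square brackets are literal characters in the member string
  wi_member = "serviceAccount:${var.project_id}.svc.id.goog[${local.wi_namespace}/${local.wi_ksa_name}]"
}

resource "google_service_account_iam_member" "workload_identity_from_locals" {
  service_account_id = google_service_account.workload.name
  role               = "roles/iam.workloadIdentityUser"
  member             = local.wi_member
}
```

The kubernetes_service_account metadata can then reference local.wi_ksa_name and local.wi_namespace instead of repeating the literals.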

Cloud SQL with Private Networking

# Reserve an IP range for service networking
resource "google_compute_global_address" "private_ip" {
  name          = "sql-private-ip"
  project       = var.project_id
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.main.id
}

# Create the peering connection
resource "google_service_networking_connection" "private_vpc" {
  network                 = google_compute_network.main.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_ip.name]

  depends_on = [google_project_service.apis]
}

resource "google_sql_database_instance" "main" {
  name             = "production-postgres"
  project          = var.project_id
  database_version = "POSTGRES_15"
  region           = var.region

  settings {
    tier              = "db-custom-2-8192"
    disk_size         = 50
    disk_autoresize   = true
    availability_type = "REGIONAL"

    ip_configuration {
      ipv4_enabled    = false           # no public IP
      private_network = google_compute_network.main.id
    }

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "03:00"
    }

    maintenance_window {
      day  = 7  # Sunday
      hour = 3
    }
  }

  deletion_protection = true

  depends_on = [google_service_networking_connection.private_vpc]
}

Gotcha: The service networking connection must exist before Cloud SQL can use private IP. The depends_on is mandatory — without it, Terraform races and the database creation fails.

Gotcha: Cloud SQL instance names must be unique within a project, and a deleted instance's name cannot be reused for up to a week. If you destroy and recreate, use a different name or wait.
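
A common way around the name-reuse window is to append a random suffix, so that every recreation gets a fresh name. A sketch, assuming the hashicorp/random provider is available. The resource and local names are hypothetical:

```hcl
resource "random_id" "db_suffix" {
  byte_length = 4 # yields 8 hex characters
}

locals {
  # Stable for the life of random_id.db_suffix; changes only if it is replaced
  sql_instance_name = "production-postgres-${random_id.db_suffix.hex}"
}
```

The instance would then set name = local.sql_instance_name. Tainting or replacing random_id.db_suffix alongside the instance produces a new name on the next apply.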

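Gotcha: the top-level deletion_protection = true is a guard enforced by the Terraform provider: it makes any apply or destroy that would delete the instance fail. It is separate from both Terraform’s lifecycle { prevent_destroy } and the API-level flag settings.deletion_protection_enabled, which blocks deletion from the console and gcloud as well. Set all of them for production databases.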
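
A sketch of layering the guards on one instance, assuming a google provider version that supports settings.deletion_protection_enabled. The resource name "protected" is hypothetical and the arguments are trimmed to the ones that matter here:

```hcl
resource "google_sql_database_instance" "protected" {
  name             = "production-postgres"
  project          = var.project_id
  database_version = "POSTGRES_15"
  region           = var.region

  # Provider-side guard: Terraform refuses to delete the instance
  deletion_protection = true

  settings {
    tier = "db-custom-2-8192"

    # API-side guard: deletion is also blocked outside Terraform
    deletion_protection_enabled = true
  }

  lifecycle {
    # Plan-time guard: any plan that would destroy this resource errors out
    prevent_destroy = true
  }
}
```

Tearing the instance down intentionally then takes deliberate steps: remove prevent_destroy, set both flags to false, apply, and only then destroy.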

Common GCP Terraform Gotchas

| Gotcha | Symptom | Fix |
| --- | --- | --- |
| API not enabled | googleapi: Error 403: API not enabled | Add google_project_service for the API |
| API propagation delay | PERMISSION_DENIED after enabling API | Add time_sleep or a depends_on chain |
| IAM eventual consistency | Permission denied after granting role | Wait for propagation (usually a minute or two) and retry. Not a Terraform issue. |
| iam_binding overwrites | Other permissions silently removed | Use google_project_iam_member instead of google_project_iam_binding |
| Cloud SQL name reuse | Cannot create instance with recently deleted name | Use unique names or wait up to a week |
| Default network exists | Terraform plan shows unexpected resources | Delete default network or import it |
| GKE secondary ranges required | Cluster creation fails with IP range error | Define secondary ranges on the subnet |
| Private cluster master CIDR | Overlap error with existing ranges | Use a /28 from unused CIDR space (e.g. 172.16.0.0/28) |
| Service networking dependency | Cloud SQL fails without private networking | Add depends_on for service networking connection |
| disable_on_destroy | API disabled on terraform destroy, breaking dependent resources | Set disable_on_destroy = false on all google_project_service |
| Labels vs tags | Metadata set as tags, or firewall targets set as labels, has no effect | Use labels (key-value) for metadata, tags (network tags) for firewall targeting |
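
The labels-versus-tags row is easiest to see on a Compute Engine instance, where both fields sit side by side. A sketch with hypothetical names. The zone suffix "-a", the image, and the port are assumptions, and the source ranges are Google's published health-check ranges:

```hcl
resource "google_compute_instance" "web" {
  name         = "web-1"
  project      = var.project_id
  zone         = "${var.region}-a" # assumes the region has an "-a" zone
  machine_type = "e2-small"

  # Labels: key-value metadata for billing, filtering, and ownership
  labels = {
    env  = "production"
    team = "platform"
  }

  # Network tags: plain strings that firewall rules match on
  tags = ["web"]

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    subnetwork = google_compute_subnetwork.gke.id
  }
}

resource "google_compute_firewall" "allow_health_checks" {
  name      = "allow-health-checks"
  project   = var.project_id
  network   = google_compute_network.main.id
  direction = "INGRESS"

  source_ranges = ["130.211.0.0/22", "35.191.0.0/16"]
  target_tags   = ["web"] # matches the instance's network tag, not its labels

  allow {
    protocol = "tcp"
    ports    = ["8080"]
  }
}
```

Changing the labels has no effect on which firewall rules apply; only the tags list does.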