Ephemeral Cloud Clusters#

Ephemeral clusters exist for one purpose: validate something, then disappear. They are not staging environments, not shared dev clusters, not long-lived resources that someone forgets to turn off. The operational model is strict – create, validate, destroy – and the entire sequence must be automated so that destruction cannot be forgotten.

The cost of getting this wrong is real. A three-node EKS cluster left running over a weekend costs roughly $15. Left running for a month, $200. Multiply by the number of developers or CI pipelines that create clusters, and forgotten ephemeral infrastructure becomes a significant budget line item. Every template in this article includes auto-destroy mechanisms to prevent this.

The Create-Validate-Destroy Pattern#

Every ephemeral cluster follows the same lifecycle:

  1. Create – Terraform provisions the cluster with minimal configuration. No monitoring stack, no ingress controllers, no persistent storage unless the validation requires it.
  2. Configure – Get kubeconfig, install any test dependencies (a Helm chart being validated, a set of manifests, a database operator).
  3. Validate – Run the actual tests. Helm install succeeds, pods reach Running state, services respond on expected ports, integration tests pass.
  4. Destroy – Terraform destroys everything. No partial cleanup, no orphaned resources.

The critical rule: steps 1 through 4 must execute in a single automated sequence. If step 3 fails, step 4 still runs. If step 2 fails, step 4 still runs. The only acceptable outcome is that the cluster no longer exists when the sequence completes.

#!/bin/bash
set -euo pipefail

# Resolve both arguments to absolute paths so the cleanup trap and the
# validation step still work after we cd into the Terraform directory
CLUSTER_DIR="$(cd "$1" && pwd)"
VALIDATION_SCRIPT="$(cd "$(dirname "$2")" && pwd)/$(basename "$2")"

cleanup() {
  echo "Destroying ephemeral cluster..."
  cd "$CLUSTER_DIR"
  terraform destroy -auto-approve -input=false 2>&1 | tail -20
}
trap cleanup EXIT

cd "$CLUSTER_DIR"
terraform init -input=false
terraform apply -auto-approve -input=false

# Extract kubeconfig
terraform output -raw kubeconfig > /tmp/ephemeral-kubeconfig
export KUBECONFIG=/tmp/ephemeral-kubeconfig

# Wait for nodes to be ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s

# Run validation
bash "$VALIDATION_SCRIPT"

The trap cleanup EXIT is the most important line. It ensures terraform destroy runs regardless of how the script exits – success, failure, or signal.
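
The validation script passed as the second argument is whatever check the cluster exists to run. As a hypothetical example (the path tests/validate.sh is a placeholder), a minimal script can mirror the smoke test used in the provider sections below:

#!/bin/bash
# tests/validate.sh -- hypothetical validation script passed to the wrapper above.
# Schedules a pod on the ephemeral cluster and waits for it to become Ready.
set -euo pipefail

kubectl create namespace validation-test
kubectl run nginx --image=nginx:alpine -n validation-test
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=120s
kubectl delete namespace validation-test
echo "Validation passed"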

Ephemeral EKS on AWS#

Terraform Configuration#

This module creates a minimal EKS cluster with managed node groups. It uses the widely used terraform-aws-modules/eks/aws community module to avoid reinventing VPC and IAM configuration.

# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.region
}

variable "region" {
  default = "us-east-1"
}

variable "cluster_name" {
  default = "ephemeral"
}

variable "ttl_hours" {
  description = "Hours before auto-destroy (used for tagging)"
  default     = 4
}

locals {
  destroy_after = timeadd(timestamp(), "${var.ttl_hours}h")
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "${var.cluster_name}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["${var.region}a", "${var.region}b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = local.destroy_after
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.29"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  cluster_endpoint_public_access = true

  # Module v20 defaults this to false; enabling it grants the identity running
  # Terraform admin access so kubectl works immediately after apply
  enable_cluster_creator_admin_permissions = true

  eks_managed_node_groups = {
    ephemeral = {
      instance_types = ["t3.medium"]
      min_size       = 1
      max_size       = 3
      desired_size   = 2
    }
  }

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = local.destroy_after
  }
}

output "kubeconfig" {
  value = <<-EOT
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        server: ${module.eks.cluster_endpoint}
        certificate-authority-data: ${module.eks.cluster_certificate_authority_data}
      name: ${var.cluster_name}
    contexts:
    - context:
        cluster: ${var.cluster_name}
        user: ${var.cluster_name}
      name: ${var.cluster_name}
    current-context: ${var.cluster_name}
    users:
    - name: ${var.cluster_name}
      user:
        exec:
          apiVersion: client.authentication.k8s.io/v1beta1
          command: aws
          args: ["eks", "get-token", "--cluster-name", "${var.cluster_name}", "--region", "${var.region}"]
  EOT
  sensitive = true
}

Cost Estimate#

EKS control plane: $0.10/hour. Two t3.medium nodes: $0.0416/hour each. NAT gateway: $0.045/hour. Total: approximately $0.23/hour or $5.50/day. The single NAT gateway and two-AZ VPC are the cheapest configuration that still allows EKS to function (EKS requires subnets in at least two AZs).
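
As a quick arithmetic check on those figures, using the hourly rates quoted above:

# Rough hourly and daily cost of the minimal EKS setup:
# control plane + 2x t3.medium + single NAT gateway
awk 'BEGIN {
  hourly = 0.10 + 2 * 0.0416 + 0.045
  printf "hourly: $%.2f  daily: $%.2f\n", hourly, hourly * 24
}'
# -> hourly: $0.23  daily: $5.48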

Apply and Validate#

terraform init -input=false
terraform apply -auto-approve -input=false -var="cluster_name=test-$(date +%s)"

terraform output -raw kubeconfig > /tmp/eph-kubeconfig
export KUBECONFIG=/tmp/eph-kubeconfig

# Validate cluster is functional
kubectl get nodes
kubectl create namespace validation-test
kubectl run nginx --image=nginx:alpine -n validation-test
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=120s
kubectl delete namespace validation-test

Ephemeral GKE on GCP#

GKE Autopilot is the best choice for ephemeral clusters because you pay only for running pods, there are no idle node costs, and you do not need to manage node pools.

Terraform Configuration#

# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

variable "project_id" {
  description = "GCP project ID"
}

variable "region" {
  default = "us-central1"
}

variable "cluster_name" {
  default = "ephemeral"
}

variable "ttl_hours" {
  default = 4
}

resource "google_container_cluster" "ephemeral" {
  name     = var.cluster_name
  location = var.region

  enable_autopilot = true

  release_channel {
    channel = "RAPID"
  }

  resource_labels = {
    environment   = "ephemeral"
    destroy-after = formatdate("YYYY-MM-DD-hh-mm", timeadd(timestamp(), "${var.ttl_hours}h"))
  }

  deletion_protection = false
}

output "kubeconfig" {
  value = <<-EOT
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        server: https://${google_container_cluster.ephemeral.endpoint}
        certificate-authority-data: ${google_container_cluster.ephemeral.master_auth[0].cluster_ca_certificate}
      name: ${var.cluster_name}
    contexts:
    - context:
        cluster: ${var.cluster_name}
        user: ${var.cluster_name}
      name: ${var.cluster_name}
    current-context: ${var.cluster_name}
    users:
    - name: ${var.cluster_name}
      user:
        exec:
          apiVersion: client.authentication.k8s.io/v1beta1
          command: gke-gcloud-auth-plugin
          installHint: "Install gke-gcloud-auth-plugin for kubectl"
  EOT
  sensitive = true
}

Cost Estimate#

GKE Autopilot charges per pod resource: $0.000017/vCPU-second, $0.000002/GB-second. The management fee is $0.10/hour. For a typical validation workload running 2 vCPUs and 4GB RAM for one hour, the cost is approximately $0.22. Autopilot has no idle node costs – if no pods are running, you pay only the management fee.

Apply and Validate#

terraform init -input=false
terraform apply -auto-approve -input=false \
  -var="project_id=my-project" \
  -var="cluster_name=eph-$(date +%s)"

terraform output -raw kubeconfig > /tmp/eph-kubeconfig
export KUBECONFIG=/tmp/eph-kubeconfig

# GKE Autopilot may take a few minutes to schedule pods
kubectl get nodes
kubectl create namespace validation-test
kubectl run nginx --image=nginx:alpine -n validation-test \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"nginx","image":"nginx:alpine","resources":{"requests":{"cpu":"250m","memory":"256Mi"}}}]}}'
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=300s
kubectl delete namespace validation-test

Note the explicit resource requests supplied via --overrides (recent kubectl releases removed the old --requests flag from kubectl run). Autopilot requires resource requests on all pods. Pods without requests get Autopilot's default values, which may not match your expectations.
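
If the inline JSON is awkward to maintain, the same pod can be applied as a manifest from a heredoc – a minimal sketch of the equivalent smoke test:

# Equivalent pod with explicit requests, applied from a heredoc
kubectl apply -n validation-test -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:alpine
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
EOF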

Ephemeral AKS on Azure#

Terraform Configuration#

# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

variable "location" {
  default = "eastus"
}

variable "cluster_name" {
  default = "ephemeral"
}

variable "ttl_hours" {
  default = 4
}

resource "azurerm_resource_group" "ephemeral" {
  name     = "${var.cluster_name}-rg"
  location = var.location

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = timeadd(timestamp(), "${var.ttl_hours}h")
  }
}

resource "azurerm_kubernetes_cluster" "ephemeral" {
  name                = var.cluster_name
  location            = azurerm_resource_group.ephemeral.location
  resource_group_name = azurerm_resource_group.ephemeral.name
  dns_prefix          = var.cluster_name

  default_node_pool {
    name       = "default"
    node_count = 2
    vm_size    = "Standard_B2s"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = timeadd(timestamp(), "${var.ttl_hours}h")
  }
}

output "kubeconfig" {
  value     = azurerm_kubernetes_cluster.ephemeral.kube_config_raw
  sensitive = true
}

Cost Estimate#

Two Standard_B2s nodes: approximately $0.042/hour each. AKS control plane: free (unlike EKS). Total: approximately $0.084/hour or $2.00/day. AKS is the cheapest option for ephemeral clusters because the control plane has no charge.

Apply and Validate#

terraform init -input=false
terraform apply -auto-approve -input=false \
  -var="cluster_name=eph-$(date +%s)"

terraform output -raw kubeconfig > /tmp/eph-kubeconfig
export KUBECONFIG=/tmp/eph-kubeconfig

kubectl get nodes
kubectl create namespace validation-test
kubectl run nginx --image=nginx:alpine -n validation-test
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=120s
kubectl delete namespace validation-test

Auto-Destroy Mechanisms#

The Terraform configurations above tag resources with a DestroyAfter timestamp, but tags alone do not destroy anything. You need an active mechanism to enforce the TTL.

CI-Triggered Destroy#

The simplest approach: the same CI job that creates the cluster also destroys it. The wrapper script at the top of this article demonstrates this. In GitHub Actions:

on:
  pull_request:

jobs:
  ephemeral-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Create, validate, destroy
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          bash scripts/ephemeral-test.sh ./terraform/eks ./tests/validate.sh

Cron-Based Cleanup#

For clusters created outside CI (manual testing, development), run a scheduled cleanup job that finds and destroys expired resources:

#!/bin/bash
# cleanup-expired-clusters.sh
# Run via cron: 0 * * * * /path/to/cleanup-expired-clusters.sh

NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# AWS: find EKS clusters tagged as ephemeral and past TTL
aws eks list-clusters --output json | jq -r '.clusters[]' | while read -r cluster; do
  destroy_after=$(aws eks describe-cluster --name "$cluster" \
    --query "cluster.tags.DestroyAfter" --output text 2>/dev/null)

  if [[ "$destroy_after" != "None" && "$destroy_after" < "$NOW" ]]; then
    echo "Destroying expired cluster: $cluster (expired: $destroy_after)"
    # Prefer terraform destroy if the state is available; the direct deletes below
    # leave the VPC and NAT gateway behind, so clean those up separately.
    # Node groups must finish deleting before the cluster itself can be deleted.
    aws eks delete-nodegroup --cluster-name "$cluster" --nodegroup-name ephemeral --no-cli-pager
    aws eks wait nodegroup-deleted --cluster-name "$cluster" --nodegroup-name ephemeral
    aws eks delete-cluster --name "$cluster" --no-cli-pager
  fi
done
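
The same sweep can cover the other providers. For AKS the cleanest approach is to delete the whole resource group, since the configuration above tags it with Environment and DestroyAfter – a sketch, assuming an authenticated Azure CLI:

# Azure: delete ephemeral resource groups whose DestroyAfter tag is in the past
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)

az group list --query "[?tags.Environment=='ephemeral'].{name:name, after:tags.DestroyAfter}" \
  --output json | jq -c '.[]' | while read -r rg; do
  name=$(echo "$rg" | jq -r '.name')
  after=$(echo "$rg" | jq -r '.after')
  if [[ "$after" != "null" && "$after" < "$NOW" ]]; then
    echo "Deleting expired resource group: $name (expired: $after)"
    az group delete --name "$name" --yes --no-wait
  fi
done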

TTL-Based with AWS Lambda#

For a fully automated approach, deploy a Lambda function triggered by EventBridge on a schedule:

# lambda_function.py
import boto3
from datetime import datetime, timezone

def handler(event, context):
    eks = boto3.client('eks')
    clusters = eks.list_clusters()['clusters']

    for cluster_name in clusters:
        cluster = eks.describe_cluster(name=cluster_name)['cluster']
        tags = cluster.get('tags', {})

        if tags.get('Environment') != 'ephemeral':
            continue

        destroy_after = tags.get('DestroyAfter')
        if not destroy_after:
            continue

        if datetime.fromisoformat(destroy_after.replace('Z', '+00:00')) < datetime.now(timezone.utc):
            print(f"Destroying expired cluster: {cluster_name}")
            # Delete node groups first
            nodegroups = eks.list_nodegroups(clusterName=cluster_name)['nodegroups']
            for ng in nodegroups:
                eks.delete_nodegroup(clusterName=cluster_name, nodegroupName=ng)
            # Wait for nodegroups to delete, then delete cluster
            # In production, use a Step Function for this orchestration

How an Agent Should Use These#

An agent tasked with validating infrastructure changes should follow this exact sequence:

  1. Select the cheapest provider for the validation type. If the validation only needs a running Kubernetes cluster (not provider-specific features), use AKS ($0.084/hour) or GKE Autopilot (pay-per-pod). If the validation tests AWS-specific integrations (ALB ingress, EBS CSI, IAM roles for service accounts), use EKS.

  2. Set a tight TTL. Most validations complete in under 30 minutes. Set ttl_hours=1 as the default. Only increase it if the validation is known to take longer.

  3. Use the wrapper script. Never run terraform apply without the trap cleanup EXIT pattern. The risk of forgetting to destroy is too high.

  4. Fail fast on creation errors. If the cluster fails to create (quota limits, permissions issues, region capacity), do not retry automatically. Report the error and let a human investigate. Retrying in a loop can create partially-provisioned resources that are harder to clean up.

  5. Log the cost. After destroy, estimate and log the cost: duration_hours * hourly_rate. This creates visibility into ephemeral cluster spending over time. A minimal logging sketch follows this list.
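
For the cost log in step 5, a minimal sketch (the helper name log-ephemeral-cost.sh and the log file are placeholders to adapt):

#!/bin/bash
# log-ephemeral-cost.sh -- hypothetical helper: record duration * hourly rate after destroy
# Usage: log-ephemeral-cost.sh <start_epoch_seconds> <hourly_rate_usd> <cluster_name>
set -euo pipefail

START_EPOCH="$1"
HOURLY_RATE="$2"
CLUSTER_NAME="$3"

DURATION_HOURS=$(awk -v s="$START_EPOCH" -v e="$(date +%s)" 'BEGIN { printf "%.2f", (e - s) / 3600 }')
COST=$(awk -v h="$DURATION_HOURS" -v r="$HOURLY_RATE" 'BEGIN { printf "%.2f", h * r }')

echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) cluster=$CLUSTER_NAME duration_hours=$DURATION_HOURS estimated_cost_usd=$COST" >> ephemeral-costs.log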

Cost Comparison Summary#

| Provider      | Hourly Cost (2 nodes) | Daily Cost | Control Plane | Best For                  |
|---------------|-----------------------|------------|---------------|---------------------------|
| EKS           | ~$0.23/hr             | ~$5.50/day | $0.10/hr      | AWS-specific testing      |
| GKE Autopilot | ~$0.22/hr (varies)    | ~$5.30/day | $0.10/hr      | Pay-per-pod, no idle cost |
| AKS           | ~$0.084/hr            | ~$2.00/day | Free          | Cheapest option           |

These costs assume the cheapest viable node types and minimal configuration. Production-like configurations with larger nodes, multiple AZs, and additional services will cost more.