Ephemeral Cloud Clusters#
Ephemeral clusters exist for one purpose: validate something, then disappear. They are not staging environments, not shared dev clusters, not long-lived resources that someone forgets to turn off. The operational model is strict – create, validate, destroy – and the entire sequence must be automated so that destruction cannot be forgotten.
The cost of getting this wrong is real. A three-node EKS cluster left running over a weekend costs roughly $15. Left running for a month, $200. Multiply by the number of developers or CI pipelines that create clusters, and forgotten ephemeral infrastructure becomes a significant budget line item. Every template in this article includes auto-destroy mechanisms to prevent this.
The Create-Validate-Destroy Pattern#
Every ephemeral cluster follows the same lifecycle:
- Create – Terraform provisions the cluster with minimal configuration. No monitoring stack, no ingress controllers, no persistent storage unless the validation requires it.
- Configure – Get kubeconfig, install any test dependencies (a Helm chart being validated, a set of manifests, a database operator).
- Validate – Run the actual tests. Helm install succeeds, pods reach Running state, services respond on expected ports, integration tests pass.
- Destroy – Terraform destroys everything. No partial cleanup, no orphaned resources.
The critical rule: steps 1 through 4 must execute in a single automated sequence. If step 3 fails, step 4 still runs. If step 2 fails, step 4 still runs. The only acceptable outcome is that the cluster no longer exists when the sequence completes.
#!/bin/bash
# ephemeral-test.sh <cluster_dir> <validation_script>
set -euo pipefail

CLUSTER_DIR="$1"
VALIDATION_SCRIPT="$2"

cleanup() {
  echo "Destroying ephemeral cluster..."
  cd "$CLUSTER_DIR"
  terraform destroy -auto-approve -input=false 2>&1 | tail -20
}
# Runs on every exit path: success, failure, or signal
trap cleanup EXIT

cd "$CLUSTER_DIR"
terraform init -input=false
terraform apply -auto-approve -input=false

# Extract kubeconfig
terraform output -raw kubeconfig > /tmp/ephemeral-kubeconfig
export KUBECONFIG=/tmp/ephemeral-kubeconfig

# Wait for nodes to be ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s

# Run validation
bash "$VALIDATION_SCRIPT"

The trap cleanup EXIT is the most important line. It ensures terraform destroy runs regardless of how the script exits – success, failure, or signal.
Ephemeral EKS on AWS#
Terraform Configuration#
This module creates a minimal EKS cluster with managed node groups. It uses the official terraform-aws-modules/eks/aws module to avoid reinventing VPC and IAM configuration.
# main.tf
terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.region
}

variable "region" {
  default = "us-east-1"
}

variable "cluster_name" {
  default = "ephemeral"
}

variable "ttl_hours" {
  description = "Hours before auto-destroy (used for tagging)"
  default     = 4
}

locals {
  destroy_after = timeadd(timestamp(), "${var.ttl_hours}h")
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "${var.cluster_name}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["${var.region}a", "${var.region}b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = local.destroy_after
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_endpoint_public_access = true

  eks_managed_node_groups = {
    ephemeral = {
      instance_types = ["t3.medium"]
      min_size       = 1
      max_size       = 3
      desired_size   = 2
    }
  }

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = local.destroy_after
  }
}
output "kubeconfig" {
value = <<-EOT
apiVersion: v1
kind: Config
clusters:
- cluster:
server: ${module.eks.cluster_endpoint}
certificate-authority-data: ${module.eks.cluster_certificate_authority_data}
name: ${var.cluster_name}
contexts:
- context:
cluster: ${var.cluster_name}
user: ${var.cluster_name}
name: ${var.cluster_name}
current-context: ${var.cluster_name}
users:
- name: ${var.cluster_name}
user:
exec:
apiVersion: client.authentication.k8s.io/v1beta1
command: aws
args: ["eks", "get-token", "--cluster-name", "${var.cluster_name}", "--region", "${var.region}"]
EOT
sensitive = true
}Cost Estimate#
EKS control plane: $0.10/hour. Two t3.medium nodes: $0.0416/hour each. NAT gateway: $0.045/hour. Total: approximately $0.23/hour or $5.50/day. The single NAT gateway and two-AZ VPC are the cheapest configuration that still allows EKS to function (EKS requires subnets in at least two AZs).
Apply and Validate#
terraform init -input=false
terraform apply -auto-approve -input=false -var="cluster_name=test-$(date +%s)"
terraform output -raw kubeconfig > /tmp/eph-kubeconfig
export KUBECONFIG=/tmp/eph-kubeconfig
# Validate cluster is functional
kubectl get nodes
kubectl create namespace validation-test
kubectl run nginx --image=nginx:alpine -n validation-test
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=120s
kubectl delete namespace validation-test
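After terraform destroy completes, it is worth confirming that nothing survived – with EKS, the NAT gateway and VPC are the usual (and costly) orphans. A quick audit sketch, assuming the Environment=ephemeral tag from the module above and the test- name prefix from the apply command:

# All three should print nothing once destroy has finished
aws eks list-clusters --query "clusters[?starts_with(@, 'test-')]" --output text
aws ec2 describe-vpcs --filters "Name=tag:Environment,Values=ephemeral" \
  --query "Vpcs[].VpcId" --output text
aws ec2 describe-nat-gateways --filter "Name=tag:Environment,Values=ephemeral" \
  --query "NatGateways[?State!='deleted'].NatGatewayId" --output text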
Ephemeral GKE on GCP#

GKE Autopilot is the best choice for ephemeral clusters because you pay only for running pods, there are no idle node costs, and you do not need to manage node pools.
Terraform Configuration#
# main.tf
terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

variable "project_id" {
  description = "GCP project ID"
}

variable "region" {
  default = "us-central1"
}

variable "cluster_name" {
  default = "ephemeral"
}

variable "ttl_hours" {
  default = 4
}

resource "google_container_cluster" "ephemeral" {
  name     = var.cluster_name
  location = var.region

  enable_autopilot = true

  release_channel {
    channel = "RAPID"
  }

  resource_labels = {
    environment   = "ephemeral"
    destroy-after = formatdate("YYYY-MM-DD-hh-mm", timeadd(timestamp(), "${var.ttl_hours}h"))
  }

  deletion_protection = false
}
output "kubeconfig" {
value = <<-EOT
apiVersion: v1
kind: Config
clusters:
- cluster:
server: https://${google_container_cluster.ephemeral.endpoint}
certificate-authority-data: ${google_container_cluster.ephemeral.master_auth[0].cluster_ca_certificate}
name: ${var.cluster_name}
contexts:
- context:
cluster: ${var.cluster_name}
user: ${var.cluster_name}
name: ${var.cluster_name}
current-context: ${var.cluster_name}
users:
- name: ${var.cluster_name}
user:
exec:
apiVersion: client.authentication.k8s.io/v1beta1
command: gke-gcloud-auth-plugin
installHint: "Install gke-gcloud-auth-plugin for kubectl"
EOT
sensitive = true
}Cost Estimate#
GKE Autopilot charges per pod resource: $0.000017/vCPU-second and $0.000002/GB-second, plus a $0.10/hour management fee. For a typical validation workload running 2 vCPUs and 4GB RAM for one hour, that works out to roughly $0.15 in pod resources plus the fee, approximately $0.25 total. Autopilot has no idle node costs – if no pods are running, you pay only the management fee.
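To make the per-second rates concrete, here is a small helper that reproduces that arithmetic (the rates are hard-coded from the paragraph above and should be treated as illustrative, not current list prices):

# Estimate Autopilot cost: estimate_autopilot_cost <vCPUs> <GB RAM> <hours>
estimate_autopilot_cost() {
  awk -v cpu="$1" -v gb="$2" -v hrs="$3" 'BEGIN {
    pods = (cpu * 0.000017 + gb * 0.000002) * 3600 * hrs  # pod resources
    fee  = 0.10 * hrs                                     # management fee
    printf "$%.2f\n", pods + fee
  }'
}
estimate_autopilot_cost 2 4 1   # prints $0.25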
Apply and Validate#
terraform init -input=false
terraform apply -auto-approve -input=false \
-var="project_id=my-project" \
-var="cluster_name=eph-$(date +%s)"
terraform output -raw kubeconfig > /tmp/eph-kubeconfig
export KUBECONFIG=/tmp/eph-kubeconfig
# GKE Autopilot may take a few minutes to schedule pods
kubectl get nodes
kubectl create namespace validation-test
kubectl run nginx --image=nginx:alpine -n validation-test \
  --overrides='{"spec":{"containers":[{"name":"nginx","image":"nginx:alpine","resources":{"requests":{"cpu":"250m","memory":"256Mi"}}}]}}'
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=300s
kubectl delete namespace validation-test

Note the explicit resource requests, passed via --overrides because kubectl run's --requests flag was removed in kubectl 1.24. Autopilot requires resource requests on all pods; pods without requests get default values, which may not match your expectations.
Ephemeral AKS on Azure#
Terraform Configuration#
# main.tf
terraform {
  required_version = ">= 1.5"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

variable "location" {
  default = "eastus"
}

variable "cluster_name" {
  default = "ephemeral"
}

variable "ttl_hours" {
  default = 4
}

resource "azurerm_resource_group" "ephemeral" {
  name     = "${var.cluster_name}-rg"
  location = var.location

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = timeadd(timestamp(), "${var.ttl_hours}h")
  }
}

resource "azurerm_kubernetes_cluster" "ephemeral" {
  name                = var.cluster_name
  location            = azurerm_resource_group.ephemeral.location
  resource_group_name = azurerm_resource_group.ephemeral.name
  dns_prefix          = var.cluster_name

  default_node_pool {
    name       = "default"
    node_count = 2
    vm_size    = "Standard_B2s"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = timeadd(timestamp(), "${var.ttl_hours}h")
  }
}

output "kubeconfig" {
  value     = azurerm_kubernetes_cluster.ephemeral.kube_config_raw
  sensitive = true
}

Cost Estimate#
Two Standard_B2s nodes: approximately $0.042/hour each. AKS control plane: free (unlike EKS). Total: approximately $0.084/hour or $2.00/day. AKS is the cheapest option for ephemeral clusters because the control plane has no charge.
Apply and Validate#
terraform init -input=false
terraform apply -auto-approve -input=false \
-var="cluster_name=eph-$(date +%s)"
terraform output -raw kubeconfig > /tmp/eph-kubeconfig
export KUBECONFIG=/tmp/eph-kubeconfig
kubectl get nodes
kubectl create namespace validation-test
kubectl run nginx --image=nginx:alpine -n validation-test
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=120s
kubectl delete namespace validation-test
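Because the AKS template puts everything in a single resource group, verifying the destroy on Azure is one call, assuming the ${var.cluster_name}-rg naming from the Terraform above:

CLUSTER_NAME="eph-1700000000"                 # whatever name was passed to terraform apply
az group exists --name "${CLUSTER_NAME}-rg"   # prints "false" after a clean destroy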
Auto-Destroy Mechanisms#

The Terraform configurations above tag resources with a DestroyAfter timestamp, but tags alone do not destroy anything. You need an active mechanism to enforce the TTL.
CI-Triggered Destroy#
The simplest approach: the same CI job that creates the cluster also destroys it. The wrapper script at the top of this article demonstrates this. In GitHub Actions:
jobs:
  ephemeral-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Create, validate, destroy
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          bash scripts/ephemeral-test.sh ./terraform/eks ./tests/validate.sh

Cron-Based Cleanup#
For clusters created outside CI (manual testing, development), run a scheduled cleanup job that finds and destroys expired resources:
#!/bin/bash
# cleanup-expired-clusters.sh
# Run via cron: 0 * * * * /path/to/cleanup-expired-clusters.sh
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# AWS: find EKS clusters tagged as ephemeral and past TTL
aws eks list-clusters --output json | jq -r '.clusters[]' | while read -r cluster; do
  destroy_after=$(aws eks describe-cluster --name "$cluster" \
    --query "cluster.tags.DestroyAfter" --output text 2>/dev/null)
  if [[ "$destroy_after" != "None" && "$destroy_after" < "$NOW" ]]; then
    echo "Destroying expired cluster: $cluster (expired: $destroy_after)"
    # Prefer terraform destroy if state is available; this CLI fallback removes
    # the cluster but leaves the VPC and NAT gateway behind.
    for ng in $(aws eks list-nodegroups --cluster-name "$cluster" \
        --query "nodegroups[]" --output text); do
      aws eks delete-nodegroup --cluster-name "$cluster" --nodegroup-name "$ng" --no-cli-pager
      # The cluster cannot be deleted while node groups still exist
      aws eks wait nodegroup-deleted --cluster-name "$cluster" --nodegroup-name "$ng"
    done
    aws eks delete-cluster --name "$cluster" --no-cli-pager
  fi
done
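The same sweep translates to the other clouds. An Azure sketch under the same tagging convention (Environment and DestroyAfter on the resource group, as in the AKS template above, and reusing $NOW from the AWS loop); deleting the resource group takes the cluster and its network with it:

# Azure: delete expired ephemeral resource groups
az group list --query "[?tags.Environment=='ephemeral'].[name, tags.DestroyAfter]" \
  --output tsv | while read -r rg destroy_after; do
  if [[ -n "$destroy_after" && "$destroy_after" < "$NOW" ]]; then
    echo "Destroying expired resource group: $rg"
    az group delete --name "$rg" --yes --no-wait
  fi
done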
TTL-Based with AWS Lambda#

For a fully automated approach, deploy a Lambda function triggered by EventBridge on a schedule:
# lambda_function.py
import boto3
from datetime import datetime, timezone

def handler(event, context):
    eks = boto3.client('eks')
    clusters = eks.list_clusters()['clusters']
    for cluster_name in clusters:
        cluster = eks.describe_cluster(name=cluster_name)['cluster']
        tags = cluster.get('tags', {})
        if tags.get('Environment') != 'ephemeral':
            continue
        destroy_after = tags.get('DestroyAfter')
        if not destroy_after:
            continue
        if datetime.fromisoformat(destroy_after.replace('Z', '+00:00')) < datetime.now(timezone.utc):
            print(f"Destroying expired cluster: {cluster_name}")
            # Delete node groups first; the cluster cannot be deleted
            # while node groups still exist
            nodegroups = eks.list_nodegroups(clusterName=cluster_name)['nodegroups']
            for ng in nodegroups:
                eks.delete_nodegroup(clusterName=cluster_name, nodegroupName=ng)
            # Wait for node groups to finish deleting, then delete the cluster.
            # Waiting can exceed Lambda's 15-minute limit; in production, use a
            # Step Function for this orchestration.
            waiter = eks.get_waiter('nodegroup_deleted')
            for ng in nodegroups:
                waiter.wait(clusterName=cluster_name, nodegroupName=ng)
            eks.delete_cluster(name=cluster_name)

How an Agent Should Use These#
An agent tasked with validating infrastructure changes should follow this exact sequence:
1. Select the cheapest provider for the validation type. If the validation only needs a running Kubernetes cluster (not provider-specific features), use AKS ($0.084/hour) or GKE Autopilot (pay-per-pod). If the validation tests AWS-specific integrations (ALB ingress, EBS CSI, IAM roles for service accounts), use EKS.
2. Set a tight TTL. Most validations complete in under 30 minutes. Set ttl_hours=1 as the default. Only increase it if the validation is known to take longer.
3. Use the wrapper script. Never run terraform apply without the trap cleanup EXIT pattern. The risk of forgetting to destroy is too high.
4. Fail fast on creation errors. If the cluster fails to create (quota limits, permission issues, region capacity), do not retry automatically. Report the error and let a human investigate. Retrying in a loop can create partially provisioned resources that are harder to clean up.
5. Log the cost. After destroy, estimate and log the cost: duration_hours * hourly_rate (a sketch follows this list). This creates visibility into ephemeral cluster spending over time.
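A minimal sketch of that logging step, as hypothetical additions to the wrapper script from the top of the article (the flat hourly rate is an assumption; Autopilot billing would need a per-pod estimate instead):

# Additions to ephemeral-test.sh: set START_TS before terraform apply,
# then call log_cost from cleanup() after terraform destroy
START_TS=$(date +%s)
HOURLY_RATE=0.23   # EKS figure from the table below; adjust per provider

log_cost() {
  local end_ts
  end_ts=$(date +%s)
  awk -v s="$START_TS" -v e="$end_ts" -v r="$HOURLY_RATE" \
    'BEGIN { printf "estimated cluster cost: $%.2f over %.2f hours\n", (e - s) / 3600 * r, (e - s) / 3600 }'
}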
Cost Comparison Summary#
| Provider | Hourly Cost (2 nodes) | Daily Cost | Control Plane | Best For |
|---|---|---|---|---|
| EKS | ~$0.23/hr | ~$5.50/day | $0.10/hr | AWS-specific testing |
| GKE Autopilot | ~$0.25/hr (varies) | ~$6.00/day | $0.10/hr | Pay-per-pod, no idle cost |
| AKS | ~$0.084/hr | ~$2.00/day | Free | Cheapest option |
These costs assume the cheapest viable node types and minimal configuration. Production-like configurations with larger nodes, multiple AZs, and additional services will cost more.