Spot Instances and Preemptible Nodes#

Spot instances are unused cloud capacity sold at a steep discount – typically 60-90% off on-demand pricing. The trade-off: the cloud provider can reclaim them with minimal notice. AWS gives a 2-minute warning, GCP gives 30 seconds, and Azure varies. Running Kubernetes workloads on spot instances is one of the most effective cost reduction strategies available, but it requires architecture that tolerates sudden node loss.

Terminology Across Providers#

Provider   Product                     Warning Time                Max Lifetime
AWS        Spot Instances              2 minutes                   No limit
GCP        Spot VMs                    30 seconds                  No limit
GCP        Preemptible VMs (legacy)    30 seconds                  24 hours
Azure      Spot VMs                    30 seconds (configurable)   No limit

GCP Preemptible VMs are the older product with a mandatory 24-hour lifetime. GCP Spot VMs replaced them and have no maximum lifetime. Use Spot VMs for new deployments.

When to Use Spot#

Good candidates for spot:

  • Stateless web services with multiple replicas behind a load balancer
  • Batch processing and data pipeline jobs
  • CI/CD runners and build agents
  • Dev/staging environments (entire environments can run on spot)
  • Scale-out workers (queue consumers, stream processors)
  • Machine learning training with checkpointing

Workloads to keep on on-demand:

  • Databases and stateful singletons – data loss risk on sudden termination
  • Control plane components – etcd, API server, critical operators
  • Anything that cannot tolerate a 2-minute shutdown window
  • Pods with PVCs in a single AZ – reclamation can strand volumes (discussed in gotchas)

Architecture Pattern: Mixed Node Pools#

The standard pattern is to run two classes of node pools:

On-demand pool:  baseline capacity for critical workloads
Spot pool:       burst capacity for fault-tolerant workloads

Use taints on spot nodes to ensure only spot-tolerant pods are scheduled there:

# Taint on spot node pool
taints:
  - key: kubernetes.io/spot
    value: "true"
    effect: NoSchedule

Pods that can run on spot add a matching toleration and a node affinity preference:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      tolerations:
        - key: kubernetes.io/spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values: ["spot"]
      terminationGracePeriodSeconds: 90  # keep under the 2-minute AWS spot notice
      containers:
        - name: worker
          image: myapp/worker:latest
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"

Pods without the toleration are blocked from spot nodes and remain on on-demand nodes.
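
For critical workloads you can go a step further than simply omitting the toleration and require on-demand capacity explicitly. A minimal sketch, assuming nodes carry the karpenter.sh/capacity-type label used elsewhere in this section (the Deployment name and image are illustrative):

# Pin a critical workload to on-demand nodes with a hard affinity requirement.
# Assumes the karpenter.sh/capacity-type node label; names and image are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values: ["on-demand"]
      containers:
        - name: api
          image: myapp/api:latest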

Spot Interruption Handling#

AWS: Node Termination Handler#

The AWS Node Termination Handler (NTH) watches for EC2 spot interruption notices and automatically cordons and drains the affected node:

helm install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true \
  --set enableRebalanceMonitoring=true

NTH operates in two modes:

  • IMDS mode (Instance Metadata Service): runs as a DaemonSet on each node, polls the instance metadata endpoint for interruption notices.
  • Queue mode: runs as a Deployment and consumes EC2 events from an SQS queue populated by EventBridge rules. More reliable than metadata polling and supports additional event types (rebalance recommendations, scheduled maintenance).

Queue mode is recommended for production:

helm install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSqsTerminationDraining=true \
  --set queueURL=https://sqs.us-east-1.amazonaws.com/123456789/spot-interruption-queue

GKE: Built-in Handling#

GKE handles spot node preemption automatically. When a Spot VM is reclaimed, GKE marks the node for deletion and drains pods. No additional components are needed, but you should still set appropriate terminationGracePeriodSeconds and PodDisruptionBudgets.
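
If you created the spot pool with a taint (as in the gcloud command later in this section), spot-tolerant pods need a matching toleration; GKE also labels Spot VM nodes with cloud.google.com/gke-spot=true, which can be used to steer pods onto them. A minimal pod-spec fragment under those assumptions:

# Pod-spec fragment for GKE Spot nodes.
# Assumes the taint from the node-pool creation command shown later in this section;
# the cloud.google.com/gke-spot label is applied by GKE to Spot VM nodes.
tolerations:
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
nodeSelector:
  cloud.google.com/gke-spot: "true"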

AKS: Spot Node Pools#

AKS spot node pools handle eviction at the VMSS level. Configure the eviction policy:

az aks nodepool add \
  --resource-group myRG \
  --cluster-name myCluster \
  --name spotnodepool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --node-count 3 \
  --node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule

--eviction-policy Delete removes the VM entirely on eviction. Deallocate stops the VM but keeps its disks (which you continue to pay for), and the workload is lost either way; Delete is simpler and avoids confusion.
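
Pods meant for the spot pool then need a toleration matching the taint applied in the command above. A minimal pod-spec fragment:

# Pod-spec fragment tolerating the AKS spot node pool taint shown above
tolerations:
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule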

Graceful Shutdown#

Your application must handle SIGTERM and shut down within the interruption window. For AWS spot, you have approximately 2 minutes from notice to termination, minus the time NTH takes to cordon and begin draining (typically 10-15 seconds).

// Go example: graceful shutdown on SIGTERM
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8080"}

    go func() {
        sigCh := make(chan os.Signal, 1)
        signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
        <-sigCh

        log.Println("received shutdown signal, draining connections...")
        ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()
        srv.Shutdown(ctx)
    }()

    if err := srv.ListenAndServe(); err != http.ErrServerClosed {
        log.Fatal(err)
    }
}

Set terminationGracePeriodSeconds to less than the interruption window:

spec:
  terminationGracePeriodSeconds: 90  # 90 seconds, well under 2-minute AWS warning

PodDisruptionBudgets#

PDBs protect against too many pods being evicted simultaneously:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  minAvailable: "60%"
  selector:
    matchLabels:
      app: worker

During a spot reclamation, the drain process respects PDBs. If evicting a pod would violate the PDB, the drain blocks until other replicas are available. However, if the node is forcibly terminated (after the 2-minute window), PDBs are bypassed – the VM simply disappears.

Instance Type Diversification#

The biggest risk with spot is capacity unavailability. If you request only m5.xlarge spot instances and that specific type is in high demand, you get no capacity. Diversifying across multiple instance types and availability zones dramatically improves availability.

AWS with Karpenter#

Karpenter automatically selects from a wide range of instance types based on your constraints:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workers
spec:
  template:
    spec:
      nodeClassRef:  # references an EC2NodeClass defined separately (name is assumed)
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # prefer spot, fall back to on-demand
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]  # general, compute, memory families
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]  # 5th gen and newer
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge", "4xlarge"]
      taints:
        - key: kubernetes.io/spot
          value: "true"
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "200"
    memory: "400Gi"

With this configuration, Karpenter might launch a c5.2xlarge in us-east-1a, an m6i.xlarge in us-east-1b, and an r5.xlarge in us-east-1c – whatever has the best spot availability and pricing at that moment. When a spot instance is reclaimed, Karpenter automatically launches a replacement from the available pool.

AWS Managed Node Groups#

If not using Karpenter, create spot managed node groups with a diversified list of instance types; EKS managed node groups use the capacity-optimized allocation strategy for spot capacity:

# eksctl nodegroup configuration
managedNodeGroups:
  - name: spot-workers
    instanceTypes: ["m5.xlarge", "m5a.xlarge", "m6i.xlarge", "c5.xlarge", "c5a.xlarge", "r5.xlarge"]
    spot: true
    desiredCapacity: 5
    minSize: 0
    maxSize: 20

The capacity-optimized strategy selects instance types from pools with the most available capacity, reducing the frequency of interruptions.

GKE Spot Node Pools#

gcloud container node-pools create spot-pool \
  --cluster=my-cluster \
  --spot \
  --num-nodes=3 \
  --machine-type=e2-standard-4 \
  --node-taints=cloud.google.com/gke-spot=true:NoSchedule

GKE does not support mixed instance types within a single node pool the same way AWS does. Use multiple spot node pools with different machine types for diversification.

Cost Tracking#

Spot savings are often invisible in basic Kubernetes monitoring because the cluster does not know what you are paying per node. Use cloud-native cost tools:

# AWS: check spot pricing history
aws ec2 describe-spot-price-history \
  --instance-types m5.xlarge m5a.xlarge c5.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time $(date -u -v-1d +%Y-%m-%dT%H:%M:%S) \
  --query 'SpotPriceHistory[*].{Type:InstanceType,AZ:AvailabilityZone,Price:SpotPrice}' \
  --output table

Kubecost can distinguish spot vs on-demand costs when given access to your cloud billing data, showing the actual savings achieved.

Common Gotchas#

Mass reclamation during capacity crunches. When a cloud region runs low on capacity, many spot instances are reclaimed simultaneously. If all your spot nodes disappear at once, the remaining on-demand nodes face a thundering herd of rescheduling pods. Mitigate with: PDBs, topology spread constraints, and enough on-demand baseline capacity to absorb the critical workloads.
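
A topology spread constraint keeps replicas distributed across zones so that a reclamation wave in one zone cannot take out every copy at once. A minimal sketch for the worker Deployment from earlier (ScheduleAnyway keeps it a soft preference rather than a hard scheduling rule):

# Pod-spec fragment: spread worker replicas across availability zones
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway  # soft constraint: prefer balance, never block scheduling
    labelSelector:
      matchLabels:
        app: worker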

PVCs stuck in the wrong AZ. When a spot node in us-east-1a is reclaimed, any PVC attached to pods on that node stays bound to the us-east-1a zone. If the replacement node lands in us-east-1b, the pod cannot mount the volume. Solutions: use topology-aware scheduling (volumeBindingMode: WaitForFirstConsumer), or use EFS/Filestore (cross-AZ storage) for spot workloads.
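
With WaitForFirstConsumer, the volume is not provisioned until a pod is scheduled, so it is created in the same zone as the pod that first uses it. A minimal StorageClass sketch, assuming the EBS CSI driver (the class name and gp3 type are illustrative):

# StorageClass with topology-aware volume binding (EBS CSI driver assumed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-topology-aware  # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer

Cross-AZ storage such as EFS or Filestore removes the zone coupling entirely.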

Spot interruption during deployment rollout. If a spot node is reclaimed mid-rollout, the new and old ReplicaSets both lose pods. Combined with a tight PDB, this can stall the rollout. Set rollout maxUnavailable and maxSurge with spot interruptions in mind.
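
A rollout strategy with some headroom helps the Deployment make progress even if a spot node disappears mid-rollout; the exact values depend on your replica count and PDB, so treat these as illustrative:

# Deployment rollout strategy tuned for spot churn (illustrative values)
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # extra pods allowed above the desired count during rollout
      maxUnavailable: 20%  # allow some unavailability so an interruption does not stall the rollout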

Practical Example: EKS with On-Demand Baseline and Spot Overflow#

A production EKS cluster running a web application with background workers:

  • On-demand node pool (3x m6i.2xlarge): runs the API servers, databases, Redis, and Prometheus. These pods have no spot toleration.
  • Spot node pool (Karpenter-managed, 5-15 nodes): runs background workers, batch processors, and non-critical services. Karpenter selects from 15+ instance types across 3 AZs.

Monthly cost breakdown:

  • On-demand: 3 nodes at $0.384/hr = $829/month
  • Spot: average 8 nodes at $0.10/hr (avg spot price for mixed types) = $576/month
  • Same workload fully on-demand would cost: $2,650/month
  • Total: ~$1,405/month versus $2,650 all on-demand, a savings of roughly 47%

The spot nodes experience 2-3 interruptions per week. Karpenter replaces each within 60 seconds. Application-level retries handle the in-flight requests that are lost during the interruption window.