# Spot Instances and Preemptible Nodes
Spot instances are unused cloud capacity sold at a steep discount, typically 60-90% off on-demand pricing. The trade-off: the cloud provider can reclaim them with minimal notice. AWS gives a 2-minute warning; GCP and Azure give 30 seconds. Running Kubernetes workloads on spot instances is one of the most effective cost reduction strategies available, but it requires an architecture that tolerates sudden node loss.
## Terminology Across Providers
| Provider | Product | Warning Time | Max Lifetime |
|---|---|---|---|
| AWS | Spot Instances | 2 minutes | No limit |
| GCP | Spot VMs | 30 seconds | No limit |
| GCP | Preemptible VMs (legacy) | 30 seconds | 24 hours |
| Azure | Spot VMs | 30 seconds (configurable) | No limit |
GCP Preemptible VMs are the older product with a mandatory 24-hour lifetime. GCP Spot VMs replaced them and have no maximum lifetime. Use Spot VMs for new deployments.
## When to Use Spot
Good candidates for spot:
- Stateless web services with multiple replicas behind a load balancer
- Batch processing and data pipeline jobs
- CI/CD runners and build agents
- Dev/staging environments (entire environments can run on spot)
- Scale-out workers (queue consumers, stream processors)
- Machine learning training with checkpointing
Workloads to keep on on-demand:
- Databases and stateful singletons – data loss risk on sudden termination
- Control plane components – etcd, API server, critical operators
- Anything that cannot tolerate a 2-minute shutdown window
- Pods with PVCs in a single AZ – reclamation can strand volumes (discussed in gotchas)
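If critical pods must never land on spot capacity, a required node affinity on the capacity type is one option; the taint-based pattern in the next section is the other half of the picture. A minimal pod-spec sketch, assuming nodes carry the karpenter.sh/capacity-type label that appears later in this article:

# Pod-spec fragment (illustrative): require on-demand capacity for a critical workload
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["on-demand"]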
## Architecture Pattern: Mixed Node Pools
The standard pattern is to run two classes of node pools:
- On-demand pool: baseline capacity for critical workloads
- Spot pool: burst capacity for fault-tolerant workloads

Use taints on spot nodes to ensure only spot-tolerant pods are scheduled there:
# Taint on spot node pool
taints:
  - key: kubernetes.io/spot
    value: "true"
    effect: NoSchedule

Pods that can run on spot add a matching toleration and a node affinity preference:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      tolerations:
        - key: kubernetes.io/spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values: ["spot"]
      terminationGracePeriodSeconds: 90 # must be < 120s for AWS spot
      containers:
        - name: worker
          image: myapp/worker:latest
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"

Pods without the toleration are blocked from spot nodes and remain on on-demand nodes.
## Spot Interruption Handling
### AWS: Node Termination Handler
The AWS Node Termination Handler (NTH) watches for EC2 spot interruption notices and automatically cordons and drains the affected node:
helm install aws-node-termination-handler \
eks/aws-node-termination-handler \
--namespace kube-system \
--set enableSpotInterruptionDraining=true \
--set enableScheduledEventDraining=true \
--set enableRebalanceMonitoring=true

NTH operates in two modes:
- IMDS mode (Instance Metadata Service): runs as a DaemonSet on each node, polls the instance metadata endpoint for interruption notices.
- Queue mode: uses an SQS queue to receive EC2 events. More reliable and supports additional event types (rebalance recommendations, scheduled maintenance).
Queue mode is recommended for production:
helm install aws-node-termination-handler \
eks/aws-node-termination-handler \
--namespace kube-system \
--set enableSqsTerminationDraining=true \
--set queueURL=https://sqs.us-east-1.amazonaws.com/123456789/spot-interruption-queue

### GKE: Built-in Handling
GKE handles spot node preemption automatically. When a Spot VM is reclaimed, GKE marks the node for deletion and drains pods. No additional components are needed, but you should still set appropriate terminationGracePeriodSeconds and PodDisruptionBudgets.
### AKS: Spot Node Pools
AKS spot node pools handle eviction at the VMSS level. Configure the eviction policy:
az aks nodepool add \
--resource-group myRG \
--cluster-name myCluster \
--name spotnodepool \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--node-count 3 \
--node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule

--eviction-policy Delete removes the VM entirely on eviction. Deallocate keeps the VM but you still lose the workload; Delete is simpler and avoids confusion.
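Pods meant for the spot pool then need a toleration for that taint. A minimal pod-spec fragment matching the taint applied above:

# Toleration for the AKS spot node pool taint
tolerations:
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule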
## Graceful Shutdown
Your application must handle SIGTERM and shut down within the interruption window. For AWS spot, you have approximately 2 minutes from notice to termination, minus the time NTH takes to cordon and begin draining (typically 10-15 seconds).
// Go example: graceful shutdown on SIGTERM
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)
func main() {
    srv := &http.Server{Addr: ":8080"}
    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()
    // Wait for SIGTERM from the kubelet; main must not exit before Shutdown returns.
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
    <-sigCh
    log.Println("received shutdown signal, draining connections...")
    ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
    defer cancel()
    srv.Shutdown(ctx)
}

Set terminationGracePeriodSeconds to less than the interruption window:
spec:
  terminationGracePeriodSeconds: 90 # 90 seconds, well under the 2-minute AWS warning

## PodDisruptionBudgets
PDBs protect against too many pods being evicted simultaneously:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  minAvailable: "60%"
  selector:
    matchLabels:
      app: worker

During a spot reclamation, the drain process respects PDBs. If evicting a pod would violate the PDB, the drain blocks until other replicas are available. However, if the node is forcibly terminated (after the 2-minute window), PDBs are bypassed: the VM simply disappears.
## Instance Type Diversification
The biggest risk with spot is capacity unavailability. If you request only m5.xlarge spot instances and that specific type is in high demand, you get no capacity. Diversifying across multiple instance types and availability zones dramatically improves availability.
### AWS with Karpenter
Karpenter automatically selects from a wide range of instance types based on your constraints:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workers
spec:
  template:
    spec:
      nodeClassRef: # points at an EC2NodeClass defined elsewhere in the cluster
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # prefer spot, fall back to on-demand
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"] # general, compute, memory families
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"] # 5th gen and newer
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge", "4xlarge"]
      taints:
        - key: kubernetes.io/spot
          value: "true"
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "200"
    memory: "400Gi"

With this configuration, Karpenter might launch a c5.2xlarge in us-east-1a, an m6i.xlarge in us-east-1b, and an r5.xlarge in us-east-1c, depending on what has the best spot availability and pricing at that moment. When a spot instance is reclaimed, Karpenter automatically launches a replacement from the available pool.
### AWS Managed Node Groups
If not using Karpenter, create a managed node group with a list of instance types; EKS provisions its spot capacity with a capacity-optimized allocation strategy:
# eksctl nodegroup configuration
managedNodeGroups:
  - name: spot-workers
    instanceTypes: ["m5.xlarge", "m5a.xlarge", "m6i.xlarge", "c5.xlarge", "c5a.xlarge", "r5.xlarge"]
    spot: true
    desiredCapacity: 5
    minSize: 0
    maxSize: 20

The capacity-optimized strategy selects instance types from pools with the most available capacity, reducing the frequency of interruptions.
### GKE Spot Node Pools
gcloud container node-pools create spot-pool \
--cluster=my-cluster \
--spot \
--num-nodes=3 \
--machine-type=e2-standard-4 \
--node-taints=cloud.google.com/gke-spot=true:NoSchedule

GKE does not support mixed instance types within a single node pool the same way AWS does. Use multiple spot node pools with different machine types for diversification.
## Cost Tracking
Spot savings are often invisible in basic Kubernetes monitoring because the cluster does not know what you are paying per node. Use cloud-native cost tools:
# AWS: check spot pricing history
aws ec2 describe-spot-price-history \
--instance-types m5.xlarge m5a.xlarge c5.xlarge \
--product-descriptions "Linux/UNIX" \
--start-time $(date -u -v-1d +%Y-%m-%dT%H:%M:%S) \
--query 'SpotPriceHistory[*].{Type:InstanceType,AZ:AvailabilityZone,Price:SpotPrice}' \
--output table

Kubecost can distinguish spot vs on-demand costs when given access to your cloud billing data, showing the actual savings achieved.
## Common Gotchas
Mass reclamation during capacity crunches. When a cloud region runs low on capacity, many spot instances are reclaimed simultaneously. If all your spot nodes disappear at once, the remaining on-demand nodes face a thundering herd of rescheduling pods. Mitigate with: PDBs, topology spread constraints, and enough on-demand baseline capacity to absorb the critical workloads.
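Topology spread constraints are the piece of that mitigation that lives in the pod spec: they cap how unevenly replicas can pile up in any one zone, so a single-zone capacity crunch cannot take out most of a workload at once. A minimal sketch using the standard topology.kubernetes.io/zone label and reusing the app: worker labels from the earlier examples:

# Pod-spec fragment: spread worker replicas across zones
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway # prefer spreading without blocking scheduling
    labelSelector:
      matchLabels:
        app: worker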
PVCs stuck in the wrong AZ. When a spot node in us-east-1a is reclaimed, any PVC attached to pods on that node stays bound to the us-east-1a zone. If the replacement node lands in us-east-1b, the pod cannot mount the volume. Solutions: use topology-aware scheduling (volumeBindingMode: WaitForFirstConsumer), or use EFS/Filestore (cross-AZ storage) for spot workloads.
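For the topology-aware option, the relevant setting lives on the StorageClass. A minimal sketch assuming the EBS CSI driver (the class name and gp3 parameters are illustrative):

# StorageClass that delays provisioning until a pod is scheduled,
# so the volume is created in the zone the scheduler actually picked
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer

Note that this only helps when the volume is first provisioned; once created, an EBS volume stays zonal, which is why the cross-AZ options are the more robust answer for spot workloads that keep state.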
Spot interruption during deployment rollout. If a spot node is reclaimed mid-rollout, the new and old ReplicaSets both lose pods. Combined with a tight PDB, this can stall the rollout. Set rollout maxUnavailable and maxSurge with spot interruptions in mind.
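What that looks like in the Deployment spec, with illustrative values that leave headroom for a node disappearing mid-rollout:

# Deployment-spec fragment: rollout settings with headroom for sudden node loss
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%       # bring up extra pods before old ones are removed
    maxUnavailable: 10% # keep most replicas serving even if a spot node vanishes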
## Practical Example: EKS with On-Demand Baseline and Spot Overflow
A production EKS cluster running a web application with background workers:
- On-demand node pool (3x m6i.2xlarge): runs the API servers, databases, Redis, and Prometheus. These pods have no spot toleration.
- Spot node pool (Karpenter-managed, 5-15 nodes): runs background workers, batch processors, and non-critical services. Karpenter selects from 15+ instance types across 3 AZs.
Monthly cost breakdown:
- On-demand: 3 nodes at $0.384/hr = $829/month
- Spot: average 8 nodes at $0.10/hr (avg spot price for mixed types) = $576/month
- Same workload fully on-demand would cost: $2,650/month
- Total savings: ~47% compared to all on-demand ($829 + $576 = $1,405/month versus $2,650/month)
The spot nodes experience 2-3 interruptions per week. Karpenter replaces each within 60 seconds. Application-level retries handle the in-flight requests that are lost during the interruption window.