Cluster Autoscaling#
Kubernetes autoscaling operates at two levels: pod-level (HPA adds or removes pod replicas) and node-level (Cluster Autoscaler adds or removes nodes). Getting them to work together requires understanding how each makes decisions.
Horizontal Pod Autoscaler (HPA)#
HPA adjusts the replica count of a Deployment, StatefulSet, or ReplicaSet based on observed metrics. The metrics-server must be running in your cluster for CPU and memory metrics.
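If you are not sure whether metrics-server is installed, a quick check (the deployment name and namespace below are typical but may differ in your distribution):

# Verify metrics-server is running (name/namespace assumed; adjust for your distro)
kubectl -n kube-system get deployment metrics-server

# If metrics are flowing, this returns per-pod CPU/memory usage
kubectl top pods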
Basic HPA on CPU#
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

This scales my-app between 2 and 10 replicas, targeting 70% average CPU utilization across all pods. The HPA checks metrics every 15 seconds (default) and computes the desired replica count as:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

Your pods must have CPU requests defined. Without requests, HPA cannot calculate utilization percentages and will refuse to scale.
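As a concrete example of the formula: if 4 replicas are averaging 90% CPU against a 70% target, the HPA computes ceil(4 * 90 / 70) = ceil(5.14) = 6 replicas. The utilization percentage itself is measured against each container's CPU request, so the target Deployment needs something like this (values are illustrative):

spec:
  template:
    spec:
      containers:
      - name: my-app
        resources:
          requests:
            cpu: 250m        # HPA computes utilization as usage / this request
            memory: 256Mi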
Multiple Metrics#
HPA v2 supports scaling on multiple metrics simultaneously. The HPA calculates the desired replica count for each metric independently and takes the maximum:
spec:
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"

Custom metrics (like http_requests_per_second) require a metrics adapter such as Prometheus Adapter or Datadog Cluster Agent that implements the custom.metrics.k8s.io API.
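If you use Prometheus Adapter, exposing that metric means adding a rule to the adapter's configuration. A rough sketch, assuming your app exports an http_requests_total counter with namespace and pod labels (the metric and label names are assumptions; adapt to your scrape config):

rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"          # exposed as http_requests_per_second
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'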
Scaling Behavior and Stabilization#
HPA v2 lets you control how fast scaling happens in each direction. This prevents thrashing – rapid scale-up/scale-down cycles caused by metric fluctuations:
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # scale up immediately
      policies:
      - type: Percent
        value: 100                    # can double pod count per period
        periodSeconds: 60
      - type: Pods
        value: 4                      # or add up to 4 pods per period
        periodSeconds: 60
      selectPolicy: Max               # use whichever policy allows more scaling
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 10                     # remove at most 10% of pods per period
        periodSeconds: 60
      selectPolicy: Min               # use whichever policy is more conservative

The stabilization window looks at desired replica counts over the window duration and picks the highest (for scale-down) or lowest (for scale-up). A 300-second scale-down window means the HPA will not scale down until the desired count has been consistently lower for 5 minutes.
selectPolicy: Max means “use the policy that allows the most change” (aggressive). selectPolicy: Min means “use the policy that allows the least change” (conservative). selectPolicy: Disabled prevents scaling in that direction entirely.
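For example, a workload that should only ever be scaled up automatically, with scale-down handled manually, can disable that direction entirely. A minimal sketch:

spec:
  behavior:
    scaleDown:
      selectPolicy: Disabled   # HPA will never reduce replicas on its own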
Debugging HPA#
# Check current status and events
kubectl describe hpa my-app
# Watch scaling decisions
kubectl get hpa my-app --watch
# Common problems:
# "unable to fetch metrics" -- metrics-server not running or pods have no resource requests
# "failed to get cpu utilization" -- resource requests not set on containers
# ScalingActive condition is False -- check the events for the reason

Cluster Autoscaler#
Cluster Autoscaler adjusts the number of nodes in your cluster. It runs as a deployment and interacts with your cloud provider’s API (EKS, GKE, AKS) to add or remove nodes.
Scale-up trigger: A pod is unschedulable because no node has enough resources. The Cluster Autoscaler simulates whether adding a node from any node group would allow the pod to schedule. If yes, it provisions a node.
Scale-down trigger: A node’s utilization (sum of pod requests / node capacity) drops below a threshold (default 50%) for a sustained period (default 10 minutes). The autoscaler checks if all pods on the node can be moved elsewhere. If yes, it drains and removes the node.
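The thresholds above map to flags on the cluster-autoscaler container. A sketch of the relevant arguments (image tag and values are placeholders, and the cloud-provider and node-group discovery flags for your platform are omitted):

containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # tag assumed
  command:
  - ./cluster-autoscaler
  - --scale-down-utilization-threshold=0.5   # node is a removal candidate below 50% requested
  - --scale-down-unneeded-time=10m           # ...for at least 10 minutes
  - --scale-down-delay-after-add=10m         # don't scale down right after a scale-up
  - --scan-interval=10s                      # how often to re-evaluate the cluster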
Pod Disruption Budgets#
PDBs protect workloads during node scale-down. The Cluster Autoscaler respects PDBs – it will not drain a node if doing so violates a PDB:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2          # or use maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app

Without PDBs, the Cluster Autoscaler can drain all replicas of your app simultaneously during scale-down. Always define PDBs for production workloads.
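To see whether the budget currently permits any voluntary evictions, check its status; the ALLOWED DISRUPTIONS column shows how many pods can be evicted right now:

kubectl get pdb my-app-pdb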
Scale-Down Blockers#
Certain pods prevent node removal:
- Pods with local storage (emptyDir is fine, hostPath is not by default)
- Pods not managed by a controller (bare pods without a Deployment/ReplicaSet)
- Pods with the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation
- Kube-system pods without a PDB
Annotate pods that are safe to evict even though they have local storage:
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"

KEDA: Event-Driven Autoscaling#
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with external event sources. It can scale based on queue depth, database connections, cron schedules, Prometheus queries, and dozens of other sources.
Install KEDA:
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

Scale on Queue Depth#
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor    # Deployment name
  minReplicaCount: 0         # KEDA can scale to zero (HPA cannot)
  maxReplicaCount: 20
  cooldownPeriod: 300
  triggers:
  - type: rabbitmq
    metadata:
      host: amqp://guest:guest@rabbitmq.default.svc:5672/
      queueName: orders
      queueLength: "5"       # 1 pod per 5 messages in queue

KEDA also supports Prometheus queries, cron schedules, Kafka consumer lag, AWS SQS, and dozens of other trigger types. Its key advantage over plain HPA is scale-to-zero. Standard HPA requires minReplicas >= 1. KEDA manages the zero-to-one transition by watching the event source directly. Once the first pod is running, KEDA hands off to HPA for further scaling.
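As another illustration, a cron trigger can pre-scale a workload during known busy hours; the schedule and replica count below are purely illustrative:

triggers:
- type: cron
  metadata:
    timezone: UTC
    start: 0 8 * * *          # scale up at 08:00
    end: 0 18 * * *           # scale back down at 18:00
    desiredReplicas: "10"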
Putting It Together#
A production autoscaling setup typically combines all of the above:
- HPA scales pods based on CPU/memory and custom metrics.
- Cluster Autoscaler adds nodes when pending pods cannot be scheduled.
- PDBs protect workloads during scale-down events.
- KEDA handles event-driven workloads that need scale-to-zero.
The interaction is: HPA requests more pods, pods go to Pending because nodes are full, Cluster Autoscaler detects pending pods and provisions a node, the new pods get scheduled. On scale-down, HPA reduces replicas, Cluster Autoscaler detects underutilized nodes, respects PDBs, drains, and removes nodes.
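A quick way to watch this interaction end to end (the cluster-autoscaler deployment name and namespace are assumptions; managed offerings may run it elsewhere or hide it entirely):

# Pods the HPA created that cannot be scheduled yet
kubectl get pods --field-selector=status.phase=Pending

# Scale-up/scale-down decisions and their reasons
kubectl -n kube-system logs deploy/cluster-autoscaler | grep -i scale

# Node count changing as the autoscaler acts
kubectl get nodes --watch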