Cluster Autoscaling#

Kubernetes autoscaling operates at two levels: pod-level (HPA adds or removes pod replicas) and node-level (Cluster Autoscaler adds or removes nodes). Getting them to work together requires understanding how each makes decisions.

Horizontal Pod Autoscaler (HPA)#

HPA adjusts the replica count of a Deployment, StatefulSet, or ReplicaSet based on observed metrics. The metrics-server add-on must be running in your cluster for CPU and memory metrics to be available.

Basic HPA on CPU#

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

This scales my-app between 2 and 10 replicas, targeting 70% average CPU utilization across all pods. The HPA checks metrics every 15 seconds (default) and computes the desired replica count as:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
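
For example, 4 replicas averaging 90% CPU against a 70% target gives ceil(4 * 90 / 70) = ceil(5.14) = 6 replicas.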

Your pods must have CPU requests defined. Without requests, HPA cannot calculate utilization percentages and will refuse to scale.
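
For reference, a minimal sketch of the corresponding requests block in the Deployment's pod template (the container name, image, and values here are illustrative):

spec:
  template:
    spec:
      containers:
        - name: my-app
          image: my-app:latest     # illustrative image
          resources:
            requests:
              cpu: 250m            # HPA computes utilization as usage / this request
              memory: 256Mi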

Multiple Metrics#

HPA v2 supports scaling on multiple metrics simultaneously. The HPA calculates the desired replica count for each metric independently and takes the maximum:

spec:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
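
For example, if the CPU metric alone calls for 4 replicas, memory for 3, and http_requests_per_second for 6, the HPA sets 6 replicas.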

Custom metrics (like http_requests_per_second) require a metrics adapter such as Prometheus Adapter or Datadog Cluster Agent that implements the custom.metrics.k8s.io API.
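
To sketch how a metric like this gets exposed, a Prometheus Adapter rule might look roughly like the following (assuming an http_requests_total counter labeled by namespace and pod; adapt the queries to your setup):

rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    # convert the cumulative counter into a per-second rate
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'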

Scaling Behavior and Stabilization#

HPA v2 lets you control how fast scaling happens in each direction. This prevents thrashing – rapid scale-up/scale-down cycles caused by metric fluctuations:

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # scale up immediately
      policies:
        - type: Percent
          value: 100                    # can double pod count per period
          periodSeconds: 60
        - type: Pods
          value: 4                      # or add up to 4 pods per period
          periodSeconds: 60
      selectPolicy: Max                 # use whichever policy allows more scaling

    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 10                     # remove at most 10% of pods per period
          periodSeconds: 60
      selectPolicy: Min                 # use whichever policy is more conservative

The stabilization window looks at desired replica counts over the window duration and picks the highest (for scale-down) or lowest (for scale-up). A 300-second scale-down window means the HPA will not scale down until the desired count has been consistently lower for 5 minutes.

selectPolicy: Max means “use the policy that allows the most change” (aggressive). selectPolicy: Min means “use the policy that allows the least change” (conservative). selectPolicy: Disabled prevents scaling in that direction entirely.
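
For example, to let a workload scale up automatically but never scale down on its own, you can disable that direction outright:

spec:
  behavior:
    scaleDown:
      selectPolicy: Disabled   # replicas only decrease via manual changes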

Debugging HPA#

# Check current status and events
kubectl describe hpa my-app

# Watch scaling decisions
kubectl get hpa my-app --watch

# Common problems:
# "unable to fetch metrics" -- metrics-server not running or pods have no resource requests
# "failed to get cpu utilization" -- resource requests not set on containers
# ScalingActive condition is False -- check the events for the reason
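
# Verify metrics-server is serving the resource metrics API
kubectl get apiservice v1beta1.metrics.k8s.io
kubectl top pods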

Cluster Autoscaler#

Cluster Autoscaler adjusts the number of nodes in your cluster. It runs as a Deployment inside the cluster and talks to your cloud provider's node-group APIs (EC2 Auto Scaling groups on EKS, managed instance groups on GKE, virtual machine scale sets on AKS) to add or remove nodes.

Scale-up trigger: A pod is unschedulable because no node has enough resources. The Cluster Autoscaler simulates whether adding a node from any node group would allow the pod to schedule. If yes, it provisions a node.

Scale-down trigger: A node’s utilization (sum of pod requests / node capacity) drops below a threshold (default 50%) for a sustained period (default 10 minutes). The autoscaler checks if all pods on the node can be moved elsewhere. If yes, it drains and removes the node.
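
Both thresholds are set by flags on the Cluster Autoscaler itself. A sketch of the relevant container args with the defaults written out (how these are set depends on your install method):

command:
  - ./cluster-autoscaler
  - --scale-down-utilization-threshold=0.5   # the 50% utilization threshold
  - --scale-down-unneeded-time=10m           # the 10-minute sustained period
  - --scan-interval=10s                      # how often it re-evaluates the cluster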

Pod Disruption Budgets#

PDBs protect workloads during node scale-down. The Cluster Autoscaler respects PDBs – it will not drain a node if doing so violates a PDB:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2        # or use maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app

Without PDBs, the Cluster Autoscaler can drain all replicas of your app simultaneously during scale-down. Always define PDBs for production workloads.
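
You can check how much eviction headroom a PDB currently allows:

kubectl get pdb my-app-pdb
# ALLOWED DISRUPTIONS is how many pods may be evicted right now;
# 0 means the Cluster Autoscaler cannot drain a node running these pods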

Scale-Down Blockers#

Certain pods prevent node removal:

  • Pods with local storage (by default this includes both emptyDir and hostPath volumes)
  • Pods not managed by a controller (bare pods without a Deployment/ReplicaSet)
  • Pods with the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation
  • kube-system pods without a PDB

Annotate pods that are safe to evict even though they have local storage:

metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
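
To see why a particular node is or is not a scale-down candidate, the autoscaler writes its reasoning to a status ConfigMap (the name below is the default; it can be changed via --status-config-map-name):

kubectl -n kube-system describe configmap cluster-autoscaler-status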

KEDA: Event-Driven Autoscaling#

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with external event sources. It can scale based on queue depth, database connections, cron schedules, Prometheus queries, and dozens of other sources.

Install KEDA:

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

Scale on Queue Depth#

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor    # Deployment name
  minReplicaCount: 0         # KEDA can scale to zero (HPA cannot)
  maxReplicaCount: 20
  cooldownPeriod: 300
  triggers:
    - type: rabbitmq
      metadata:
        host: amqp://guest:guest@rabbitmq.default.svc:5672/
        queueName: orders
        queueLength: "5"     # 1 pod per 5 messages in queue
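
Note that the trigger above embeds credentials in the manifest. KEDA's TriggerAuthentication resource can pull them from a Secret instead; a minimal sketch, assuming a Secret named rabbitmq-conn whose host key holds the full AMQP URL:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth
spec:
  secretTargetRef:
    - parameter: host        # fills the trigger's host field
      name: rabbitmq-conn    # hypothetical Secret name
      key: host

The trigger then references it via authenticationRef: {name: rabbitmq-auth} and drops the inline host.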

KEDA also supports Prometheus queries, cron schedules, Kafka consumer lag, AWS SQS, and dozens of other trigger types. Its key advantage over plain HPA is scale-to-zero. Standard HPA requires minReplicas >= 1. KEDA manages the zero-to-one transition by watching the event source directly. Once the first pod is running, KEDA hands off to HPA for further scaling.
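
As another example, KEDA's cron trigger can hold a floor of replicas during a recurring window; a sketch (schedule, timezone, and count are illustrative):

triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 8 * * *          # scale up at 08:00
      end: 0 18 * * *           # release the floor at 18:00
      desiredReplicas: "5"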

Putting It Together#

A production autoscaling setup typically combines all three:

  1. HPA scales pods based on CPU/memory and custom metrics.
  2. Cluster Autoscaler adds nodes when pending pods cannot be scheduled.
  3. PDBs protect workloads during scale-down events.
  4. KEDA handles event-driven workloads that need scale-to-zero.

The interaction is: HPA requests more pods, pods go to Pending because nodes are full, Cluster Autoscaler detects pending pods and provisions a node, the new pods get scheduled. On scale-down, HPA reduces replicas, Cluster Autoscaler detects underutilized nodes, respects PDBs, drains, and removes nodes.
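
You can watch this chain happen (the autoscaler's namespace and pod label below are common defaults and may differ in your cluster):

kubectl get pods --field-selector=status.phase=Pending
kubectl get events --field-selector reason=TriggeredScaleUp
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=20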