Choosing an Autoscaling Strategy

Kubernetes autoscaling operates at two distinct layers: pod-level scaling changes how many pods run or how large they are, while node-level scaling changes how many nodes exist in the cluster to host those pods. Getting the right combination of tools at each layer is the key to a system that responds to demand without wasting resources.

The Two Scaling Layers

Understanding which layer a tool operates on prevents the most common misconfiguration – expecting pod-level scaling to solve node-level capacity problems, or vice versa.

Pod-level scaling (what runs):

  • HPA – adjusts replica count
  • VPA – adjusts resource requests per pod
  • KEDA – adjusts replica count based on external event sources

Node-level scaling (where it runs):

  • Cluster Autoscaler – adds/removes nodes based on pending pods
  • Karpenter – provisions optimally-sized nodes based on pending pod requirements

In most production clusters, you need at least one tool from each layer.

Comparison Table

| Criteria | HPA | VPA | KEDA | Cluster Autoscaler | Karpenter |
|---|---|---|---|---|---|
| Scaling dimension | Horizontal (replicas) | Vertical (CPU/memory) | Horizontal (replicas) | Nodes | Nodes |
| Trigger types | CPU, memory, custom metrics | Historical resource usage | 60+ event sources (queues, cron, Prometheus, custom) | Pending pods | Pending pods |
| Scaling speed | Seconds to minutes | Minutes (requires pod restart) | Seconds to minutes | Minutes (3-10 min typical) | Seconds to minutes |
| Scale to zero | No (minReplicas >= 1) | N/A | Yes | Yes (removes empty nodes) | Yes (removes empty nodes) |
| Complexity | Low | Low-Medium | Medium | Low | Medium |
| Maturity | Built-in, GA | Beta/Stable (add-on) | CNCF Graduated | Mature, widely deployed | Production-ready, growing adoption |
| Cloud dependency | None | None | None | Cloud-specific node group config | AWS primary, GKE support emerging |
| Cost impact | More pods = more node demand | Right-sized pods = less waste | Efficient scale-to-zero | Matches node supply to demand | Optimal instance selection, spot support |

HPA (Horizontal Pod Autoscaler)

HPA adds or removes pod replicas based on observed metrics. It is built into Kubernetes and requires only a metrics source (metrics-server for CPU/memory, or a custom metrics adapter for application metrics).

Choose HPA when:

  • Your workload is stateless and can scale horizontally.
  • Scaling should be driven by CPU utilization, memory utilization, or request rate.
  • You need the simplest, most battle-tested autoscaling option.
  • Your application handles the transition well when replicas are added or removed.

Limitations:

  • Cannot scale to zero – minReplicas must be at least 1.
  • Does not help if pods are incorrectly sized (requesting too much or too little CPU/memory).
  • Scaling on memory is often unreliable because many applications do not release memory predictably.
  • Custom metrics require deploying a metrics adapter (Prometheus adapter, Datadog adapter, etc.).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```

The behavior section is critical for production. Without a stabilization window, HPA will flap – scaling down immediately after a traffic spike subsides, then scaling back up when the next request burst arrives.
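
The example above scales on CPU. When scaling should follow an application metric instead – the request-rate case mentioned earlier – HPA v2 can consume it through a custom metrics adapter. A minimal sketch of a fragment that drops into spec.metrics, assuming a Prometheus adapter exposes a per-pod http_requests_per_second metric (the metric name and target value are illustrative):

```yaml
# Fragment of an HPA v2 spec.metrics list. Assumes a custom metrics adapter
# (e.g. prometheus-adapter) exposes http_requests_per_second per pod;
# the metric name and target value are illustrative.
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"   # aim for ~100 req/s per pod on average
```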

VPA (Vertical Pod Autoscaler)

VPA analyzes historical resource usage and adjusts pod CPU and memory requests. In Auto mode, it evicts pods and recreates them with updated requests. In Off mode (recommendation-only), it provides suggestions without changing anything.

Choose VPA when:

  • You need to right-size workloads that were deployed with guessed resource requests.
  • The workload cannot scale horizontally (singletons, leader-elected processes, some databases).
  • You want data-driven resource recommendations without manual profiling.
  • Running in Off mode to feed recommendations into your deployment pipeline.

Limitations:

  • In Auto mode, VPA evicts pods to apply new resource requests. This causes brief disruption.
  • VPA and HPA cannot both target CPU on the same workload. HPA scales replicas based on CPU utilization percentages, while VPA changes the denominator (the request value) that those percentages are calculated from. This creates a feedback loop. You can combine VPA (restricted to memory) with HPA (targeting CPU) – see the sketch after the manifest below – or use VPA in Off mode alongside HPA.
  • VPA needs several hours of data before making useful recommendations.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Recommendation only, no pod eviction
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
```
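
As noted in the limitations, one way to run VPA and HPA against the same workload is to restrict VPA to memory while HPA owns replica scaling on CPU. A minimal sketch of that split, using the controlledResources field (the object name and limits are illustrative):

```yaml
# VPA managing only memory, so a CPU-based HPA can own replica scaling.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa-memory
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"                    # evicts pods to apply new memory requests
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      controlledResources: ["memory"]     # leave CPU requests untouched
      maxAllowed:
        memory: 4Gi
```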

KEDA (Kubernetes Event-Driven Autoscaler)

KEDA extends HPA with 60+ external event sources as scaling triggers. It can scale Deployments, StatefulSets, Jobs, and custom resources. Its defining feature is the ability to scale to zero replicas and back up when events arrive.

Choose KEDA when:

  • You need to scale based on queue depth (SQS, RabbitMQ, Kafka consumer lag, Azure Service Bus).
  • You need scale-to-zero capability to save costs during idle periods.
  • Your scaling trigger is not a standard Kubernetes metric (Prometheus query, cron schedule, external API).
  • You are processing events or messages rather than serving synchronous HTTP traffic.

Limitations:

  • Adds an external dependency (KEDA operator) to the cluster.
  • More configuration surface area than plain HPA.
  • Scale-from-zero has cold-start latency (pod scheduling + container startup time).
  • Debugging scaling decisions requires understanding both KEDA and HPA (KEDA creates HPA objects under the hood).

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
      queueLength: "5"
      awsRegion: us-east-1
```
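
Queue depth is only one of the available triggers. For workloads with predictable busy hours, a cron trigger added to the ScaledObject's triggers list can pre-scale ahead of load. A minimal sketch – the schedule and replica count are illustrative:

```yaml
# Cron-based trigger: hold 10 replicas during business hours, scale down after.
triggers:
- type: cron
  metadata:
    timezone: America/New_York
    start: 0 8 * * 1-5         # scale up at 08:00, Monday-Friday
    end: 0 18 * * 1-5          # scale down at 18:00
    desiredReplicas: "10"
```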

Cluster Autoscaler

Cluster Autoscaler watches for pods that cannot be scheduled due to insufficient node resources. When it detects pending pods, it adds nodes from pre-configured node groups. When nodes are underutilized, it drains and removes them.

Choose Cluster Autoscaler when:

  • You are running a standard cloud-managed cluster (EKS, GKE, AKS).
  • Your node groups are pre-defined and relatively uniform.
  • You want a mature, well-understood node scaling solution.

Limitations:

  • Slow: provisioning a new node typically takes 3-10 minutes (instance launch + kubelet registration + pod scheduling).
  • Limited instance type flexibility within a single node group.
  • Scaling decisions are reactive – pods must be pending before nodes are added.
  • Configuration is cloud-specific (ASGs on AWS, MIGs on GCP, VMSS on Azure).
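
Cluster Autoscaler itself runs as a Deployment inside the cluster and is tuned almost entirely through command-line flags rather than CRDs. A sketch of the container arguments for an AWS cluster follows – the cluster name my-cluster, image tag, and timings are placeholders, and the flags are the commonly used upstream ones:

```yaml
# Fragment of the cluster-autoscaler Deployment's container spec (AWS example).
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # match your control-plane minor version
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --expander=least-waste                  # prefer the node group that wastes the least capacity
  - --balance-similar-node-groups           # spread scale-ups across equivalent node groups
  - --scale-down-unneeded-time=10m          # how long a node must be underutilized before removal
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```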

Karpenter

Karpenter is a next-generation node provisioner that selects optimal instance types based on pending pod requirements. Instead of pre-defined node groups, you define NodePool constraints and Karpenter picks the best-fit instance from all available types.

Choose Karpenter when:

  • You are on AWS (primary support) or GKE (support emerging).
  • You want faster node provisioning (typically under 2 minutes).
  • You have diverse workload sizes and want optimal instance selection (right-sized nodes, fewer wasted resources).
  • You want native spot/preemptible instance support with automatic fallback.
  • Cost optimization is a priority and you need bin-packing across instance families.

Limitations:

  • AWS is the primary supported cloud. GKE support is available but newer. Azure support is not yet production-ready.
  • Less battle-tested than Cluster Autoscaler in edge cases.
  • Requires understanding NodePool and NodeClass CRDs.
  • Consolidation (replacing underutilized nodes with smaller ones) can cause pod disruption.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:                  # required in karpenter.sh/v1; see the EC2NodeClass sketch below
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
```
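
The NodePool only constrains what kind of capacity to launch; on AWS it also references an EC2NodeClass that tells Karpenter how to launch it (AMI, IAM role, subnets, security groups). A minimal sketch, assuming AWS resources tagged with karpenter.sh/discovery: my-cluster – the tag value, role name, and AMI alias are illustrative, and exact fields vary by Karpenter version:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default                       # matched by the NodePool's nodeClassRef
spec:
  amiSelectorTerms:
  - alias: al2023@latest              # Amazon Linux 2023, latest AMI
  role: KarpenterNodeRole-my-cluster  # IAM role assumed by provisioned nodes
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
```

Because consolidation evicts pods voluntarily, Karpenter respects PodDisruptionBudgets; defining one per workload bounds the disruption noted in the limitations above.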

Combination Patterns

Most production clusters use multiple autoscalers together. These are the proven combinations:

HPA + Cluster Autoscaler (most common): HPA scales pods based on metrics. When new pods cannot be scheduled, Cluster Autoscaler adds nodes. When pods are removed, underutilized nodes are drained. This is the default starting point for most clusters.

HPA + Karpenter (fast, cost-optimized): Same pod-level scaling as above, but Karpenter provisions optimally-sized nodes faster and with better instance type selection. Choose this over HPA + Cluster Autoscaler when speed and cost matter.

VPA (Off mode) + HPA: VPA runs in recommendation-only mode, providing right-sizing data that feeds into your CI/CD pipeline or capacity planning. HPA handles actual scaling. This avoids the VPA/HPA conflict while getting the benefits of both.

KEDA + Karpenter: KEDA scales pods based on queue depth or external events, including scale-to-zero. Karpenter rapidly provisions nodes when KEDA scales up. This is the most responsive combination for event-driven workloads.

KEDA + Cluster Autoscaler: Same pattern as above but with slower node provisioning. Acceptable when scale-from-zero latency of a few extra minutes is tolerable.

Choose X When – Summary

| Scenario | Recommended Approach |
|---|---|
| Stateless HTTP services, standard scaling | HPA + Cluster Autoscaler |
| Queue workers, event processors | KEDA + Karpenter (or Cluster Autoscaler) |
| Cost-sensitive with diverse workloads (AWS) | HPA + Karpenter with spot instances |
| Cannot scale horizontally (singleton, database) | VPA in Auto mode |
| Need right-sizing data without disruption | VPA in Off mode + HPA |
| Scale-to-zero required | KEDA |
| Fastest node provisioning | Karpenter |
| Broadest cloud support for node scaling | Cluster Autoscaler |
| New cluster, starting simple | HPA + Cluster Autoscaler, add complexity as needed |

Start with HPA and Cluster Autoscaler. Add KEDA when you have event-driven workloads. Switch to Karpenter when node provisioning speed or cost optimization becomes a bottleneck. Use VPA in Off mode from day one to collect right-sizing data, even if you do not act on it immediately.