Choosing an Autoscaling Strategy#
Kubernetes autoscaling operates at two distinct layers: pod-level scaling changes how many pods run or how large they are, while node-level scaling changes how many nodes exist in the cluster to host those pods. Getting the right combination of tools at each layer is the key to a system that responds to demand without wasting resources.
The Two Scaling Layers#
Understanding which layer a tool operates on prevents the most common misconfiguration – expecting pod-level scaling to solve node-level capacity problems, or vice versa.
Pod-level scaling (what runs):
- HPA – adjusts replica count
- VPA – adjusts resource requests per pod
- KEDA – adjusts replica count based on external event sources
Node-level scaling (where it runs):
- Cluster Autoscaler – adds/removes nodes based on pending pods
- Karpenter – provisions optimally-sized nodes based on pending pod requirements
In most production clusters, you need at least one tool from each layer.
Comparison Table#
| Criteria | HPA | VPA | KEDA | Cluster Autoscaler | Karpenter |
|---|---|---|---|---|---|
| Scaling dimension | Horizontal (replicas) | Vertical (CPU/memory) | Horizontal (replicas) | Nodes | Nodes |
| Trigger types | CPU, memory, custom metrics | Historical resource usage | 60+ event sources (queues, cron, Prometheus, custom) | Pending pods | Pending pods |
| Scaling speed | Seconds to minutes | Minutes (requires pod restart) | Seconds to minutes | Minutes (3-10 min typical) | Seconds to minutes |
| Scale to zero | No (minReplicas >= 1) | N/A | Yes | Yes (remove empty nodes) | Yes (remove empty nodes) |
| Complexity | Low | Low-Medium | Medium | Low | Medium |
| Maturity | Built-in, GA | Beta/Stable (addon) | CNCF Graduated | Mature, widely deployed | Production-ready, growing adoption |
| Cloud dependency | None | None | None | Cloud-specific node group config | AWS primary, GKE support emerging |
| Cost impact | More pods = more node demand | Right-sized pods = less waste | Efficient scale-to-zero | Matches node supply to demand | Optimal instance selection, spot support |
HPA (Horizontal Pod Autoscaler)#
HPA adds or removes pod replicas based on observed metrics. It is built into Kubernetes and requires only a metrics source (metrics-server for CPU/memory, or a custom metrics adapter for application metrics).
Choose HPA when:
- Your workload is stateless and can scale horizontally.
- Scaling should be driven by CPU utilization, memory utilization, or request rate.
- You need the simplest, most battle-tested autoscaling option.
- Your application handles the transition well when replicas are added or removed.
Limitations:
- Cannot scale to zero – minReplicas must be at least 1.
- Does not help if pods are incorrectly sized (requesting too much or too little CPU/memory).
- Scaling on memory is often unreliable because many applications do not release memory predictably.
- Custom metrics require deploying a metrics adapter (Prometheus adapter, Datadog adapter, etc.); a request-rate sketch follows the manifest below.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
The behavior section is critical for production. Without a stabilization window, HPA will flap – scaling down immediately after a traffic spike subsides, then scaling back up when the next request burst arrives.
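If scaling should follow request rate rather than CPU (which, per the limitation above, requires a metrics adapter), the HPA switches to a Pods metric. A minimal sketch, assuming a prometheus-adapter deployment exposes a hypothetical per-pod http_requests_per_second metric – both the adapter and the metric name are assumptions, not part of the example above:

```yaml
# Hypothetical request-rate HPA; assumes a custom metrics adapter
# (e.g. prometheus-adapter) serves the per-pod metric
# "http_requests_per_second" for the api-server pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-rps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"   # aim for ~100 requests/sec per replica
```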
VPA (Vertical Pod Autoscaler)#
VPA analyzes historical resource usage and adjusts pod CPU and memory requests. In Auto mode, it evicts pods and recreates them with updated requests. In Off mode (recommendation-only), it provides suggestions without changing anything.
Choose VPA when:
- You need to right-size workloads that were deployed with guessed resource requests.
- The workload cannot scale horizontally (singletons, leader-elected processes, some databases).
- You want data-driven resource recommendations without manual profiling.
- Running in Off mode to feed recommendations into your deployment pipeline.
Limitations:
- In Auto mode, VPA evicts pods to apply new resource requests. This causes brief disruption.
- VPA and HPA cannot both target CPU on the same workload. HPA scales replicas based on CPU utilization percentages, while VPA changes the denominator (the request value) that those percentages are calculated from. This creates a feedback loop. You can combine VPA (targeting memory) with HPA (targeting CPU) – see the sketch after this list – or use VPA in Off mode alongside HPA.
- VPA needs several hours of data before making useful recommendations.
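To make the memory-plus-CPU split concrete, here is a hedged sketch of a VPA restricted to memory via controlledResources, targeting the same api-server Deployment as the Off-mode example that follows. The manifest name and the bounds are illustrative, not prescribed:

```yaml
# Hypothetical VPA that manages only memory so HPA can keep scaling
# replicas on CPU; targets the same api-server Deployment used in the
# Off-mode example below.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa-memory
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"                    # evicts pods to apply new memory requests
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      controlledResources: ["memory"]     # leave CPU requests as declared
      minAllowed:
        memory: 128Mi
      maxAllowed:
        memory: 4Gi
```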
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Recommendation only, no pod eviction
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
```
KEDA (Kubernetes Event-Driven Autoscaler)#
KEDA extends HPA with 60+ external event sources as scaling triggers. It can scale Deployments, StatefulSets, Jobs, and custom resources. Its defining feature is the ability to scale to zero replicas and back up when events arrive.
Choose KEDA when:
- You need to scale based on queue depth (SQS, RabbitMQ, Kafka consumer lag, Azure Service Bus).
- You need scale-to-zero capability to save costs during idle periods.
- Your scaling trigger is not a standard Kubernetes metric (Prometheus query, cron schedule, external API) – see the Prometheus sketch below.
- You are processing events or messages rather than serving synchronous HTTP traffic.
Limitations:
- Adds an external dependency (KEDA operator) to the cluster.
- More configuration surface area than plain HPA.
- Scale-from-zero has cold-start latency (pod scheduling + container startup time).
- Debugging scaling decisions requires understanding both KEDA and HPA (KEDA creates HPA objects under the hood).
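Non-queue triggers use the same ScaledObject shape as the SQS example that follows. A hedged sketch of a Prometheus-driven trigger – the server address, PromQL query, and threshold are placeholders for illustration:

```yaml
# Hypothetical Prometheus-driven ScaledObject; address and query are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server-rps
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(rate(http_requests_total{deployment="api-server"}[2m]))
      threshold: "100"          # roughly one replica per 100 req/s
```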
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
      queueLength: "5"
      awsRegion: us-east-1
```
Cluster Autoscaler#
Cluster Autoscaler watches for pods that cannot be scheduled due to insufficient node resources. When it detects pending pods, it adds nodes from pre-configured node groups. When nodes are underutilized, it drains and removes them.
Choose Cluster Autoscaler when:
- You are running a standard cloud-managed cluster (EKS, GKE, AKS).
- Your node groups are pre-defined and relatively uniform.
- You want a mature, well-understood node scaling solution.
Limitations:
- Slow: provisioning a new node typically takes 3-10 minutes (instance launch + kubelet registration + pod scheduling).
- Limited instance type flexibility within a single node group.
- Scaling decisions are reactive – pods must be pending before nodes are added.
- Configuration is cloud-specific (ASGs on AWS, MIGs on GCP, VMSS on Azure).
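Unlike the other tools here, Cluster Autoscaler is configured through command-line flags on its own Deployment rather than a CRD. A hedged sketch of commonly used arguments for AWS auto-discovery – the cluster name and tag values are placeholders, and exact flags vary by version and cloud:

```yaml
# Excerpt from the cluster-autoscaler Deployment's container spec (AWS example).
# "my-cluster" and the ASG tags are placeholders for your own setup.
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --expander=least-waste                 # prefer node groups that waste the least capacity
- --balance-similar-node-groups
- --scale-down-unneeded-time=10m         # how long a node must be underutilized before removal
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```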
Karpenter#
Karpenter is a next-generation node provisioner that selects optimal instance types based on pending pod requirements. Instead of pre-defined node groups, you define NodePool constraints and Karpenter picks the best-fit instance from all available types.
Choose Karpenter when:
- You are on AWS (primary support) or GKE (support emerging).
- You want faster node provisioning (typically under 2 minutes).
- You have diverse workload sizes and want optimal instance selection (right-sized nodes, fewer wasted resources).
- You want native spot/preemptible instance support with automatic fallback.
- Cost optimization is a priority and you need bin-packing across instance families.
Limitations:
- AWS is the primary supported cloud. GKE support is available but newer. Azure support is not yet production-ready.
- Less battle-tested than Cluster Autoscaler in edge cases.
- Requires understanding NodePool and NodeClass CRDs (see the NodeClass sketch below).
- Consolidation (replacing underutilized nodes with smaller ones) can cause pod disruption.
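The NodePool below points at an EC2NodeClass, which carries the AWS-specific settings (AMI selection, IAM role, subnets, security groups). A hedged sketch with placeholder names and discovery tags:

```yaml
# Companion to the NodePool below; all names and tags are placeholders.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2023@latest                  # Amazon Linux 2023, latest AMI
  role: KarpenterNodeRole-my-cluster      # IAM role assumed by provisioned nodes
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
```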
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:                 # points at the EC2NodeClass defined above
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
```
Combination Patterns#
Most production clusters use multiple autoscalers together. These are the proven combinations:
HPA + Cluster Autoscaler (most common): HPA scales pods based on metrics. When new pods cannot be scheduled, Cluster Autoscaler adds nodes. When pods are removed, underutilized nodes are drained. This is the default starting point for most clusters.
HPA + Karpenter (fast, cost-optimized): Same pod-level scaling as above, but Karpenter provisions optimally-sized nodes faster and with better instance type selection. Choose this over HPA + Cluster Autoscaler when speed and cost matter.
VPA (Off mode) + HPA: VPA runs in recommendation-only mode, providing right-sizing data that feeds into your CI/CD pipeline or capacity planning. HPA handles actual scaling. This avoids the VPA/HPA conflict while getting the benefits of both.
KEDA + Karpenter: KEDA scales pods based on queue depth or external events, including scale-to-zero. Karpenter rapidly provisions nodes when KEDA scales up. This is the most responsive combination for event-driven workloads.
KEDA + Cluster Autoscaler: Same pattern as above but with slower node provisioning. Acceptable when scale-from-zero latency of a few extra minutes is tolerable.
Choose X When – Summary#
| Scenario | Recommended Approach |
|---|---|
| Stateless HTTP services, standard scaling | HPA + Cluster Autoscaler |
| Queue workers, event processors | KEDA + Karpenter (or Cluster Autoscaler) |
| Cost-sensitive with diverse workloads (AWS) | HPA + Karpenter with spot instances |
| Cannot scale horizontally (singleton, database) | VPA in Auto mode |
| Need right-sizing data without disruption | VPA in Off mode + HPA |
| Scale-to-zero required | KEDA |
| Fastest node provisioning | Karpenter |
| Broadest cloud support for node scaling | Cluster Autoscaler |
| New cluster, starting simple | HPA + Cluster Autoscaler, add complexity as needed |
Start with HPA and Cluster Autoscaler. Add KEDA when you have event-driven workloads. Switch to Karpenter when node provisioning speed or cost optimization becomes a bottleneck. Use VPA in Off mode from day one to collect right-sizing data, even if you do not act on it immediately.