What Capacity Planning Solves#
Running out of capacity during a traffic spike causes outages. Over-provisioning wastes money continuously. Capacity planning is the process of measuring what you use now, projecting what you will need, and ensuring resources are available before demand arrives. Without it, you are either constantly firefighting resource exhaustion or explaining to finance why your cloud bill doubled.
Capacity planning is not a one-time exercise. It is a recurring process – monthly for fast-growing services, quarterly for stable ones.
Step 1: Measure the Current Baseline#
You cannot plan capacity without knowing current utilization. Measure these resources across every service:
CPU utilization: Average and peak, per pod and per node.
# Average CPU usage per pod over 24 hours
avg_over_time(
  rate(container_cpu_usage_seconds_total{namespace="production"}[5m])[24h:5m]
)
# Peak CPU usage per node over 7 days
max_over_time(
  instance:node_cpu_utilisation:rate5m[7d]
)
Memory utilization: Current usage vs requests vs limits, and RSS (resident set size) trends.
# Memory usage as percentage of request
sum by (pod, container) (container_memory_working_set_bytes{namespace="production"})
/
sum by (pod, container) (kube_pod_container_resource_requests{resource="memory", namespace="production"})
# Memory RSS trend over 30 days (detect leaks)
avg_over_time(
  container_memory_rss{namespace="production"}[30d:1h]
)
Disk utilization: Persistent volume usage and growth rate.
# PV usage percentage
kubelet_volume_stats_used_bytes
/
kubelet_volume_stats_capacity_bytes
# Disk growth rate (bytes per day)
predict_linear(
  kubelet_volume_stats_used_bytes[7d], 86400
) - kubelet_volume_stats_used_bytes
Network: Bandwidth consumption, connection counts, and DNS query rates.
# Network throughput per pod
sum by (pod) (rate(container_network_transmit_bytes_total{namespace="production"}[5m]))
Application-specific metrics: Database connections, queue depth, cache hit rate, active sessions.
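Beyond one-off queries, you can script the snapshot against the standard Prometheus HTTP API (/api/v1/query) so the baseline is reproducible. A minimal sketch in Python; the Prometheus URL and the query list are placeholders to adapt, and it assumes the requests package is installed:
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # placeholder: point at your Prometheus

# Illustrative query set; extend it with the memory, disk, and network queries above.
QUERIES = {
    "cpu_avg_cores": 'avg_over_time(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])[24h:5m])',
    "mem_working_set_bytes": 'container_memory_working_set_bytes{namespace="production"}',
}

def instant_query(promql):
    """Run an instant query and return the list of series in the result."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

for name, promql in QUERIES.items():
    for series in instant_query(promql):
        pod = series["metric"].get("pod", "unknown")
        value = float(series["value"][1])
        print(f"{name}\t{pod}\t{value:.3f}")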
Capture these metrics with kubectl for a quick snapshot:
# Current resource usage across all pods in a namespace
kubectl top pods -n production --sort-by=cpu
# Node-level resource usage
kubectl top nodes
# Resource requests vs limits for all pods
kubectl get pods -n production -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
CPU_LIM:.spec.containers[0].resources.limits.cpu,\
MEM_REQ:.spec.containers[0].resources.requests.memory,\
MEM_LIM:.spec.containers[0].resources.limits.memory
# PersistentVolume usage
kubectl get pv -o custom-columns=\
NAME:.metadata.name,\
CAPACITY:.spec.capacity.storage,\
STATUS:.status.phase,\
CLAIM:.spec.claimRef.name
Build a baseline table:
Service: payment-api
Date: 2026-02-22
Replicas: 4
CPU request per pod: 250m
CPU actual per pod (avg): 120m (48% of request)
CPU actual per pod (peak): 210m (84% of request)
Memory request per pod: 512Mi
Memory actual per pod (avg): 340Mi (66% of request)
Memory actual per pod (peak): 480Mi (94% of request)
Requests per second (avg): 200
Requests per second (peak): 850
Database connections (avg): 12 per pod
Database connections (peak): 45 per pod
Step 2: Project Growth#
Growth projection estimates future resource needs based on historical trends and business context.
Linear projection: The simplest model. Measure resource consumption at two points in time and extend the line.
# Predict CPU usage 90 days from now based on 30-day trend
predict_linear(
  rate(container_cpu_usage_seconds_total{pod=~"payment-api.*"}[5m])[30d:1h],
  90 * 86400
)
Business-driven projection: Talk to product and sales teams. If the company expects to double its user base in 6 months, a linear projection from current metrics will underestimate needs. Key questions (a short sketch combining the two signals follows this list):
- Are new markets or regions being launched?
- Are there planned marketing campaigns that will spike traffic?
- Is a new feature expected to change usage patterns (e.g., adding real-time notifications increases WebSocket connections)?
- Are there contractual commitments to onboard large customers?
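The sketch mentioned above: compute the measured trend and the business expectation separately and plan for whichever is larger. All numbers are hypothetical placeholders:
# Plan for the larger of measured organic growth and the business expectation.
measured_monthly_growth = 0.05     # hypothetical, e.g. derived from the predict_linear trend
business_multiplier_6mo = 2.0      # hypothetical: sales expects 2x users in 6 months
months = 6

organic_factor = (1 + measured_monthly_growth) ** months     # ~1.34x
planning_factor = max(organic_factor, business_multiplier_6mo)

current_peak_rps = 850
print(f"plan for ~{current_peak_rps * planning_factor:.0f} req/s peak ({planning_factor:.2f}x current)")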
Traffic-based projection: Map resource consumption to traffic volume. If 1000 requests per second requires 4 pods at 250m CPU each, then 2000 requests per second will require roughly 8 pods. This is more useful than time-based projection because it ties capacity to a measurable driver.
Current: 200 req/s average, 4 pods, 120m CPU/pod
Per-request CPU cost: 120m * 4 / 200 = 2.4m per request/s
Projected: 400 req/s average (6 months)
Required CPU: 400 * 2.4m = 960m = ~4 pods at 250m request
Accounting for peak (4.25x average): 1700 req/s peak
Peak CPU: 1700 * 2.4m = 4080m = ~17 pods at 250m request
This reveals that you might run 4 pods normally but need to scale to 17 during peak. Autoscaling must handle that range.
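The same arithmetic as a small script you can rerun whenever the per-request cost or the traffic forecast changes (numbers taken from the worked example above):
import math

# Per-request CPU cost measured from the current baseline (millicores per req/s).
avg_rps, pods, avg_cpu_per_pod_m = 200, 4, 120
cpu_per_rps_m = avg_cpu_per_pod_m * pods / avg_rps       # 2.4m per req/s

def pods_needed(rps, cpu_request_m=250):
    """Pods required to serve rps at the measured per-request CPU cost."""
    return math.ceil(rps * cpu_per_rps_m / cpu_request_m)

print(pods_needed(400))     # 4 pods for the 6-month average
print(pods_needed(1700))    # 17 pods for the projected peak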
Step 3: Calculate Headroom#
Headroom is the buffer between current capacity and maximum capacity. It absorbs unexpected spikes, handles the time lag of autoscaling, and prevents the system from running at its absolute limit.
Target utilization: Do not plan for 100% utilization. Common targets:
- CPU: 60-70% average utilization target (leaves room for spikes)
- Memory: 70-80% average utilization target (memory spikes can cause OOM kills)
- Disk: 70% utilization trigger for expansion (disk operations degrade severely above 80%)
- Database connections: 60% of pool maximum (connection storms during restarts)
Headroom formula:
required_capacity = current_usage / target_utilization
headroom = required_capacity - current_usage
Example:
Current CPU usage: 4 cores
Target utilization: 65%
Required capacity: 4 / 0.65 = 6.15 cores
Headroom needed: 6.15 - 4 = 2.15 cores (54% over current usage)
Autoscaler lag headroom: If your Horizontal Pod Autoscaler takes 3 minutes to provision new pods and traffic can spike 2x in 1 minute, you need enough standing capacity to absorb 3 minutes of spike before autoscaling kicks in.
spike_duration_before_scaling = 3 minutes
total_spike_duration = how long the spike lasts end to end
spike_magnitude = 2x current traffic
burst_headroom = current_capacity * (spike_magnitude - 1) * (spike_duration_before_scaling / total_spike_duration)
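Both calculations are small enough to keep as a script next to the capacity plan. A sketch using the numbers above; the 10-minute total spike duration is an illustrative assumption:
# Headroom from a target utilization (numbers from the example above).
current_cpu_cores = 4.0
target_utilization = 0.65
required = current_cpu_cores / target_utilization        # ~6.15 cores
headroom = required - current_cpu_cores                  # ~2.15 cores (~54% over current)

# Burst headroom for autoscaler lag. total_spike_minutes is an assumption:
# how long the spike lasts end to end.
spike_magnitude = 2.0        # traffic doubles
lag_minutes = 3              # time before new pods are serving traffic
total_spike_minutes = 10     # illustrative
burst = current_cpu_cores * (spike_magnitude - 1) * (lag_minutes / total_spike_minutes)

print(f"required={required:.2f} cores, headroom={headroom:.2f}, burst={burst:.2f}")
Step 4: Configure Scaling Triggers#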
Autoscaling translates capacity plans into automated responses.
Horizontal Pod Autoscaler (HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 4
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 120
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
Key settings explained:
- minReplicas: 4: Never go below 4 pods. This is your standing capacity that handles baseline traffic and absorbs spikes during autoscaler lag.
- maxReplicas: 20: The ceiling. Set this based on your capacity plan, not arbitrarily. Going higher than your infrastructure supports (node capacity, database connection limits) creates false confidence.
- scaleUp stabilization: 60s: Wait 60 seconds before scaling up to avoid reacting to momentary blips. For latency-sensitive services, reduce this.
- scaleDown stabilization: 300s: Wait 5 minutes before scaling down. Aggressive scale-down causes flapping when traffic oscillates.
- scaleUp rate: 50% per 60s: Add up to 50% more pods each minute. 4 pods can become 6, then 9, then 14 (see the sketch after this list). Fast enough for most spikes.
- scaleDown rate: 25% per 120s: Remove at most 25% of pods every 2 minutes. Slow scale-down is safer.
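The sketch referenced above: how the 50%-per-minute policy compounds from minReplicas toward the ceiling. Real HPA behavior also depends on the metric values and the stabilization window, so treat this as an upper bound on scale-up speed:
import math

# Upper bound on scale-up speed under the Percent policy above:
# at most 50% more pods every 60 seconds, capped at maxReplicas.
replicas, max_replicas = 4, 20
for minute in range(1, 6):
    replicas = min(math.ceil(replicas * 1.5), max_replicas)
    print(f"after {minute} min: {replicas} pods")
# after 1 min: 6 pods, then 9, 14, and 20 (ceiling reached in 4 minutes)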
Vertical Pod Autoscaler (VPA): Adjusts CPU and memory requests based on actual usage. Useful for right-sizing rather than horizontal scaling. Do not use HPA and VPA on the same metric – they conflict.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: payment-api
      minAllowed:
        cpu: 100m
        memory: 256Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
Cluster Autoscaler: Adds nodes when pods are pending due to insufficient node capacity. Configure it in your cloud provider’s managed Kubernetes settings. Set minimum and maximum node counts aligned with your capacity plan and budget.
Step 5: Forecast Costs#
Capacity planning without cost forecasting is incomplete. Map projected resource needs to actual cloud spend.
Current state:
4x m5.xlarge nodes ($0.192/hr each) = $0.768/hr = $561/month
500GB gp3 EBS ($0.08/GB/month) = $40/month
RDS db.r5.large ($0.24/hr) = $175/month
Total: $776/month
6-month projection (2x traffic):
8x m5.xlarge nodes = $1,122/month
1TB gp3 EBS = $80/month
RDS db.r5.xlarge ($0.48/hr) = $350/month
Total: $1,552/month (2x current)
Present cost forecasts as a range:
- Conservative (linear growth continues): $1,200/month
- Expected (business plan growth): $1,552/month
- Aggressive (viral growth or large customer onboarding): $2,400/month
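The arithmetic behind these figures is worth keeping as a script alongside the plan so the forecast can be re-run when rates or counts change. A sketch using the on-demand rates above and roughly 730 hours per month:
HOURS_PER_MONTH = 730

def monthly_cost(nodes, node_hourly, ebs_gb, ebs_per_gb_month, rds_hourly):
    """Rough monthly cost: compute nodes + EBS storage + RDS, on-demand pricing."""
    return (nodes * node_hourly * HOURS_PER_MONTH
            + ebs_gb * ebs_per_gb_month
            + rds_hourly * HOURS_PER_MONTH)

current = monthly_cost(4, 0.192, 500, 0.08, 0.24)       # ~$776
expected = monthly_cost(8, 0.192, 1000, 0.08, 0.48)     # ~$1,552
print(f"current ~${current:,.0f}/month, expected ~${expected:,.0f}/month")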
Cost optimization levers to include in the forecast:
- Reserved instances or savings plans for baseline capacity (30-50% savings)
- Spot instances for fault-tolerant workloads (60-80% savings)
- Right-sizing underutilized instances (VPA recommendations)
- Autoscaling down during off-peak hours
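Applied to the node spend in the expected forecast, the first two levers look roughly like this; the baseline/burst split and the exact discounts are assumptions to replace with real quotes:
# Illustrative only: apply mid-range discounts to the projected node spend.
node_monthly = 8 * 0.192 * 730             # ~$1,121 of the $1,552 expected total
baseline_share = 0.5                       # assumption: half the nodes run 24/7
ri_discount, spot_discount = 0.35, 0.70    # mid-range values from the list above

optimized = (node_monthly * baseline_share * (1 - ri_discount)
             + node_monthly * (1 - baseline_share) * (1 - spot_discount))
print(f"node spend ~${optimized:,.0f}/month vs ~${node_monthly:,.0f} on-demand")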
Step 6: Account for Seasonality#
Many services have predictable traffic patterns that linear projections miss.
Daily patterns: Traffic peaks during business hours and drops overnight. If your baseline measurement was taken at 2 PM, it does not represent 3 AM traffic. Measure over a full week minimum.
Weekly patterns: B2B services see lower traffic on weekends. E-commerce sees higher traffic on weekends. Know your pattern.
Seasonal patterns: Retail peaks in November-December. Tax services peak in March-April. Educational platforms peak in September. Build these into your capacity plan.
Scheduled events: Marketing campaigns, product launches, and sales events. Get dates from the business calendar and pre-scale before the event.
Pre-scaling for known events:
# Scale up payment-api before Black Friday
kubectl scale deployment payment-api -n production --replicas=16
# Or use a CronJob to auto-scale
# Scale up Monday at 8 AM, scale down Friday at 6 PM
Better yet, use KEDA (Kubernetes Event-Driven Autoscaling) with scheduled scaling:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payment-api-scheduled
spec:
  scaleTargetRef:
    name: payment-api
  minReplicaCount: 4
  maxReplicaCount: 20
  triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * 1-5" # weekday business hours
      end: "0 18 * * 1-5"
      desiredReplicas: "8"
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 0 20 11 *" # Nov 20 (pre-Black Friday)
      end: "0 0 2 12 *" # Dec 2
      desiredReplicas: "16"
The Recurring Planning Cycle#
Capacity planning is a cycle, not a project.
Weekly: Review autoscaler events. Did any service hit its replica ceiling? Did any pod get OOM-killed? Are disk volumes approaching 70%?
Monthly: Update baselines. Compare actual growth to projections. Adjust forecasts if growth is faster or slower than expected. Review cost against budget.
Quarterly: Full capacity review. Update growth projections with fresh business input. Evaluate whether instance types, database tiers, and storage classes are still appropriate. Present cost forecast to stakeholders.
Annually: Strategic capacity planning. Align with business plans for the year. Negotiate reserved instance commitments. Plan for infrastructure changes (region expansion, multi-cloud, migration).
The output of each cycle is a document that answers: what do we have, what do we need, when do we need it, and what will it cost? Keep these documents in version control alongside your infrastructure code so the planning history is traceable.