# Kubernetes Cost Optimization
Most Kubernetes clusters run at 15-30% actual CPU utilization but are billed for the full provisioned capacity. The gap between what you reserve and what you use is pure waste. This article covers the practical workflow for finding and eliminating that waste.
## The Cost Problem: Requests vs Actual Usage
Kubernetes resource requests are the foundation of cost. When a pod requests 4 CPUs, the scheduler reserves 4 CPUs on a node regardless of whether the pod ever uses more than 0.1 CPU. The node is sized (and billed) based on what is reserved, not what is consumed.
```text
Provisioned capacity: 20 nodes x 8 CPU = 160 CPUs
Total requests:       120 CPUs (75% of capacity — looks healthy)
Actual usage:         30 CPUs  (19% of capacity — massive waste)
```

The 90 CPUs of requested-but-unused capacity represents real money. On AWS at roughly $0.05/CPU-hour for m5.2xlarge on-demand, that is over $3,200/month in wasted compute for this one cluster (90 CPUs x $0.05/CPU-hour x 730 hours/month ≈ $3,285).
## Step 1: Measure Actual Usage
Before you can rightsize anything, you need usage data. The minimum data collection period is 7 days to capture weekly traffic patterns. Two weeks is better.
### Quick Check with kubectl
```bash
# Current CPU and memory usage per pod
kubectl top pods -n production --sort-by=cpu

# Current usage per node
kubectl top nodes
```

`kubectl top` shows a point-in-time snapshot. It is useful for a quick gut check but not sufficient for rightsizing decisions.
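If Prometheus is not in place yet, one stopgap is to sample kubectl top on an interval and build a rough usage history by hand. A minimal sketch (the namespace, output file, and 5-minute interval are arbitrary choices for illustration):

```bash
# Append a timestamped kubectl top sample every 5 minutes
# Columns: timestamp, pod, cpu, memory
while true; do
  kubectl top pods -n production --no-headers \
    | awk -v ts="$(date -u +%FT%TZ)" '{print ts "," $1 "," $2 "," $3}' >> pod-usage.csv
  sleep 300
done
```

Even a few days of samples like this will expose the worst offenders, but the Prometheus queries below are the better path.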
### Prometheus Queries for Historical Data
Prometheus gives you the historical data you need. These queries return the p95 CPU and memory usage over the past 7 days, grouped by pod and container.
```promql
# p95 CPU usage per container over 7 days (in cores)
max by (pod, container) (
  quantile_over_time(0.95,
    rate(container_cpu_usage_seconds_total{
      namespace="production",
      container!=""
    }[5m])[7d:]
  )
)
```

```promql
# p95 memory working set per container over 7 days (in bytes)
max by (pod, container) (
  quantile_over_time(0.95,
    container_memory_working_set_bytes{
      namespace="production",
      container!=""
    }[7d]
  )
)
```

### Compare Requests to Usage
The key comparison is between what the pod requested and what it actually used:
```promql
# CPU waste ratio per container: requested / actual
# Values > 2 mean you are requesting more than double what you use
sum by (namespace, pod, container) (
  kube_pod_container_resource_requests{resource="cpu"}
)
/
sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
)
```

Any container with a ratio above 3-4x is a strong rightsizing candidate. In practice, it is common to find pods requesting 4 CPUs while using 0.1, or requesting 8Gi of memory while using 500Mi.
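These queries can be run in Grafana or the Prometheus UI, but for scripting it is handy to hit the Prometheus HTTP API directly. A sketch, assuming Prometheus is reachable locally via a port-forward (the service name and namespace vary by installation):

```bash
# Port-forward Prometheus (adjust the service name to your install)
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
sleep 3   # give the port-forward a moment to establish

# Evaluate the waste-ratio query and print one line per container
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"}) / sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))' \
  | jq -r '.data.result[] | "\(.metric.namespace)/\(.metric.pod)/\(.metric.container): \(.value[1])"'
```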
## Step 2: Tools for Rightsizing
### VPA in Recommendation Mode
The Vertical Pod Autoscaler in Off mode (also called recommendation-only mode) watches actual usage and generates resource recommendations without changing anything:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # recommendation only, no automatic changes
```

After a few days, check recommendations:
```bash
kubectl get vpa my-app-vpa -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations}' | jq .
```

The output includes lowerBound, target, uncappedTarget, and upperBound for both CPU and memory. Use the target value as a starting point and add a 20% buffer.
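If you only care about the targets, a jsonpath range expression keeps the output to one line per container (same hypothetical VPA name as above):

```bash
# Print container name, target CPU, and target memory
kubectl get vpa my-app-vpa -n production \
  -o jsonpath='{range .status.recommendation.containerRecommendations[*]}{.containerName}{"\t"}{.target.cpu}{"\t"}{.target.memory}{"\n"}{end}'
```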
### Goldilocks
Goldilocks creates VPA objects for every deployment in a namespace and presents the recommendations in a web dashboard. It gives you a single view of rightsizing opportunities across all workloads.
```bash
# Add the Fairwinds chart repo if it is not already configured
helm repo add fairwinds-stable https://charts.fairwinds.com/stable

# Install with Helm
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace

# Enable for a namespace (creates VPA objects automatically)
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

# Port-forward the dashboard
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
```

### Kubecost
Kubecost provides full cost allocation and connects resource usage to actual cloud billing data. It calculates per-pod, per-namespace, and per-label costs using the actual pricing from your cloud provider.
```bash
# Add the Kubecost chart repo if it is not already configured
helm repo add kubecost https://kubecost.github.io/cost-analyzer/

helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set kubecostToken="YOUR_TOKEN"
```

Kubecost's savings page directly shows you the dollar amount you would save by rightsizing each workload. The kubectl-cost plugin provides CLI access:
```bash
# Install the kubectl-cost plugin via krew if it is not already present
kubectl krew install cost

# Cost by namespace for the last 7 days
kubectl cost namespace --window 7d

# Cost by deployment in the production namespace
kubectl cost deployment --namespace production --window 7d
```

## Step 3: Adjust Resource Requests
The rightsizing formula is straightforward:
```text
new_request = p95_actual_usage * 1.2
```

Take the 95th percentile of actual usage over at least 7 days and add a 20% buffer. This handles normal spikes while eliminating the gross over-provisioning.
For a deployment currently requesting 4 CPUs and 8Gi memory but actually using 0.5 CPU at p95 and 1.2Gi memory at p95:
```yaml
resources:
  requests:
    cpu: "600m"       # was 4000m; p95 usage 500m + 20% buffer
    memory: "1500Mi"  # was 8Gi; p95 usage 1200Mi + 20% buffer, rounded up
  limits:
    cpu: "1200m"      # 2x request for burst headroom
    memory: "2Gi"     # firm limit to prevent OOMKill surprises
```

Roll changes out gradually. Rightsize one deployment at a time, monitor for a few days, then proceed. A batch update across every deployment simultaneously is asking for trouble.
## Step 4: Namespace-Level Cost Allocation
For chargeback or showback, use ResourceQuotas combined with labels:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
```

Label namespaces and pods with team or cost-center identifiers so Kubecost or your cloud cost tool can group spending:
```bash
kubectl label namespace team-backend cost-center=engineering team=backend
```

## Step 5: Node-Level Optimization
### Bin Packing vs Spreading
Bin packing fills each node as fully as possible before starting new ones. This minimizes the number of running nodes and reduces cost, but a single node failure impacts more pods.
Spreading distributes pods evenly across nodes. Better for availability, but you end up with more partially filled nodes.
For cost optimization, lean toward bin packing. Configure your autoscaler accordingly:
```bash
# Cluster Autoscaler: prefer the least-waste expander
# (picks the node group that wastes the least capacity after scheduling)
--expander=least-waste
```

Karpenter handles this natively with its consolidation feature, which actively replaces underutilized nodes with smaller ones:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
```

### Node Sizing Strategy
Fewer large nodes generally beat many small nodes for cost efficiency. Every node runs system pods (kube-proxy, CNI agent, monitoring agents) that consume fixed overhead. Ten nodes with 8 CPUs each waste more on overhead than five nodes with 16 CPUs each, while providing the same total capacity.
The exception is when you need topology spreading for availability, where more nodes across more failure domains is the correct trade-off.
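When availability wins, the standard mechanism is a topology spread constraint on the workload. A minimal pod-spec fragment (the app label is a placeholder):

```yaml
# Spread replicas across zones; ScheduleAnyway keeps this a soft preference
# so it does not fight the bin-packing autoscaler too hard
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: my-app
```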
## Savings Strategies Ranked by Impact
1. Rightsize resource requests – Typically the single biggest cost reduction. Most clusters see 30-50% savings.
2. Autoscale cluster nodes down – Stop paying for idle nodes. Cluster Autoscaler or Karpenter should remove nodes that have been underutilized for more than 10 minutes (see the flag sketch after this list).
3. Use spot instances for non-critical workloads – 60-90% savings on burst capacity (see the spot instances article for details).
4. Reserved capacity for baseline – Committed use discounts (GCP), Reserved Instances (AWS), or Reserved VM Instances (Azure) for workloads that always run. Typically 30-60% savings.
5. Delete unused resources – Orphaned PersistentVolumeClaims, idle load balancers, abandoned namespaces. These accumulate silently; the script after this list finds orphaned PVCs.
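For item 2, the Cluster Autoscaler's scale-down behavior is controlled by flags on the autoscaler deployment. A sketch of the two most relevant ones (the values shown are the defaults):

```bash
# How long a node must be unneeded before it is removed
--scale-down-unneeded-time=10m

# Nodes whose requested resources fall below this ratio become scale-down candidates
--scale-down-utilization-threshold=0.5
```

For item 5, the script below walks every bound PVC and flags the ones that no pod mounts: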
```bash
# Find PVCs not mounted by any pod
kubectl get pvc --all-namespaces -o json | jq -r '
  .items[] | select(.status.phase == "Bound") |
  "\(.metadata.namespace)/\(.metadata.name)"
' | while read pvc; do
  ns=$(echo "$pvc" | cut -d/ -f1)
  name=$(echo "$pvc" | cut -d/ -f2)
  used=$(kubectl get pods -n "$ns" -o json | jq --arg pvc "$name" \
    '[.items[].spec.volumes[]? | select(.persistentVolumeClaim.claimName == $pvc)] | length')
  if [ "$used" = "0" ]; then
    echo "UNUSED: $pvc"
  fi
done
```

## Common Gotchas
Setting requests too low causes eviction. When a node comes under memory pressure, the kubelet evicts pods whose actual memory usage exceeds their requests. Under-requesting to save cost and then losing pods under load is worse than the original overspending. Always maintain a buffer above p95.
HPA minimum replicas waste money at night. If your HPA has minReplicas: 10 but traffic drops to near zero overnight, those 10 pods sit idle for 8-10 hours a day. Consider lower minimums with a CronJob that adjusts the HPA minReplicas for off-hours, or use KEDA with scaling-to-zero capability.
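A sketch of the CronJob approach, assuming an HPA named my-app in the production namespace and a ServiceAccount (here called hpa-scaler) with RBAC permission to patch HorizontalPodAutoscalers; a mirror-image CronJob on a morning schedule restores the original minimum:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hpa-night-scaledown
  namespace: production
spec:
  schedule: "0 22 * * *"               # every night at 22:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-scaler   # needs RBAC to patch HPAs
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest  # any image with kubectl works
            command:
            - /bin/sh
            - -c
            - >-
              kubectl patch hpa my-app -n production
              --type merge -p '{"spec":{"minReplicas":2}}'
```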
Cost visibility lag. Cloud billing data is delayed 24-48 hours. After making rightsizing changes, do not expect to see cost reduction in your cloud bill for several days. Use Kubecost or Prometheus metrics for near-real-time verification.
## Practical Example: 20 Nodes to 12
A production EKS cluster running 20 m5.2xlarge nodes (8 CPU, 32Gi each). Monthly compute cost: approximately $11,000.
Analysis showed total CPU requests of 110 cores against actual p95 usage of 35 cores. Memory requests totaled 280Gi against actual p95 of 95Gi.
The rightsizing process over three weeks:
- Deployed VPA in Off mode for all deployments. Waited 10 days.
- Reduced CPU requests by 50-80% across 40 deployments based on VPA recommendations plus buffer.
- Reduced memory requests by 40-60% across the same deployments.
- Enabled Karpenter consolidation to replace underutilized nodes.
- Cluster settled at 12 nodes with headroom. Monthly cost dropped to approximately $6,600 – a 40% reduction.
The key insight: nobody had revisited resource requests since the initial deployment 18 months earlier. The defaults from the original Helm chart values were wildly generous, and actual usage had never grown into them.