Kubernetes Cost Audit and Reduction#
Kubernetes clusters accumulate cost waste silently. Resource requests padded “just in case” during initial deployment never get revisited. Load balancers created for debugging stay running. PVCs from deleted applications persist. Over six months, a cluster originally running at $5,000/month can drift to $12,000 with no corresponding increase in actual workload.
This operational plan works through cost reduction systematically, starting with visibility (you cannot cut what you cannot see), moving through quick wins, then tackling the larger structural optimizations that require data collection and careful rollout.
Estimated timeline: 4-6 weeks. Phases 1-2 can be completed in the first week. Phases 3-4 require 2-3 weeks of data collection. Phases 5-6 are ongoing.
Phase 1 – Visibility (Day 1-3)#
Before cutting anything, establish where money is going. Without a cost baseline, you cannot measure progress or prioritize effort.
Step 1: Install a Cost Visibility Tool#
The fastest path is Kubecost (free tier supports a single cluster) or OpenCost (fully open source, CNCF project).
# Option A: Kubecost (free tier)
helm repo add kubecost https://kubecost.github.io/cost-analyzer
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostToken="" \
--set prometheus.server.persistentVolume.size=32Gi
# Option B: OpenCost
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost \
--namespace opencost \
--create-namespace
Step 2: If No Tool – Estimate from Prometheus#
If you cannot install a dedicated tool, calculate cost from resource requests and node pricing. This gives a rougher picture but is still actionable.
# Total CPU requests across the cluster (in cores)
sum(kube_pod_container_resource_requests{resource="cpu"})
# Total memory requests across the cluster (in bytes, convert to GiB)
sum(kube_pod_container_resource_requests{resource="memory"}) / 1024 / 1024 / 1024
# CPU requests per namespace
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
# Memory requests per namespace
sum by (namespace) (kube_pod_container_resource_requests{resource="memory"}) / 1024 / 1024 / 1024Multiply total CPU cores by your per-core-hour cost (check your cloud provider pricing page for the instance type your nodes use) and total GiB by per-GiB-hour cost. This gives you a rough monthly compute cost. Add storage (number of PVs times per-GB cost) and networking (load balancers at roughly $15-20/month each plus data transfer).
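A worked version of that arithmetic in shell. The totals and rates below are placeholders, not real prices – substitute the outputs of the queries above and your provider's rates:
# Back-of-envelope monthly compute cost from requests
CPU_CORES=120     # total CPU requests, from the first query above
MEM_GIB=480       # total memory requests in GiB, from the second query
CORE_HOUR=0.03    # assumed $/core-hour -- replace with your provider's rate
GIB_HOUR=0.004    # assumed $/GiB-hour -- replace with your provider's rate
echo "$CPU_CORES $CORE_HOUR $MEM_GIB $GIB_HOUR" | \
  awk '{printf "Compute: $%.0f/month\n", ($1*$2 + $3*$4) * 730}'   # ~730 hours per month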
Step 3: Build the Cost Baseline#
Document current monthly spend across these categories:
# Count load balancers (each costs $15-20/month on most cloud providers)
kubectl get svc --all-namespaces -o json | \
jq '[.items[] | select(.spec.type=="LoadBalancer")] | length'
# Count persistent volume claims and total storage
kubectl get pvc --all-namespaces -o json | \
jq '[.items[] | .spec.resources.requests.storage] | length'
# Count nodes and their instance types
kubectl get nodes -o custom-columns=NAME:.metadata.name,TYPE:.metadata.labels."node\.kubernetes\.io/instance-type",CAPACITY_CPU:.status.capacity.cpu,CAPACITY_MEM:.status.capacity.memory
Step 4: Identify Cost per Namespace and Workload#
# Top 10 namespaces by CPU request cost
topk(10, sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"}))
# Top 10 individual pods by CPU request
topk(10, sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"}))
# Ratio of actual CPU usage to requests (lower = more waste)
# (container!="" excludes the pod-level cgroup series, which would double-count usage)
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
A usage-to-request ratio below 0.3 indicates significant over-provisioning in that namespace.
Phase 1 Output: A document listing monthly cost breakdown by category (compute, storage, networking), cost per namespace, and the usage-to-request ratio per namespace. This is your baseline.
Phase 2 – Quick Wins (Week 1)#
These are low-risk changes that reduce cost immediately without affecting application stability.
Step 5: Find and Remove Idle Resources#
# PVCs not mounted to any running pod
kubectl get pvc --all-namespaces -o json | jq -r '
.items[] |
select(.status.phase == "Bound") |
"\(.metadata.namespace)/\(.metadata.name) - \(.spec.resources.requests.storage)"
' > all-pvcs.txt
kubectl get pods --all-namespaces -o json | jq -r '
.items[].spec.volumes[]? |
select(.persistentVolumeClaim) |
.persistentVolumeClaim.claimName
' | sort -u > mounted-pvcs.txt
# PVCs in all-pvcs.txt but not in mounted-pvcs.txt are candidates for deletion
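# A minimal way to produce that candidate list from the two files above
# (assumption: PVC names are unique across namespaces; if not, compare per namespace)
comm -23 \
  <(awk -F'[/ ]' '{print $2}' all-pvcs.txt | sort -u) \
  mounted-pvcs.txt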
# LoadBalancer Services -- check if they are receiving traffic
kubectl get svc --all-namespaces -o json | jq -r '
.items[] | select(.spec.type=="LoadBalancer") |
"\(.metadata.namespace)/\(.metadata.name)"
'
# Cross-reference with ingress controller logs or cloud provider LB metrics
# Each unused LB costs $15-20/month minimum
# Namespaces with no running pods (abandoned environments)
kubectl get namespaces -o json | jq -r '.items[].metadata.name' | while read ns; do
count=$(kubectl get pods -n "$ns" --no-headers 2>/dev/null | wc -l)
if [ "$count" -eq 0 ]; then
echo "Empty namespace: $ns"
fi
done
Step 6: Delete Orphaned Resources#
After identifying candidates, delete them. Be cautious – verify each resource is genuinely unused before deletion.
# Delete an unused PVC (after confirming no pod needs it)
kubectl delete pvc <name> -n <namespace>
# Delete an unused LoadBalancer Service
kubectl delete svc <name> -n <namespace>
# Delete an abandoned namespace (this deletes EVERYTHING in it)
kubectl delete namespace <name>
Step 7: Right-Size Obvious Over-Provisioning#
Look for pods where requests dramatically exceed actual usage:
# Pods requesting more than 10x their actual CPU usage
# (matching on namespace/pod is needed because the two metrics carry different
#  label sets; container!="" excludes the pod-level cgroup series)
sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
/
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[1h]))
> 10
For any pod requesting 4 CPU but consistently using 0.1 CPU, reduce the request immediately. A 40x over-provision is never intentional.
Step 8: Scale Down Non-Production During Off-Hours#
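The CronJobs below run kubectl under a scaler ServiceAccount that these manifests do not otherwise define. A minimal sketch of that ServiceAccount and the RBAC it needs, assuming everything lives in the staging namespace:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: scaler
  namespace: staging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
  namespace: staging
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "deployments/scale"]
  verbs: ["get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-scaler
  namespace: staging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deployment-scaler
subjects:
- kind: ServiceAccount
  name: scaler
  namespace: staging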
# CronJob to scale down staging at 8 PM
apiVersion: batch/v1
kind: CronJob
metadata:
name: scale-down-staging
namespace: staging
spec:
schedule: "0 20 * * 1-5"
jobTemplate:
spec:
template:
spec:
serviceAccountName: scaler
containers:
- name: scaler
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
kubectl scale deployment --all -n staging --replicas=0
restartPolicy: OnFailure
---
# CronJob to scale up staging at 8 AM
apiVersion: batch/v1
kind: CronJob
metadata:
name: scale-up-staging
namespace: staging
spec:
schedule: "0 8 * * 1-5"
jobTemplate:
spec:
template:
spec:
serviceAccountName: scaler
containers:
- name: scaler
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
kubectl scale deployment app-api -n staging --replicas=2
kubectl scale deployment app-web -n staging --replicas=1
restartPolicy: OnFailure
Phase 2 Output: List of deleted resources and estimated monthly savings. Typical savings from quick wins range from 10-25% of total spend.
Phase 3 – Systematic Rightsizing (Week 2-3)#
Quick wins catch the obvious waste. Systematic rightsizing addresses the pervasive problem of slightly-too-generous resource requests across every workload.
Step 9: Install VPA in Recommendation Mode#
VPA (Vertical Pod Autoscaler) watches actual resource usage over time and calculates optimal requests. Install it in recommendation-only mode so it does not automatically change anything.
# Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
# Create VPA objects in "Off" mode for each deployment
# This collects data without making changes
for deploy in $(kubectl get deployments -n production -o name); do
name=$(echo "$deploy" | cut -d/ -f2)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: ${name}-vpa
namespace: production
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: ${name}
updatePolicy:
updateMode: "Off"
EOF
done
Step 10: Wait for Stable Recommendations#
VPA needs 1-2 weeks of data to produce reliable recommendations. Check progress:
# View VPA recommendations
kubectl get vpa -n production -o json | jq -r '
.items[] |
"\(.metadata.name):" +
" CPU lower=\(.status.recommendation.containerRecommendations[0].lowerBound.cpu // "pending")" +
" CPU target=\(.status.recommendation.containerRecommendations[0].target.cpu // "pending")" +
" CPU upper=\(.status.recommendation.containerRecommendations[0].upperBound.cpu // "pending")" +
" MEM target=\(.status.recommendation.containerRecommendations[0].target.memory // "pending")"
'
Step 11: Apply Recommendations#
Start with the workloads that have the largest gap between current requests and VPA recommendations. Use the p95 of actual usage plus a 20% buffer as the new request value, as in the example below.
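If you prefer to derive the p95 directly from Prometheus rather than from the VPA object, a sketch (the namespace and 7-day window are assumptions; subqueries this long are expensive, so scope them to one workload at a time):
# p95 CPU usage per pod over the last 7 days
quantile_over_time(0.95,
  (sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])))[7d:5m]
)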
# Example: VPA recommends 150m CPU, current request is 1000m
# New request: 150m * 1.2 = 180m (round to 200m for cleanliness)
kubectl set resources deployment/my-app -n production \
--requests=cpu=200m,memory=256Mi \
--limits=cpu=500m,memory=512Mi
Step 12: Verify Stability After Rightsizing#
# Check for OOMKilled events (memory set too low)
kubectl get events -n production --field-selector reason=OOMKilling --sort-by='.lastTimestamp'
# Check for CPU throttling (CPU limit too low)
# This Prometheus query shows containers being throttled
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))
/
sum by (pod) (rate(container_cpu_cfs_periods_total[5m])) > 0.5
# Check pod restarts (general instability)
kubectl get pods -n production --sort-by='.status.containerStatuses[0].restartCount'
If OOMKilled events appear, increase memory requests by 25% and monitor again. Do not reduce resources below the point where the application becomes unstable under normal or peak load.
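Events age out quickly (the API server's default event TTL is one hour), so it is also worth checking container last-state directly; a sketch:
# Pods whose containers were most recently terminated by the OOM killer
kubectl get pods -n production -o json | jq -r '
  .items[] |
  select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") |
  .metadata.name
'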
Phase 4 – Node Optimization (Week 3-4)#
After rightsizing pods, node-level optimization becomes possible because pods now request closer to their actual needs, allowing tighter bin-packing.
Step 13: Analyze Node Utilization#
# Average CPU utilization per node (should be >60% for cost efficiency)
# (container!="" excludes the pod-level cgroup series, which would double-count usage)
sum by (node) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum by (node) (kube_node_status_capacity{resource="cpu"})
# Average memory utilization per node
sum by (node) (container_memory_working_set_bytes{container!=""})
/
sum by (node) (kube_node_status_capacity{resource="memory"})
If average utilization is below 40%, nodes are significantly under-used.
Step 14: Tighten Autoscaler Settings#
# For Cluster Autoscaler: make scale-down more aggressive
# These flags are set on the Cluster Autoscaler deployment
# --scale-down-utilization-threshold=0.5 (default; raise toward 0.6-0.7 so busier nodes also qualify for scale-down)
# --scale-down-delay-after-add=10m (default, consider 5m)
# --scale-down-unneeded-time=10m (default, consider 5m)
# Reduce minimum node count if current minimum is above actual need
# Cloud-specific: update the managed node group or auto scaling group
# AWS EKS example:
aws eks update-nodegroup-config \
--cluster-name my-cluster \
--nodegroup-name main-pool \
--scaling-config minSize=2,maxSize=20,desiredSize=3
Step 15: Consider Larger Node Types#
Fewer larger nodes are often cheaper than many small nodes because system overhead (kubelet, kube-proxy, DaemonSets) per node is fixed. A cluster with 20 small nodes runs 20 copies of every DaemonSet. Consolidating to 8 larger nodes saves the overhead of 12 DaemonSet replicas plus 12 nodes worth of system reservation.
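To put a number on that fixed overhead in your own cluster, a PromQL sketch (assumes kube-state-metrics is installed):
# CPU requested by DaemonSet-owned pods, averaged per node
sum(
  kube_pod_container_resource_requests{resource="cpu"}
  * on (namespace, pod) group_left()
  kube_pod_owner{owner_kind="DaemonSet"}
)
/
count(kube_node_info)
Multiply that per-node figure by the number of nodes you expect to remove to estimate the reclaimable overhead.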
Step 16: Implement Spot/Preemptible Nodes#
For fault-tolerant workloads (stateless web servers, batch jobs, CI runners), spot instances save 60-90% over on-demand pricing.
# Karpenter NodePool with spot instances
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: spot-pool
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["m5.xlarge", "m5.2xlarge", "m5a.xlarge", "m5a.2xlarge"]
limits:
cpu: "100"
memory: 400Gi
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 30s
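To keep only interruption-tolerant workloads on that capacity, pin them to the capacity-type label Karpenter applies to its nodes. A sketch with a hypothetical batch-worker Deployment (adding a taint to the NodePool and a matching toleration here would also keep everything else off spot):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker          # hypothetical fault-tolerant workload
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot   # label Karpenter sets on spot nodes
      containers:
      - name: worker
        image: registry.example.com/batch-worker:latest   # placeholder image
        resources:
          requests:
            cpu: 500m
            memory: 512Mi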
Step 17: Reserve Baseline Capacity#
For workloads that run 24/7, Reserved Instances (AWS), Committed Use Discounts (GCP), or Reserved VM Instances (Azure) save 30-60% over on-demand. Calculate your baseline by looking at the minimum node count over the past 30 days – that floor is your reservation target.
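A quick way to find that floor, assuming kube-state-metrics and at least 30 days of Prometheus retention:
# Minimum node count over the past 30 days
min_over_time(count(kube_node_info)[30d:1h])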
Phase 4 Verification: Node count should decrease. Pod scheduling should remain healthy (no pending pods). Check kubectl get pods --field-selector=status.phase=Pending --all-namespaces regularly.
Phase 5 – Storage and Networking (Ongoing)#
Step 18: Audit Storage#
# List all PVs with their size and storage class
kubectl get pv -o custom-columns=\
NAME:.metadata.name,\
CAPACITY:.spec.capacity.storage,\
CLASS:.spec.storageClassName,\
STATUS:.status.phase,\
CLAIM:.spec.claimRef.name
# Check if premium storage is used where standard would suffice
# Premium SSD: $0.17/GB/month (Azure), gp3: $0.08/GB/month (AWS)
# Standard HDD: $0.04/GB/month
# Decision: use premium only for databases, standard for everything else
PVs are typically grow-only in production. You cannot shrink a PV without recreating it and migrating data. Focus on ensuring new PVCs use appropriate storage classes rather than trying to resize existing ones.
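To see which existing PVCs sit on a premium class today, a sketch (the class name is an assumption; substitute whatever your provider calls its premium tier):
kubectl get pvc --all-namespaces -o json | jq -r '
  .items[] |
  select(.spec.storageClassName == "managed-premium") |
  "\(.metadata.namespace)/\(.metadata.name) \(.spec.resources.requests.storage)"
'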
Step 19: Consolidate Load Balancers#
# Count LoadBalancer services
kubectl get svc --all-namespaces -o json | \
jq '[.items[] | select(.spec.type=="LoadBalancer")] | length'
# Each LB costs $15-20/month. If you have 10 LoadBalancer services,
# that's $150-200/month. Consolidate to a single ingress controller
# with multiple Ingress resources routing to different backends.
Replace individual LoadBalancer services with ClusterIP services behind a shared ingress controller. A single NGINX ingress controller handles hundreds of backends.
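A sketch of the consolidated pattern (assumes an NGINX ingress controller is already installed; the hostnames and the app-api/app-web ClusterIP Services are placeholders):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: consolidated
  namespace: production
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-api        # ClusterIP service, no cloud LB
            port:
              number: 80
  - host: www.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-web
            port:
              number: 80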
Step 20: Minimize Cross-AZ Traffic#
In AWS, cross-AZ data transfer costs $0.01/GB in each direction. For high-throughput services, this adds up.
# Enable topology-aware routing (Kubernetes 1.27+)
apiVersion: v1
kind: Service
metadata:
name: my-app
annotations:
service.kubernetes.io/topology-mode: Auto
spec:
selector:
app: my-app
ports:
- port: 80
targetPort: 8080
This causes kube-proxy to prefer routing traffic to pods in the same availability zone, reducing cross-AZ data transfer.
Phase 6 – Ongoing Governance#
Step 21: Set Up Cost Alerting#
# Prometheus alerting rule for cost anomaly
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: cost-alerts
namespace: monitoring
spec:
groups:
- name: cost
rules:
- alert: HighCPURequestGrowth
expr: |
sum(kube_pod_container_resource_requests{resource="cpu"})
>
sum(kube_pod_container_resource_requests{resource="cpu"} offset 1d) * 1.2
for: 1h
annotations:
summary: "Total CPU requests increased more than 20% in 24 hours"
- alert: NewLoadBalancerCreated
expr: |
count(kube_service_spec_type{type="LoadBalancer"})
>
count(kube_service_spec_type{type="LoadBalancer"} offset 1h)
for: 5m
annotations:
summary: "New LoadBalancer service created -- costs $15-20/month"Step 22: Enforce Budgets per Namespace#
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-budget
namespace: team-alpha
spec:
hard:
requests.cpu: "10"
requests.memory: 20Gi
persistentvolumeclaims: "10"
services.loadbalancers: "1"
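Note that a quota on requests.cpu and requests.memory also rejects pods that set no requests at all, so it pairs naturally with a LimitRange that supplies defaults (the values here are assumptions to tune per team):
apiVersion: v1
kind: LimitRange
metadata:
  name: team-defaults
  namespace: team-alpha
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:               # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi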
Step 23: Monthly Review Process#
Establish a monthly cost review that covers:
- Total spend vs. budget vs. previous month
- Cost per namespace and per team (chargeback/showback)
- VPA recommendations that have not been applied
- New resources created that month (PVCs, LBs, namespaces)
- Spot instance interruption rate and coverage
- Reserved instance utilization (are you using what you reserved?)
Expected Savings by Phase#
| Phase | Typical Savings | Effort | Risk |
|---|---|---|---|
| Quick Wins | 10-25% | Low | Low |
| Systematic Rightsizing | 15-30% | Medium | Medium |
| Node Optimization | 10-20% | Medium | Medium |
| Storage and Networking | 5-15% | Low | Low |
| Spot/Reserved | 20-40% on eligible workloads | Medium | Low-Medium |
These ranges compound. A cluster spending $10,000/month can typically be brought to $5,000-7,000/month through the full sequence. The largest single lever is usually rightsizing – most clusters have 2-3x more CPU and memory requested than actually used.