# Kubernetes FinOps: Decision Framework for Cost Optimization
FinOps in Kubernetes is the practice of bringing financial accountability to infrastructure spending. The challenge is not a lack of cost-saving techniques – it is knowing which ones to apply first, which combinations work together, and which ones introduce risk that outweighs the savings. This article provides a structured decision framework for selecting and prioritizing Kubernetes cost optimization strategies.
## The Five Optimization Levers
Every Kubernetes cost optimization effort works across five levers. Each has a different risk profile, implementation effort, and savings ceiling.
| Lever | Typical Savings | Risk Level | Implementation Effort | Prerequisites |
|---|---|---|---|---|
| Rightsizing | 30-50% | Low | Medium | Usage data (7+ days) |
| Spot/Preemptible Instances | 60-90% on eligible workloads | Medium | Medium | Fault-tolerant architecture |
| Cluster Autoscaler Tuning | 10-25% | Low-Medium | Low | Running autoscaler |
| Resource Quotas and Governance | Prevents runaway growth | Low | Low | Namespace strategy |
| Cost Allocation and Visibility | Indirect (behavioral) | None | Medium | Labeling standards |
The order matters. Rightsizing almost always delivers the highest immediate return with the lowest risk. Spot instances offer dramatic savings but require architectural readiness. Autoscaler tuning captures incremental savings. Quotas prevent future waste. Cost visibility changes team behavior over time.
## Decision Tree: Where to Start
Start by answering these questions in sequence:
1. Do you have resource usage data (at least 7 days of Prometheus metrics or equivalent)?
- No: Deploy metrics collection first. Install metrics-server and Prometheus. Run VPA in recommendation-only mode. Wait 7-14 days. You cannot rightsize without data.
- Yes: Proceed to rightsizing.
2. Are your resource requests within 2x of actual p95 usage?
- No: Rightsizing is your highest-impact action. Most clusters have requests 3-10x above actual usage.
- Yes: Your requests are reasonably tuned. Move to node-level optimization.
3. Do you have workloads that can tolerate sudden node loss?
- Yes: Spot instances are your next major savings lever. Stateless services, batch jobs, CI runners, and dev/staging environments are candidates.
- No: Focus on autoscaler tuning and bin-packing instead.
4. Are you running multiple teams or projects on shared clusters?
- Yes: Implement resource quotas and cost allocation. Without them, one team’s growth silently consumes another team’s budget.
- No: Quotas are less critical but still useful as guardrails.
5. Can leadership or teams see what they spend on Kubernetes?
- No: Deploy cost visibility tooling. Behavioral change from visibility often delivers 10-20% savings with zero technical effort.
- Yes: Refine allocation granularity and set up budget alerts.
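Step 1 of the tree recommends running VPA in recommendation-only mode while usage data accumulates. A minimal sketch, assuming the VPA CRDs are already installed; `web` is a hypothetical Deployment name:

```yaml
# VPA in recommendation-only mode: computes request suggestions
# from observed usage without evicting or mutating any pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-recommender
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web          # hypothetical target
  updatePolicy:
    updateMode: "Off"  # recommend only; never apply changes
```

After 7-14 days, read the accumulated recommendations with `kubectl describe vpa web-recommender`.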
## Lever 1: Rightsizing Recommendations
Rightsizing means adjusting resource requests to match actual usage plus a safety buffer. It is almost always the single largest cost reduction available.
When to use: Always. Every cluster benefits from rightsizing.
When to defer: Only when you lack usage data. Guessing is worse than over-provisioning.
The formula:

```text
new_request = p95_actual_usage * 1.2   (CPU)
new_request = p99_actual_usage * 1.15  (memory -- p99 rather than p95, because OOM kills are harsher than CPU throttling)
```

Tool selection for rightsizing:
| Tool | Best For | Effort |
|---|---|---|
| VPA in Off mode | Per-deployment recommendations from real usage | Low (deploy and wait) |
| Goldilocks | Namespace-wide dashboard of VPA recommendations | Low (label namespace and view) |
| Kubecost savings report | Dollar-denominated rightsizing suggestions | Medium (install and configure cloud billing) |
| Manual Prometheus queries | Full control, custom aggregation windows | High |
Risk mitigation: Roll out changes to one deployment at a time. Monitor for 48 hours before proceeding. Watch for CPU throttling (`container_cpu_cfs_throttled_periods_total`) and OOM kills (`kube_pod_container_status_last_terminated_reason`).
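The measurement and monitoring above can be expressed as Prometheus rules. A sketch, assuming cAdvisor and kube-state-metrics are being scraped and the Prometheus Operator CRDs are available; rule names and thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rightsizing-guardrails
spec:
  groups:
    - name: rightsizing
      rules:
        # p95 CPU usage per container over 7 days -- feeds the
        # new_request = p95_actual_usage * 1.2 formula.
        - record: container:cpu_usage:p95_7d
          expr: |
            quantile_over_time(0.95,
              rate(container_cpu_usage_seconds_total[5m])[7d:5m])
        # Fire when a container is heavily throttled after a request change.
        - alert: HighCPUThrottling
          expr: |
            rate(container_cpu_cfs_throttled_periods_total[5m])
              / rate(container_cpu_cfs_periods_total[5m]) > 0.25
          for: 15m
        # Fire when a container's last termination was an OOM kill.
        - alert: RecentOOMKill
          expr: |
            kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```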
## Lever 2: Spot and Preemptible Instances
Spot instances provide 60-90% discounts on compute in exchange for accepting that the cloud provider can reclaim the node with minimal notice.
When to use:
- Stateless services with 3+ replicas behind a load balancer
- Batch jobs with checkpointing or retry logic
- CI/CD runners and build agents
- Dev, staging, and QA environments (entire environments)
- Queue consumers and stream processors
When NOT to use:
- Databases, stateful singletons, or anything with local data that cannot be quickly reconstructed
- Control plane components (etcd, API server)
- Workloads that cannot tolerate a 30-second to 2-minute shutdown window
- Single-replica services with no fallback
Architecture decision: Use a mixed node pool strategy. On-demand nodes handle baseline critical workloads. Spot nodes handle burst and fault-tolerant workloads. Use taints on spot nodes to prevent non-tolerant pods from scheduling there.
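The taint-based separation can be sketched as follows. The taint key `node-lifecycle=spot` is a hypothetical convention, not a standard label; most provisioners let you set the taint directly in the node group or NodePool spec:

```yaml
# Taint spot nodes so only explicitly tolerant pods schedule there,
# e.g. via: kubectl taint nodes <spot-node> node-lifecycle=spot:NoSchedule
#
# Then, in the pod spec of each fault-tolerant workload:
tolerations:
  - key: node-lifecycle
    operator: Equal
    value: spot
    effect: NoSchedule
```

On-demand nodes carry no such taint, so workloads without the toleration land there by default.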
Instance diversification is essential. Configure 10-15 instance types across 3+ availability zones to avoid capacity shortfalls. Karpenter handles this automatically. For Cluster Autoscaler, use multiple node groups with capacity-optimized allocation.
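With Karpenter, diversification is expressed as a wide requirements list on the NodePool. A sketch assuming the v1 NodePool API on AWS; the instance types, zones, and `EC2NodeClass` name are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-diversified
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default   # hypothetical, defined separately
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        # Widen this list to 10-15 types to avoid capacity shortfalls.
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.xlarge", "m5a.xlarge", "m6i.xlarge", "m6a.xlarge",
                   "c5.xlarge", "c5a.xlarge", "c6i.xlarge",
                   "r5.xlarge", "r6i.xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]  # illustrative zones
```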
## Lever 3: Cluster Autoscaler Tuning
A default-configured autoscaler leaves money on the table. Tuning the autoscaler improves how efficiently nodes are utilized and how quickly idle nodes are removed.
Key tuning parameters:
```shell
# Cluster Autoscaler configuration flags
--scale-down-delay-after-add=10m         # Wait 10 min after adding a node before considering scale-down
--scale-down-unneeded-time=5m            # Node must be underutilized for 5 min before removal
--scale-down-utilization-threshold=0.5   # Node is "underutilized" below 50% request utilization
--expander=least-waste                   # Choose the node group that wastes the least capacity
--max-empty-bulk-delete=5                # Remove up to 5 empty nodes at once
--skip-nodes-with-local-storage=false    # Allow scale-down of nodes with emptyDir volumes
```

Expander strategy decision:
| Expander | When to Use |
|---|---|
| `least-waste` | Cost optimization is the priority. Picks the node group that leaves the least unused capacity after scheduling. |
| `priority` | You have preferred node groups (e.g., spot first, on-demand fallback). Define explicit ordering. |
| `random` | Node groups are equivalent and you want even distribution. |
| `most-pods` | You want to maximize the number of pending pods that get scheduled per scaling event. |
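The `priority` expander reads its ordering from a ConfigMap named `cluster-autoscaler-priority-expander` in the autoscaler's namespace. A sketch assuming node group names that contain `spot` and `on-demand` (hypothetical naming):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    # Higher number = higher priority; entries are regexes
    # matched against node group names.
    50:
      - .*spot.*
    10:
      - .*on-demand.*
```

With this ordering, the autoscaler tries spot node groups first and falls back to on-demand only when spot capacity is unavailable.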
Karpenter alternative: Karpenter's consolidation feature (`consolidationPolicy: WhenEmptyOrUnderutilized`) is more aggressive than Cluster Autoscaler scale-down. It proactively moves pods to achieve better bin-packing rather than waiting for nodes to become empty.
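In the v1 NodePool API this lives under the `disruption` block; a sketch, with the wait interval chosen illustratively:

```yaml
# NodePool excerpt: consolidation settings (Karpenter v1 API).
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # repack underutilized nodes, not just empty ones
    consolidateAfter: 1m                           # how long a node must be consolidatable before acting
```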
## Lever 4: Resource Quotas and Governance
Resource quotas cap the total resources a namespace can consume. They prevent any single team or project from consuming unbounded cluster resources.
When to use: Multi-tenant clusters, shared clusters across teams, environments where developers deploy directly.
Quota strategy decision:
| Strategy | Description | Best For |
|---|---|---|
| Hard quotas per namespace | Fixed limits that cannot be exceeded | Production namespaces with predictable workloads |
| Soft quotas with alerts | LimitRanges set defaults, monitoring alerts on high usage | Development environments where flexibility matters |
| Hierarchical quotas | Parent quota splits across child namespaces | Large organizations with team-of-teams structure |
Always deploy LimitRanges alongside ResourceQuotas. Without LimitRanges, pods without explicit requests fail admission when a quota exists. With LimitRanges, every pod gets sensible defaults automatically.
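The pairing looks like this in practice. A sketch with illustrative values; `team-a` is a hypothetical namespace:

```yaml
# Hard quota capping the namespace's total footprint.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
---
# Defaults so pods that omit requests/limits still pass admission.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:   # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:          # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```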
## Lever 5: Cost Allocation and Visibility
Cost visibility changes behavior. When teams can see that their namespace costs $4,200/month and the idle overnight spend is $1,800, they fix it. Without visibility, nobody owns the cost.
Tool selection:
| Tool | License | Cloud Billing Integration | Multi-Cluster | Allocation Granularity |
|---|---|---|---|---|
| Kubecost | Free tier + Enterprise | AWS, GCP, Azure | Enterprise only | Pod, namespace, label, controller |
| OpenCost | Open source (CNCF) | AWS, GCP, Azure | Via federation | Pod, namespace, label, controller |
| Cloud-native tools (AWS CUR, GCP billing, Azure Cost Management) | Included | Native | Yes | Instance-level only (no pod granularity) |
Kubecost vs OpenCost decision:
- Choose OpenCost when you want a free, open-source baseline with no licensing concerns.
- Choose Kubecost when you need the savings recommendations engine, the web dashboard, or multi-cluster aggregation.
- Both use the same cost model engine (OpenCost is the open-source core of Kubecost).
Consistent labels on namespaces and pods (`cost-center`, `team`, `env`) are the foundation of accurate cost allocation. Without them, shared resources (ingress controllers, monitoring stacks, system namespaces) get attributed incorrectly or not at all.
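A labeling convention might look like the following; the key names and values are hypothetical, and what matters is that the same keys are applied everywhere so the cost tool can group by them:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: checkout          # hypothetical namespace
  labels:
    cost-center: cc-1234
    team: payments
    env: production
```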
## Common Anti-Patterns
Optimizing before measuring. Reducing resource requests based on intuition rather than data leads to outages. Always collect at least 7 days of usage data first.
Applying one strategy everywhere. Spot instances work for stateless batch workers but are dangerous for databases. Match the optimization to the workload.
Setting quotas without defaults. A ResourceQuota without a LimitRange causes all pods without explicit requests to fail admission. Always deploy LimitRanges alongside quotas.
One-time optimization. Workload patterns change. Without a recurring review cadence, waste accumulates within 3-6 months of any optimization effort. Build cost review into your operational rhythm.