Kubernetes Resource Management Deep Dive#
Resource management in Kubernetes is the mechanism that decides which pods get scheduled, which pods get killed when the node runs low, and how much CPU and memory each container is actually allowed to use. The surface-level concept of requests and limits is straightforward. The underlying mechanics – QoS classification, CFS CPU quotas, kernel OOM scoring, kubelet eviction thresholds – are where misconfigurations cause production outages.
Requests and Limits: What They Actually Control#
Requests are a scheduling constraint. When a pod declares requests.cpu: 500m and requests.memory: 256Mi, the scheduler finds a node with at least that much allocatable capacity remaining. Once scheduled, those resources are reserved – no other pod can claim them. But the container can use less than its request (the excess sits idle on the node) or more than its request (if the node has spare capacity and no limit prevents it).
Limits are an enforcement ceiling. They tell the kernel (via cgroups) the absolute maximum a container may use. CPU limits result in throttling. Memory limits result in OOM kills.
The gap between request and limit is overcommit. If you request 256Mi but set a limit of 1Gi, you are telling Kubernetes: “I need 256Mi guaranteed, but I might use up to 1Gi if it is available.” The scheduler only accounts for the 256Mi. If every pod on a node bursts to its limit simultaneously, the node runs out of memory and the kernel starts killing processes.
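As a concrete sketch of that overcommit pattern (the pod and container names here are placeholders, not from any real manifest):
apiVersion: v1
kind: Pod
metadata:
  name: overcommit-example      # hypothetical name
spec:
  containers:
  - name: app                   # placeholder container
    image: example/app:1.0      # placeholder image
    resources:
      requests:
        memory: 256Mi           # what the scheduler reserves on the node
      limits:
        memory: 1Gi             # the burst ceiling enforced via cgroups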
QoS Classes#
Kubernetes automatically assigns one of three Quality of Service classes to every pod based on its resource configuration. You do not set QoS directly – it is derived.
Guaranteed#
Every container in the pod has both CPU and memory requests set, and requests equal limits for both resources.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 512Mi
Guaranteed pods are the last to be evicted and receive the lowest OOM score adjustment (-997). Use this class for databases, stateful services, and anything where an unexpected kill causes data loss or extended recovery time.
The tradeoff: no bursting. The container can never exceed its request because the request equals the limit. You must size the resources accurately.
Burstable#
At least one container has a CPU or memory request or limit set, but the pod does not meet the Guaranteed criteria (requests and limits are not all set and equal). This is the most common class in practice.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi
Burstable pods have middle eviction priority. Their OOM score varies based on how much memory they consume relative to their request. Use this class for most application workloads – web servers, API services, workers.
BestEffort#
No requests or limits are set on any container in the pod.
resources: {}
BestEffort pods are the first to be evicted and receive the highest OOM score adjustment (1000). They can consume whatever resources are available on the node, but they have zero protection. Do not use this in production. Even for batch jobs, set at least a request so the scheduler can make informed decisions.
CPU Management: CFS Quotas and Throttling#
CPU requests map to CFS (Completely Fair Scheduler) shares. A container with requests.cpu: 500m gets 512 shares (out of 1024 per core). When the node is under contention, the CFS divides CPU time proportionally by shares. When the node has spare capacity, shares do not limit anything – containers can use as much idle CPU as they want.
CPU limits map to CFS quotas. A container with limits.cpu: 500m gets a quota of 50ms per 100ms period. If the container tries to use more CPU in that period, it is throttled – the process is paused until the next period begins.
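As a sketch, the comments below restate how those two fields map onto CFS parameters (exact file names and units differ between cgroup v1 and v2):
resources:
  requests:
    cpu: 500m    # -> 512 CFS shares: a relative weight that only matters under contention
  limits:
    cpu: 500m    # -> CFS quota of 50ms of CPU time per 100ms period; excess demand is throttled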
The critical insight: CPU throttling is invisible to most monitoring. The application does not crash. It does not log an error. It just runs slower. Response times increase, queues back up, and the symptoms look like an application performance problem, not a resource problem.
To detect throttling, check these Prometheus metrics:
# Total throttled periods
rate(container_cpu_cfs_throttled_periods_total{container="myapp"}[5m])
# Total throttled time in seconds
rate(container_cpu_cfs_throttled_seconds_total{container="myapp"}[5m])
# Throttling percentage
rate(container_cpu_cfs_throttled_periods_total{container="myapp"}[5m])
/
rate(container_cpu_cfs_periods_total{container="myapp"}[5m])
A throttling percentage above 10-20% is cause for investigation. Above 50% means the application is severely constrained.
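One way to make throttling visible is a Prometheus alerting rule built from the percentage query above; this is a sketch, and the group name, threshold, and label filters are illustrative:
groups:
- name: cpu-throttling            # hypothetical rule group
  rules:
  - alert: HighCPUThrottling
    expr: |
      rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
        / rate(container_cpu_cfs_periods_total{container!=""}[5m]) > 0.25
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} is CPU-throttled in more than 25% of periods"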
Should you set CPU limits? There is a strong argument for not setting CPU limits and only setting CPU requests. Without a limit, the container can burst freely when the node has idle CPU, and the CFS shares from requests still provide fair scheduling under contention. Many production clusters (including Google’s internal clusters) operate this way. Set CPU limits only when you need strict isolation – for example, a noisy neighbor that would monopolize CPU and starve other tenants.
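A sketch of the requests-only approach for CPU, while still capping memory (see the next section for why memory is treated differently):
resources:
  requests:
    cpu: 500m          # fair-share weight under contention; free bursting when the node is idle
    memory: 512Mi
  limits:
    memory: 512Mi      # memory keeps a hard ceiling
    # no cpu limit: no CFS quota, so no throttling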
Memory Management: OOM Kills#
Memory behaves fundamentally differently from CPU. CPU is compressible – a throttled container is slower but still runs. Memory is incompressible – when a container exceeds its memory limit, the kernel OOM killer terminates the process immediately.
Always set memory limits. Without a memory limit, a container with a memory leak will grow until it triggers the node-level OOM killer, which can kill any process on the node – including kubelet or other pods.
The kernel OOM score for a container is calculated from two factors:
- oom_score_adj: Set by the kubelet based on QoS class. BestEffort gets 1000 (killed first). Burstable gets a value between 2 and 999, calculated based on the memory request relative to the node’s total memory. Guaranteed gets -997 (killed last).
- Memory usage relative to the request: Within the same QoS class, containers using more memory relative to their request get a higher OOM score.
When the node runs out of memory, the kernel kills the process with the highest combined OOM score. This means: among Burstable pods, the one consuming the most memory above its request dies first.
# Check if a pod was OOMKilled
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Check memory usage vs limits
kubectl top pod <pod> -n <ns>
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].resources.limits.memory}'
Node-Level Eviction#
Independent of container-level OOM kills, the kubelet monitors node resources and evicts pods proactively when thresholds are breached.
Eviction Signals#
| Signal | Description | Default Threshold |
|---|---|---|
| memory.available | Available memory on the node | 100Mi |
| nodefs.available | Available disk on the node filesystem | 10% |
| nodefs.inodesFree | Available inodes | 5% |
| imagefs.available | Available disk on the image filesystem | 15% |
| pid.available | Available PIDs | (none by default) |
Soft vs Hard Eviction#
Soft eviction has a grace period. If memory drops below the soft threshold and stays there for the grace period, the kubelet evicts pods. This gives transient spikes time to resolve.
Hard eviction is immediate. When memory drops below the hard threshold, the kubelet evicts pods with no grace period. Hard thresholds are the last line of defense before the kernel OOM killer takes over (which is less predictable about what it kills).
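A kubelet configuration sketch showing both kinds of thresholds side by side; the 100Mi hard threshold matches the default from the table above, while the soft threshold and grace periods are illustrative values:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"       # breach -> immediate eviction, no grace period
  nodefs.available: "10%"
evictionSoft:
  memory.available: "300Mi"       # illustrative soft threshold
evictionSoftGracePeriod:
  memory.available: "1m30s"       # must stay below the soft threshold this long before eviction
evictionMaxPodGracePeriod: 60     # max seconds given to pods terminated by soft eviction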
Eviction Order#
When the kubelet evicts pods, it follows this priority:
- BestEffort pods using resources above their request (they have no request, so any usage counts)
- Burstable pods using resources above their request
- Burstable pods using resources at or below their request
- Guaranteed pods (only when the node is truly out of resources)
Within each category, the kubelet evicts pods using the most resources above their request first.
Capacity Planning#
The total CPU and memory on a node is not fully available for pods. Allocatable capacity is calculated as:
Allocatable = Total - kube-reserved - system-reserved - eviction-threshold
kube-reserved#
Resources reserved for Kubernetes system daemons: kubelet, container runtime. Without this, pod workloads can starve the kubelet itself, causing the node to become NotReady.
# kubelet config
kubeReserved:
  cpu: 100m
  memory: 256Mi
system-reserved#
Resources reserved for OS-level system processes (sshd, journald, systemd).
systemReserved:
  cpu: 100m
  memory: 256Mi
Practical Example#
A node with 4 CPU cores and 16Gi memory, with 200m CPU and 512Mi reserved for kube and system, and a 100Mi eviction threshold:
Allocatable CPU: 4000m - 200m = 3800m
Allocatable Memory: 16Gi - 512Mi - 100Mi = ~15.4Gi
When planning capacity, use allocatable resources, not total resources. The kubectl describe node output shows both.
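Expressed as kubelet configuration, the reservations from this example would look roughly like the sketch below (the 100m/100m and 256Mi/256Mi split between kube and system reservations is arbitrary):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: 100m
  memory: 256Mi
systemReserved:
  cpu: 100m
  memory: 256Mi
evictionHard:
  memory.available: "100Mi"
# Allocatable: 4000m - 200m = 3800m CPU, 16Gi - 512Mi - 100Mi ≈ 15.4Gi memory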
Resource Quotas and Limit Ranges#
Resource Quotas#
Quotas limit the total aggregate resources a namespace can consume. They prevent a single team from consuming all cluster capacity.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
When a quota is active, every pod in the namespace must specify resource requests and limits (or have them injected by a LimitRange). Pods without them are rejected by the admission controller.
Limit Ranges#
LimitRanges set per-container defaults and constraints within a namespace. They solve the problem of developers forgetting to set resource specifications.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  - type: Container
    default:
      cpu: 500m
      memory: 256Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    max:
      cpu: "2"
      memory: 2Gi
    min:
      cpu: 50m
      memory: 64Mi
When a pod is created without resource specifications, the LimitRange injects the defaults. If a pod specifies resources outside the min/max range, it is rejected.
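For example, a container deployed into team-a with no resources block at all (a hypothetical spec fragment) would come out of admission with the defaults above applied:
containers:
- name: app                     # hypothetical container with no resources specified
  image: example/app:1.0        # placeholder image
  # After admission, the LimitRange above effectively injects:
  #   requests: cpu 100m, memory 128Mi
  #   limits:   cpu 500m, memory 256Mi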
Monitoring Resource Usage#
kubectl top#
kubectl top nodes
kubectl top pods -n <ns> --sort-by=memory
kubectl top pods -n <ns> --sort-by=cpu
This shows current usage, not historical trends. Useful for spot checks, not for capacity planning.
Prometheus Metrics#
The metrics that matter for resource management:
# Actual CPU usage (rate of CPU seconds consumed)
rate(container_cpu_usage_seconds_total{container="myapp"}[5m])
# Memory working set (what matters for OOM decisions)
container_memory_working_set_bytes{container="myapp"}
# Memory RSS (resident set size -- actual physical memory pages)
container_memory_rss{container="myapp"}
# Total memory usage (includes filesystem cache -- misleading for OOM analysis)
container_memory_usage_bytes{container="myapp"}
Critical distinction: container_memory_usage_bytes includes the filesystem page cache. A container reading large files will show high memory “usage” even though the cache can be reclaimed instantly by the kernel. OOM and eviction decisions effectively track the working set (usage minus inactive file cache), so watch working_set_bytes, not usage_bytes. If you alert on usage_bytes, you will get false alarms from containers that simply read a lot of files.
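To alert on genuine memory pressure, compare the working set against the configured limit. This sketch assumes kube-state-metrics is installed; the kube_pod_container_resource_limits series and its labels vary by kube-state-metrics version:
# Working set as a fraction of the memory limit (alert when above 90%)
container_memory_working_set_bytes{container!=""}
  / on(namespace, pod, container)
    kube_pod_container_resource_limits{resource="memory", unit="byte"}
  > 0.9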
Right-Sizing with VPA#
The Vertical Pod Autoscaler (VPA) observes actual resource usage over time and recommends (or automatically applies) request/limit values. Running VPA in recommendation mode is a safe way to identify over-provisioned and under-provisioned containers:
kubectl get vpa -n <ns>
kubectl describe vpa <name> -n <ns>
The recommendation output shows lower bound, target, upper bound, and uncapped target for both CPU and memory. Use the target as a starting point for setting requests.
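Creating a recommendation-only VPA is itself just a small object. A minimal sketch (the Deployment name is a placeholder, and the VPA components must be installed in the cluster):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa               # hypothetical name
  namespace: team-a
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp                 # placeholder workload
  updatePolicy:
    updateMode: "Off"           # recommend only; never evict or mutate pods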
Common Production Gotchas#
CPU throttling with no visible errors. The container appears slow but logs show nothing wrong. CPU throttling does not produce errors – it pauses the process. Check container_cpu_cfs_throttled_seconds_total. The fix is either to increase the CPU limit or remove it entirely and rely on requests for fair scheduling.
Memory usage looks high but no OOM kills. The container_memory_usage_bytes metric includes filesystem cache, which inflates the apparent usage. Check container_memory_working_set_bytes instead. If working set is well below the limit, the container is fine – the kernel will reclaim cache pages as needed.
Pods evicted on a node with “available” resources. Allocatable resources account for reserved capacity and eviction thresholds. If kube-reserved and system-reserved are not configured, the kubelet’s own resource usage competes with pod workloads, and the eviction thresholds kick in before all allocatable memory is consumed. Configure reservations correctly and plan capacity against allocatable, not total.
Resource quota prevents deployments. When a ResourceQuota is active, pods without resource specifications are rejected. If a LimitRange is not configured to inject defaults, deployments that previously worked will start failing as soon as the quota is applied. Always pair ResourceQuotas with LimitRanges.