Kubernetes Scheduler: How Pods Get Placed on Nodes#

The scheduler (kube-scheduler) watches for newly created pods that have no node assignment. For each unscheduled pod, the scheduler selects the best node and writes a binding back to the API server. The kubelet on that node then starts the pod. If no node is suitable, the pod stays Pending until conditions change.

The scheduler is the reason pods run where they do. Understanding its internals is essential for diagnosing Pending pods, designing placement constraints, and managing cluster utilization.
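
A quick way to surface pods that may be waiting on the scheduler is to filter by phase; a Pending pod that also lacks spec.nodeName has not been scheduled (pod and namespace names in later examples are placeholders):

# List Pending pods across all namespaces; unscheduled pods are a subset of these
kubectl get pods -A --field-selector=status.phase=Pending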

The Scheduling Cycle#

Each scheduling attempt follows a two-phase pipeline: filtering, then scoring.

Unscheduled Pod
  --> Filtering (eliminate infeasible nodes)
  --> Scoring (rank remaining nodes 0-100)
  --> Select highest-scoring node
  --> Binding (write pod.spec.nodeName to API server)

If filtering eliminates all nodes, the pod stays Pending and an event is recorded. If scoring produces a tie, the scheduler picks randomly among the tied nodes.
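
The result of binding is visible on the pod itself: spec.nodeName stays empty until the scheduler writes the binding. A minimal check, with my-pod as a placeholder name:

# Empty output means the pod has not been bound to a node yet
kubectl get pod my-pod -o jsonpath='{.spec.nodeName}'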

Filtering Phase (Predicates)#

Filtering removes nodes that cannot possibly run the pod. Each filter is a binary pass/fail check. A node must pass every filter to remain a candidate.

Resource fit. The node must have enough allocatable CPU and memory to satisfy the pod’s resource requests. This is based on requests, not limits – if a node has 4 CPU allocatable and existing pods request 3.5 CPU, a pod requesting 1 CPU will not fit even if actual usage is low.

# Check allocatable vs requested resources on each node
kubectl describe nodes | grep -A 5 "Allocated resources"
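
The filter sums the requests of pods already on the node plus the incoming pod's requests and compares that against allocatable. A sketch of the values it reads (container name, image, and quantities are illustrative):

spec:
  containers:
  - name: app
    image: my-app:latest
    resources:
      requests:
        cpu: "500m"       # counted by the scheduler during filtering
        memory: "256Mi"
      limits:
        cpu: "1"          # enforced by the kubelet at runtime, ignored for placement
        memory: "512Mi"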

Node selectors and affinity. The pod’s nodeSelector or nodeAffinity must match the node’s labels.

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a", "us-east-1b"]

The required rules are hard constraints used during filtering. The preferred rules are soft constraints used during scoring.
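
For contrast, a preferred rule carries a weight and only influences scoring; a node that fails the match can still be selected. A minimal sketch reusing the zone label above:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50                      # relative weight among preferred terms (1-100)
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a"]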

Taints and tolerations. Nodes can be tainted to repel pods. Only pods with matching tolerations pass the filter.

# Common taint: control plane nodes
kubectl describe node control-plane-1 | grep Taints
# Taints: node-role.kubernetes.io/control-plane:NoSchedule

A pod without a toleration for this taint will never be scheduled to the control plane node.
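
To run on such a node anyway, the pod needs a matching toleration. A minimal example for the control-plane taint shown above:

spec:
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists                    # tolerate the taint regardless of its value
    effect: NoSchedule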

Port conflicts. If a pod requests a specific hostPort, the scheduler checks that the port is not already in use on the node.
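
hostPort is declared on the container port, so two pods asking for the same hostPort can never share a node. Illustrative snippet:

spec:
  containers:
  - name: app
    image: my-app:latest
    ports:
    - containerPort: 8080
      hostPort: 8080                    # nodes where 8080 is already claimed are filtered out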

Volume topology. For pods using PersistentVolumeClaims bound to zone-specific PersistentVolumes, the scheduler only considers nodes in the matching zone.
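
The zone restriction usually comes from nodeAffinity on the PersistentVolume, which the scheduler honors during filtering. A sketch of such a PV (the CSI driver and volume handle are illustrative):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: ebs.csi.aws.com             # illustrative volume source
    volumeHandle: vol-0123456789abcdef0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]        # only nodes in this zone pass the filter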

Pod topology spread. If the pod has topologySpreadConstraints with whenUnsatisfiable: DoNotSchedule, the scheduler rejects nodes where placing the pod would violate maxSkew.
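
A sketch of a hard spread constraint; with DoNotSchedule it acts as a filter, while ScheduleAnyway (covered under scoring below) only affects ranking. The app label is illustrative:

spec:
  topologySpreadConstraints:
  - maxSkew: 1                          # max allowed pod-count difference between domains
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app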

Scoring Phase (Priorities)#

After filtering, the scheduler scores remaining nodes from 0 to 100 on each scoring plugin, then computes a weighted sum.

LeastAllocated. A scoring strategy of the NodeResourcesFit plugin that prefers nodes with the most free resources. This spreads pods across nodes and is the default. Good for clusters where you want to maximize headroom on each node for burst workloads.

MostAllocated. The opposite strategy: prefers nodes with the least free resources. This packs pods tightly (bin-packing) and is useful for cost optimization – concentrate workloads onto fewer nodes so the cluster autoscaler can remove idle ones.

InterPodAffinity. Scores based on pod affinity and anti-affinity preferences. Pods with preferredDuringSchedulingIgnoredDuringExecution anti-affinity will score higher on nodes that do not already run matching pods.
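
A minimal sketch of preferred anti-affinity that nudges replicas onto different nodes without failing scheduling when spreading is impossible (the app label is illustrative):

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # spread across individual nodes
          labelSelector:
            matchLabels:
              app: my-app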

ImageLocality. Prefers nodes that already have the container image cached. Avoids image pull time, which matters for large images.

PodTopologySpread (preferred). Scores based on how evenly pods are distributed across topology domains when whenUnsatisfiable: ScheduleAnyway is set.

Priority and Preemption#

When no node passes filtering, the scheduler checks whether evicting lower-priority pods could make room. This is preemption.

PriorityClass#

Priority is assigned via PriorityClass objects. Higher values mean higher priority.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workload
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Priority class for critical production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-workload
value: 100
globalDefault: false
preemptionPolicy: Never
description: "Low-priority batch jobs that should never preempt other pods"

Two built-in priority classes exist: system-cluster-critical (2000000000) and system-node-critical (2000001000). These are reserved for cluster infrastructure like CoreDNS and kube-proxy.
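
You can list the priority classes available in a cluster, including the built-in ones:

# Shows each priority class with its value and whether it is the global default
kubectl get priorityclasses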

Assign a priority class to a pod:

spec:
  priorityClassName: critical-workload
  containers:
  - name: app
    image: my-app:latest

How Preemption Works#

  1. The scheduler finds a pod cannot be scheduled on any node.
  2. It simulates removing lower-priority pods from each node and re-runs filtering.
  3. If removing lower-priority pods on a node would make the pending pod schedulable, that node is a preemption candidate.
  4. The scheduler selects the candidate that requires the fewest and lowest-priority evictions.
  5. The pod’s status.nominatedNodeName is set to the target node.
  6. Lower-priority pods are evicted (respecting PodDisruptionBudgets where possible).
  7. Once resources are freed, the scheduler binds the pod to the nominated node.
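
While the pod waits for the victims to terminate, the nominated node is visible in its status (my-pod is a placeholder):

# Shows the node the pending pod expects to be bound to after preemption
kubectl get pod my-pod -o jsonpath='{.status.nominatedNodeName}'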

Preemption vs Eviction#

These are different mechanisms:

| Aspect | Preemption | Eviction |
| --- | --- | --- |
| Initiated by | Scheduler | Kubelet |
| Reason | Make room for a higher-priority pod | Node under resource pressure |
| Respects PDBs | Best effort | No (node-pressure eviction ignores PDBs) |
| Signal | Pod stays Pending with nominatedNodeName set | Node reports a MemoryPressure or DiskPressure condition |

Scheduler Profiles and Plugins#

Since Kubernetes 1.19, the scheduler is fully plugin-based. Each scheduling decision point (QueueSort, PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, PreBind, Bind, PostBind) is a plugin extension point.

You can configure multiple scheduler profiles in a single scheduler instance:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 1
      - name: InterPodAffinity
        weight: 2
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated
- schedulerName: bin-packing-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 1
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated

Pods select their scheduler with spec.schedulerName:

spec:
  schedulerName: bin-packing-scheduler
  containers:
  - name: batch-job
    image: batch-processor:latest

Scheduler Extenders#

For custom scheduling logic that cannot be expressed with built-in plugins, scheduler extenders call an external HTTP endpoint during filtering and scoring. The extender receives a list of candidate nodes and returns a filtered or scored subset.

This is useful for integrating with external systems – for example, checking a GPU inventory service before scheduling a GPU workload, or consulting a license server for licensed software.
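
Extenders are declared in KubeSchedulerConfiguration. A minimal sketch, assuming a hypothetical GPU inventory service that exposes filter and prioritize endpoints:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "http://gpu-inventory.example.com/scheduler"
  filterVerb: "filter"                  # POST <urlPrefix>/filter with candidate nodes
  prioritizeVerb: "prioritize"          # POST <urlPrefix>/prioritize to score nodes
  weight: 1
  enableHTTPS: false
  nodeCacheCapable: false
  managedResources:
  - name: example.com/gpu               # only pods requesting this resource reach the extender
    ignoredByScheduler: true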

Debugging Scheduling Failures#

Reading Events#

The scheduler records events on pods explaining why they cannot be scheduled:

# Check events on a Pending pod
kubectl describe pod my-pod -n my-namespace
# Events:
#   Warning  FailedScheduling  default-scheduler
#     0/5 nodes are available: 2 Insufficient memory, 3 node(s) had untolerated taint {dedicated: gpu-workloads}

# Search for scheduling failures across the cluster
kubectl get events --field-selector reason=FailedScheduling -A

The prefix 0/N nodes are available: gives the total node count, followed by how many nodes failed each filter. The per-filter counts add up to N: a node that fails multiple filters is counted only under the first one that rejected it.

Common Failure Messages#

| Message | Meaning | Fix |
| --- | --- | --- |
| Insufficient cpu | Pod requests more CPU than any node has available | Reduce requests, scale up nodes, or evict other pods |
| Insufficient memory | Same for memory | Same approach |
| node(s) didn't match Pod's node affinity/selector | No node has the required labels | Add labels to nodes or fix the selector |
| node(s) had untolerated taint | Node is tainted, pod lacks toleration | Add toleration to pod or remove taint from node |
| persistentvolumeclaim not found | PVC does not exist or is not bound | Create the PVC or wait for dynamic provisioning |
| pod topology spread constraints not satisfiable | Cannot place pod without violating maxSkew | Add more nodes in the deficient topology domain |
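
The label and taint fixes from the table map to one-line commands (node name, label, and taint key are illustrative):

# Add the label that the pod's selector or affinity expects
kubectl label node worker-1 topology.kubernetes.io/zone=us-east-1a

# Remove a taint (the trailing "-" deletes it), or add a toleration to the pod instead
kubectl taint node worker-1 dedicated=gpu-workloads:NoSchedule-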

Pending Pods with No Events#

If a pod is Pending and has no scheduling events, the scheduler has not processed it yet. This can happen when:

  • The scheduler is down or crashlooping (check kubectl get pods -n kube-system -l component=kube-scheduler).
  • The pod uses a custom schedulerName and that scheduler is not running.
  • The cluster has thousands of pending pods and the scheduler has a backlog.

# Verify the scheduler is running and healthy
kubectl get pods -n kube-system -l component=kube-scheduler
kubectl logs -n kube-system kube-scheduler-<node-name> --tail=50

Common Gotchas#

Pods scheduled but immediately evicted. When resource requests are set too low, the scheduler places the pod on a node where it appears to fit, but actual usage exceeds the node’s capacity. The kubelet then evicts the pod under resource pressure. The fix is to set resource requests closer to actual usage – requests represent the guaranteed minimum, and the scheduler uses them to make placement decisions.

DaemonSet pods bypassing the scheduler. Before Kubernetes 1.12, DaemonSets created pods directly with nodeName set, bypassing the scheduler entirely. Modern versions use the scheduler, but if you see DaemonSet pods Pending with scheduling errors, check taints and tolerations – DaemonSets do not automatically tolerate all taints.

Preemption cascades. If you have many priority levels and heavily loaded clusters, a high-priority pod can trigger preemption that displaces a medium-priority pod, which then triggers further preemption of low-priority pods. Use preemptionPolicy: Never on workloads that should wait rather than preempt.