Taints, Tolerations, and Node Affinity#

Pod scheduling in Kubernetes defaults to “run anywhere there is room.” In production, that is rarely what you want. GPU workloads should land on GPU nodes. System components should not compete with application pods. Nodes being drained should stop accepting new work. Taints, tolerations, and node affinity give you control over where pods run and where they do not.

Taints: Repelling Pods from Nodes#

A taint is applied to a node and tells the scheduler “do not place pods here unless they explicitly tolerate this taint.” Taints have three parts: a key, a value, and an effect.

# Add a taint to a node
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

# Remove a taint (trailing hyphen)
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule-

# View taints on a node
kubectl describe node gpu-node-1 | grep -A5 Taints

Taint Effects#

There are three taint effects, and each behaves differently:

  • NoSchedule: New pods without a matching toleration will not be scheduled on the node. Existing pods are not affected.
  • PreferNoSchedule: A soft version of NoSchedule. The scheduler tries to avoid the node but will place pods there if no other option exists.
  • NoExecute: Pods without a matching toleration are evicted from the node immediately, and new pods are also blocked.

NoExecute is the most aggressive effect. When you add a NoExecute taint to a node, every pod that does not tolerate it is evicted right then. This is the mechanism Kubernetes itself uses when it detects node problems: the node lifecycle controller applies NoExecute taints such as node.kubernetes.io/not-ready and node.kubernetes.io/unreachable, and the affected pods are evicted. (A kubectl drain, by contrast, cordons the node and evicts pods through the eviction API rather than through a NoExecute taint.)
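As a quick illustration (node name hypothetical), you can taint a node with NoExecute and watch its non-tolerating pods get terminated and rescheduled:

# Apply a NoExecute taint; pods without a matching toleration are evicted
kubectl taint nodes worker-7 maintenance=true:NoExecute

# Watch the pods on that node terminate and come back elsewhere
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=worker-7 -w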

Tolerations: Allowing Pods on Tainted Nodes#

A toleration in a pod spec says “I can run on nodes with this taint.” Tolerations do not force a pod onto a tainted node – they just remove the restriction.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-training-job
spec:
  template:
    spec:
      tolerations:
        # Exact match: key, value, and effect must all match
        - key: "gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: trainer
          image: training-job:latest

Match Operators#

There are two operators for tolerations:

  • Equal: The key, value, and effect must all match the taint exactly. This is the default.
  • Exists: Only the key must match. The value is ignored. Useful when you do not care about the taint value:

tolerations:
  # Exists operator: matches any taint with key "gpu" regardless of value
  - key: "gpu"
    operator: "Exists"
    effect: "NoSchedule"

  # Tolerate ALL taints (use with extreme caution)
  - operator: "Exists"

tolerationSeconds for NoExecute#

When a NoExecute taint is applied, you can use tolerationSeconds to control how long a pod stays before eviction:

tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300  # Stay for 5 minutes, then get evicted

Without tolerationSeconds, a pod that tolerates a NoExecute taint stays bound indefinitely. With it, the pod remains bound for the specified number of seconds after the taint is added, then gets evicted.

Built-in Taints#

Kubernetes automatically applies taints to nodes in certain conditions. These are the most common:

  • node.kubernetes.io/not-ready: node condition is not Ready
  • node.kubernetes.io/unreachable: node is unreachable from the control plane
  • node.kubernetes.io/memory-pressure: node is running low on memory
  • node.kubernetes.io/disk-pressure: node is running low on disk space
  • node.kubernetes.io/pid-pressure: node is running low on process IDs (PIDs)
  • node.kubernetes.io/unschedulable: node has been cordoned (kubectl cordon)
  • node.kubernetes.io/network-unavailable: node network is not configured

Kubernetes adds default tolerations to every pod for not-ready and unreachable with tolerationSeconds: 300. This means pods survive brief node hiccups (up to 5 minutes) before being evicted and rescheduled.
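These injected defaults are visible on any running pod with kubectl get pod <pod-name> -o yaml and look roughly like this:

tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300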

Node Affinity: Attracting Pods to Nodes#

While taints repel pods, node affinity attracts pods to specific nodes based on node labels. There are two types:

Required Affinity (Hard Rule)#

requiredDuringSchedulingIgnoredDuringExecution means the pod will only be scheduled on nodes matching the expression. If no matching node exists, the pod stays Pending.

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
                  - general
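
A quick way to sanity-check a required rule (pod name hypothetical): list the nodes that carry a matching label, and if a pod is stuck Pending, read the scheduler's explanation in its events.

# Which nodes satisfy the required expression?
kubectl get nodes -l 'node-type in (compute, general)'

# Why is the pod still Pending? The scheduler says so in the events
kubectl describe pod my-pending-pod | grep -A5 Events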

Preferred Affinity (Soft Rule)#

preferredDuringSchedulingIgnoredDuringExecution tells the scheduler to try to place the pod on matching nodes but does not block scheduling if none are available.

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a
        - weight: 20
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1b

The weight (1-100) influences how strongly the scheduler prefers matching nodes. Higher weight means stronger preference. The scheduler sums weights from all matching preferred rules when scoring nodes.

The IgnoredDuringExecution part means that if a node’s labels change after a pod is already running, the pod is not evicted. Affinity rules only apply at scheduling time.
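You can see this for yourself (node name hypothetical, reusing the node-type label from the required example): remove the label and the pod that is already running stays put, while new pods with the same required rule can no longer land on that node.

# Remove the label the affinity rule matched on (trailing hyphen deletes it)
kubectl label nodes node-3 node-type-

# The previously scheduled pod is still running on node-3
kubectl get pods -o wide --field-selector spec.nodeName=node-3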

Node Selectors vs Node Affinity#

nodeSelector is the simpler, older mechanism. It takes a flat map of label key-value pairs and only schedules on nodes matching all of them:

spec:
  nodeSelector:
    node-type: compute
    disk: ssd

Use nodeSelector when you need simple exact-match placement. Use node affinity when you need In, NotIn, Exists, DoesNotExist, or Gt/Lt operators, or when you want preferred (soft) rules with weights.
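For reference, the nodeSelector above can be written as required node affinity; multiple matchExpressions inside a single term are ANDed together, just like nodeSelector keys:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
              - key: disk
                operator: In
                values:
                  - ssd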

Label Strategies for Nodes#

Consistent node labeling makes affinity rules manageable:

# Well-known topology labels (set automatically by cloud providers)
topology.kubernetes.io/zone=us-east-1a
topology.kubernetes.io/region=us-east-1
kubernetes.io/arch=arm64
kubernetes.io/os=linux

# Custom labels for workload isolation
kubectl label nodes node-3 node-type=compute
kubectl label nodes node-4 gpu=nvidia-a100
kubectl label nodes node-5 disk=nvme
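
To confirm the labels landed where you expect, print them as columns:

# Show selected labels as extra columns
kubectl get nodes -L node-type,gpu,disk

# Or dump every label on every node
kubectl get nodes --show-labels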

Combining Taints and Node Affinity#

For dedicated node pools, use both taints and node affinity together. Taints keep unwanted pods off the node. Affinity directs the right pods to it. This is the belt-and-suspenders approach:

# Step 1: Taint the GPU nodes
# kubectl taint nodes gpu-node-1 workload=gpu:NoSchedule
# kubectl label nodes gpu-node-1 workload=gpu

# Step 2: Pod spec with both toleration and affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  template:
    spec:
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload
                    operator: In
                    values:
                      - gpu
      containers:
        - name: trainer
          image: ml-trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1

Without the taint, general workloads could land on the GPU node and waste expensive resources. Without the affinity, the GPU pod might schedule on a non-GPU node (the toleration alone does not direct it anywhere).

Common Gotchas#

DaemonSets and taints. DaemonSets that run monitoring agents, log collectors, or network plugins must tolerate the taints on every node they need to run on. If you taint GPU nodes and your Prometheus node-exporter DaemonSet does not have a matching toleration, those nodes will have no metrics. Always check your DaemonSets when adding new taints.

# DaemonSet that runs everywhere, even on tainted nodes
tolerations:
  - operator: "Exists"  # Tolerate everything

NoExecute eviction surprises. Adding a NoExecute taint evicts pods immediately. If your application has no tolerationSeconds and no PodDisruptionBudget, all replicas on that node can be killed at once. Always pair NoExecute with proper PDBs and consider adding tolerationSeconds to give pods time to drain gracefully.
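A minimal PodDisruptionBudget sketch (name and selector labels hypothetical) that keeps at least two replicas available during voluntary evictions such as kubectl drain:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api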

Taint typos are silent. If you taint a node with gpuu=true:NoSchedule (typo), no pods tolerate it, and the scheduler just avoids the node. Nothing errors out. Verify taints with kubectl describe node after applying them.
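One way to audit taints across the whole cluster and spot typos at a glance:

# List every node with its taint keys (values and effects omitted)
kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINT-KEYS:.spec.taints[*].key'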