Node Drain and Cordon#
Node maintenance is a routine part of cluster operations: kernel patches, instance type changes, Kubernetes upgrades, hardware replacement. The tools are kubectl cordon (stop scheduling new pods) and kubectl drain (evict existing pods). Getting the flags and sequence right is the difference between a seamless operation and a production incident.
Cordon: Mark Unschedulable#
Cordon sets the spec.unschedulable field on a node to true. The scheduler will not place new pods on it, but existing pods continue running undisturbed.
kubectl cordon node-1
# Verify
kubectl get node node-1
# NAME STATUS ROLES AGE VERSION
# node-1 Ready,SchedulingDisabled worker 90d v1.31.0
# Reverse it
kubectl uncordon node-1
Cordon is non-disruptive. Use it when you want to stop new work from landing on a node before you drain it, or when investigating a node issue without immediately evicting workloads.
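Under the hood, cordon is nothing more than a patch to the node object. A minimal equivalent, shown as a sketch:
kubectl patch node node-1 -p '{"spec":{"unschedulable":true}}'
# Uncordon is the reverse
kubectl patch node node-1 -p '{"spec":{"unschedulable":false}}'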
Drain: Evict Pods Safely#
kubectl drain does two things in sequence: it cordons the node, then evicts all pods from it. Eviction goes through the Kubernetes Eviction API, which means PodDisruptionBudgets are respected.
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
What Drain Does Step by Step#
- Cordons the node – marks it unschedulable.
- Identifies all pods on the node – excluding mirror pods (static pods managed by kubelet) and DaemonSet pods (if --ignore-daemonsets is set).
- Sends eviction requests through the Eviction API for each pod. This is not a delete – it is a polite request that respects PDBs (a raw example follows this list).
- Waits for pods to terminate. Each pod gets its terminationGracePeriodSeconds to shut down cleanly.
- Reports completion once all pods are gone or the timeout is reached.
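Step 3 is what distinguishes drain from a plain delete. A minimal sketch of the request drain sends for each pod, using kubectl proxy and curl (the pod name is illustrative):
kubectl proxy --port=8001 &
curl -X POST http://localhost:8001/api/v1/namespaces/default/pods/my-app-abc123/eviction \
  -H 'Content-Type: application/json' \
  -d '{"apiVersion": "policy/v1", "kind": "Eviction", "metadata": {"name": "my-app-abc123", "namespace": "default"}}'
# 201 Created means the eviction was accepted; 429 Too Many Requests means a PDB blocked it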
The Flags That Matter#
# --ignore-daemonsets      skip DaemonSet pods (they will be recreated anyway)
# --delete-emptydir-data   delete pods using emptyDir volumes (data is lost)
# --force                  delete pods not managed by a controller (bare pods)
# --grace-period=30        override pods' terminationGracePeriodSeconds
# --timeout=300s           give up after 5 minutes
# --pod-selector           only drain pods matching this selector
# --disable-eviction       use DELETE instead of the Eviction API (skips PDB checks)
kubectl drain node-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=30 \
  --timeout=300s \
  --pod-selector='app!=critical' \
  --disable-eviction
--ignore-daemonsets: Almost always required. DaemonSet pods run on every node by definition. Drain cannot evict them (the DaemonSet controller would immediately recreate them), so without this flag, drain errors out when it encounters them.
--delete-emptydir-data: Required if any pod uses emptyDir volumes. Drain refuses to evict these pods by default because eviction destroys the data. If the data is ephemeral (caches, temp files), this flag is safe.
--force: Required for pods not managed by a ReplicaSet, Deployment, StatefulSet, or Job. These “bare pods” will not be recreated after eviction. Drain warns you and refuses without this flag.
--grace-period: Overrides the pod’s configured terminationGracePeriodSeconds. Useful when you need to speed up a drain, but be aware that pods may not shut down cleanly if the grace period is too short.
--timeout: How long drain waits for all pods to be evicted. If exceeded, drain exits with an error but the node remains cordoned. Default is no timeout (waits forever).
--disable-eviction: Bypasses the Eviction API entirely and issues direct DELETE requests. This ignores PDBs. Use only as a last resort when PDBs are blocking a drain you must complete.
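Before draining, a quick pre-flight check shows which of these flags you will actually need. A sketch using jq (assumes jq is installed; the node name is illustrative):
# Pods on node-1 using emptyDir volumes (these need --delete-emptydir-data)
kubectl get pods --all-namespaces --field-selector spec.nodeName=node-1 -o json \
  | jq -r '.items[] | select(any(.spec.volumes[]?; .emptyDir)) | "\(.metadata.namespace)/\(.metadata.name)"'
# Bare pods with no owning controller (these need --force)
kubectl get pods --all-namespaces --field-selector spec.nodeName=node-1 -o json \
  | jq -r '.items[] | select(.metadata.ownerReferences == null) | "\(.metadata.namespace)/\(.metadata.name)"'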
PodDisruptionBudgets Blocking Drains#
The most common drain problem is a PDB that will not allow any more disruptions. Drain sends eviction requests through the API server, and the API server rejects evictions that would violate a PDB.
Symptoms: drain hangs indefinitely, printing evicting pod default/my-app-abc123 and retrying every few seconds, but never completing.
Diagnose:
# Find PDBs with zero allowed disruptions
kubectl get pdb --all-namespaces
# NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
# default my-app-pdb 2 N/A 0 30d
# Check why disruptions are zero
kubectl describe pdb my-app-pdb
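# Or pull the same numbers straight from status (standard policy/v1 fields)
kubectl get pdb my-app-pdb -o jsonpath='{.status.currentHealthy} {.status.desiredHealthy} {.status.disruptionsAllowed}{"\n"}'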
# Look at currentHealthy vs desiredHealthy
Common causes:
- Single-replica deployment with minAvailable: 1 – there is never room to evict the one pod. Fix: use maxUnavailable: 1 instead (a sketch follows this list), or scale up before draining.
- Pods already unhealthy – if currentHealthy is already at or below desiredHealthy, no evictions are allowed. Fix the unhealthy pods first.
- Multiple nodes draining simultaneously – the first drain consumed all allowed disruptions. Drain nodes one at a time.
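For the single-replica case, a sketch of the PDB fix (name, namespace, and labels are illustrative; kubectl replace swaps the whole spec, which matters because a PDB may not set both minAvailable and maxUnavailable):
kubectl replace -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: default
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
EOF
# Alternatively, scale up so minAvailable: 1 leaves headroom
kubectl scale deployment my-app --replicas=2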
When it is safe to override:
# Nuclear option: bypass PDB checks entirely
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data --disable-eviction
Only do this when you have confirmed the workload can tolerate the disruption, or when the node is already dead and the pods are not actually running anyway.
Special Pod Types During Drain#
DaemonSet pods: Ignored with --ignore-daemonsets. They keep running on the node until the node is removed or the DaemonSet is deleted. If you are decommissioning the node, they will be cleaned up automatically.
Static pods: Managed directly by kubelet via manifest files in /etc/kubernetes/manifests/. Drain does not touch them. To remove them, delete the manifest file on the node.
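A sketch of what that looks like on the node itself (the manifest path is the kubeadm default and the file name is illustrative; other distributions may configure a different staticPodPath):
# On the node, over SSH
ls /etc/kubernetes/manifests/
# Moving a manifest out of the directory stops the static pod;
# kubelet watches the directory and removes the mirror pod
sudo mv /etc/kubernetes/manifests/my-static-pod.yaml /tmp/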
Pods with local storage (hostPath): Drain evicts these like any other pod. Unlike emptyDir, hostPath data lives on the node's filesystem and survives eviction – but nothing guarantees the replacement pod is scheduled back onto the node that holds the data. Only emptyDir volumes trigger drain's local-storage refusal.
Common Scenarios#
Node Replacement#
# 1. Cordon to stop new scheduling
kubectl cordon node-old
# 2. Verify new node is ready
kubectl get nodes
# 3. Drain the old node
kubectl drain node-old --ignore-daemonsets --delete-emptydir-data --timeout=600s
# 4. Verify pods rescheduled
kubectl get pods --all-namespaces --field-selector spec.nodeName=node-old
# 5. Delete the node object (after decommissioning the VM)
kubectl delete node node-old
Kernel Patching#
# Drain, patch, reboot, uncordon
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# SSH to node, apply patches, reboot
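# Optional: block until the node reports Ready again before uncordoning
# (a sketch – the timeout value is arbitrary)
kubectl wait --for=condition=Ready node/node-1 --timeout=300s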
kubectl uncordon node-1
Rolling Node Upgrades#
When upgrading multiple nodes, drain one at a time. Wait for all evicted pods to be Running on other nodes before draining the next:
for node in node-1 node-2 node-3; do
  echo "Draining $node..."
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  echo "Waiting for pods to stabilize..."
  sleep 30
  # Visual check only – lists anything not yet Running or Completed
  kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
  # Perform maintenance on the node here
  kubectl uncordon "$node"
  echo "Uncordoned $node, waiting before next..."
  sleep 60
done
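The grep check above only prints stragglers; it does not block. A stricter stabilization step, as a sketch (replaces the fixed sleep; the poll interval is arbitrary):
# Wait until no pods are stuck Pending anywhere in the cluster
while kubectl get pods --all-namespaces --field-selector=status.phase=Pending --no-headers 2>/dev/null | grep -q .; do
  echo "Pods still Pending, waiting..."
  sleep 10
done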