Upgrading Self-Managed Kubernetes Clusters with kubeadm#
Upgrading a kubeadm-managed cluster is a multi-step procedure that must be executed in a precise order. The control plane upgrades first, then worker nodes one at a time. Skipping steps or upgrading in the wrong order causes version skew violations that can break cluster communication.
This article provides the complete operational sequence. Execute each step in order. Do not skip ahead.
Version Skew Policy#
Kubernetes enforces strict version compatibility rules between components. Violating these rules results in undefined behavior – sometimes things work, sometimes the API server rejects requests, sometimes components silently fail.
Rules:
- kubeadm must match the target version exactly.
- kubelet may be up to 3 minor versions older than kube-apiserver – never newer.
- kube-controller-manager and kube-scheduler must not be newer than kube-apiserver and may be up to 1 minor version older.
- kube-proxy must not be newer than kube-apiserver and, like the kubelet, may be up to 3 minor versions older (since v1.28).
- kubectl is supported within 1 minor version of kube-apiserver, older or newer.
The critical rule: you can only upgrade one minor version at a time. To go from 1.29 to 1.31, you must upgrade to 1.30 first, verify, then upgrade to 1.31.
Pre-Upgrade Checklist#
Complete every item before starting the upgrade.
Step 1: Verify Current Cluster State#
# All nodes must be Ready
kubectl get nodes -o wide
# All system pods must be Running
kubectl get pods -n kube-system
# Check current versions of all components
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion,PROXY:.status.nodeInfo.kubeProxyVersion
# Verify the API server version
kubectl version

If any node is NotReady or any system pod is in a crash loop, fix those issues first. Do not upgrade a cluster that is already unhealthy.
Step 2: Check API Deprecations#
Most Kubernetes releases remove API versions that were previously deprecated. Use pluto detect-all-in-cluster --target-versions k8s=v1.31 for a comprehensive scan. Check Helm releases too: pluto detect-helm --target-versions k8s=v1.31. Fix deprecated APIs before upgrading – if you upgrade first, manifests that use removed API versions can no longer be applied, leaving you with workloads you cannot modify.
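Fixing a flagged manifest usually means bumping its apiVersion and adjusting any renamed fields. A minimal sketch using the kubectl convert plugin (the file names here are hypothetical):

# Requires the kubectl-convert plugin; rewrites a manifest to a supported API version
kubectl convert -f deployment-old.yaml --output-version apps/v1 > deployment-new.yaml
# Review the changes before applying
diff deployment-old.yaml deployment-new.yaml
kubectl apply -f deployment-new.yaml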
Step 3: Review Release Notes and Addon Compatibility#
Read the changelog for the target version, focusing on removed APIs, changed defaults, and known issues. Verify that your CNI plugin, CSI drivers, ingress controller, cert-manager, and monitoring stack support the target version. List current addons with helm list -A.
Step 4: Check PodDisruptionBudgets#
PDBs that allow zero disruptions will block node drains. Find them with kubectl get pdb -A and look for ALLOWED DISRUPTIONS: 0. Either scale up the affected workload to free PDB budget or temporarily relax the PDB.
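For example, a Deployment with one replica behind a PDB with minAvailable: 1 allows zero disruptions; adding a replica frees the budget. A sketch with hypothetical names (adjust the patch if the PDB uses maxUnavailable instead):

# Scale the workload up so the PDB budget allows an eviction
kubectl scale deployment my-app -n my-namespace --replicas=2
# Or temporarily relax the PDB (restore it after the upgrade)
kubectl patch pdb my-app-pdb -n my-namespace -p '{"spec":{"minAvailable":0}}'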
Step 5: Back Up etcd#
This is the most important step. If the upgrade goes wrong, this backup is your recovery path.
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-pre-upgrade-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-pre-upgrade-*.db --write-out=table

Copy the snapshot off-cluster (S3, GCS, NFS, or another machine). A backup stored only on the node being upgraded is useless if that node fails.
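A minimal sketch of the off-cluster copy, assuming S3 as the target (the bucket name is hypothetical):

# Copy the newest snapshot to object storage
SNAPSHOT=$(ls -t /var/backups/etcd-pre-upgrade-*.db | head -1)
aws s3 cp "$SNAPSHOT" "s3://k8s-backups/etcd/$(basename "$SNAPSHOT")"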
Step 6: Back Up Certificates#
Back up /etc/kubernetes/pki, the kubeadm ConfigMap (kubectl -n kube-system get configmap kubeadm-config -o yaml), and admin.conf.
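A sketch that gathers all three into one directory (the path is arbitrary):

# Back up PKI, the kubeadm ConfigMap, and admin.conf
BACKUP_DIR=/var/backups/k8s-pre-upgrade
mkdir -p "$BACKUP_DIR"
cp -r /etc/kubernetes/pki "$BACKUP_DIR/pki"
kubectl -n kube-system get configmap kubeadm-config -o yaml > "$BACKUP_DIR/kubeadm-config.yaml"
cp /etc/kubernetes/admin.conf "$BACKUP_DIR/admin.conf"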
Control Plane Upgrade#
Step 7: Upgrade kubeadm on the First Control Plane Node#
The first control plane node is special – it runs kubeadm upgrade apply. All subsequent control plane nodes run kubeadm upgrade node.
# On the FIRST control plane node (Debian/Ubuntu)
apt-get update
apt-get install -y --allow-change-held-packages kubeadm=1.31.0-1.1
kubeadm version

For RHEL/CentOS, use yum install -y kubeadm-1.31.0-150500.1.1 --disableexcludes=kubernetes.
Step 8: Run the Upgrade Plan#
# Preview the upgrade (makes no changes)
kubeadm upgrade plan

This outputs:
- Current cluster version
- Available upgrade targets
- Components that will be upgraded
- Any manual actions required
Review the output carefully. If it reports errors or warnings, address them before proceeding.
Step 9: Apply the Upgrade on the First Control Plane Node#
kubeadm upgrade apply v1.31.0

This upgrades the static pod manifests for kube-apiserver, kube-controller-manager, kube-scheduler, and etcd. A successful upgrade ends with SUCCESS! Your cluster was upgraded to "v1.31.0".
Step 10: Upgrade kubelet and kubectl on the First Control Plane Node#
# Drain the node (from a machine with kubectl access)
kubectl drain <first-control-plane-node> --ignore-daemonsets
# On the control plane node
apt-get install -y --allow-change-held-packages kubelet=1.31.0-1.1 kubectl=1.31.0-1.1
systemctl daemon-reload
systemctl restart kubelet
# Uncordon
kubectl uncordon <first-control-plane-node>

Step 11: Verify the First Control Plane Node#
# Confirm the node reports the new version
kubectl get node <first-control-plane-node> -o wide
# Confirm control plane pods are running
kubectl get pods -n kube-system -l tier=control-plane

Step 12: Upgrade Additional Control Plane Nodes#
For each additional control plane node, install the new kubeadm, then run kubeadm upgrade node (not kubeadm upgrade apply – that is only for the first node). Then drain, upgrade kubelet and kubectl, restart kubelet, and uncordon. Wait for each node to report Ready before moving to the next.
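The per-node sequence, using the same Debian/Ubuntu package versions as above:

# On each additional control plane node
apt-get update
apt-get install -y --allow-change-held-packages kubeadm=1.31.0-1.1
kubeadm upgrade node
# From a machine with kubectl access
kubectl drain <control-plane-node> --ignore-daemonsets
# Back on the node
apt-get install -y --allow-change-held-packages kubelet=1.31.0-1.1 kubectl=1.31.0-1.1
systemctl daemon-reload
systemctl restart kubelet
# From the kubectl machine
kubectl uncordon <control-plane-node>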
Worker Node Upgrade#
Step 13: Upgrade Worker Nodes One at a Time#
The process for each worker node is: drain, upgrade kubeadm, run kubeadm upgrade node, upgrade kubelet, restart kubelet, uncordon.
# From the kubectl machine: drain the worker
kubectl drain <worker-node> --ignore-daemonsets --delete-emptydir-data
# On the worker node: upgrade kubeadm
apt-get update
apt-get install -y --allow-change-held-packages kubeadm=1.31.0-1.1
# Upgrade the node configuration
kubeadm upgrade node
# Upgrade kubelet
apt-get install -y --allow-change-held-packages kubelet=1.31.0-1.1
systemctl daemon-reload
systemctl restart kubelet
# From the kubectl machine: uncordon
kubectl uncordon <worker-node>
# Verify
kubectl get node <worker-node> -o wide

Repeat for each worker node. Do not upgrade multiple workers simultaneously unless you have confirmed sufficient remaining capacity. A safe pace is one node at a time with a 5-minute verification pause.
Step 14: Handle Drain Failures#
If kubectl drain gets stuck, check these common causes: a PDB blocking eviction (scale up the affected deployment to free PDB budget), bare pods with no controller (delete them manually – they will not be recreated), or pods using emptyDir volumes (use --delete-emptydir-data and --timeout=300s). For PVC-backed pods, drain only waits for the pod to terminate; the replacement pod re-attaches the volume when it is rescheduled.
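A few commands that help diagnose a stuck drain (pod and namespace names are placeholders):

# See which pods are still running on the node
kubectl get pods -A --field-selector spec.nodeName=<worker-node>
# Check whether a PDB is the blocker
kubectl get pdb -A
# Delete a bare pod with no controller (it will not be recreated)
kubectl delete pod <pod> -n <namespace>
# Retry with a timeout and emptyDir handling
kubectl drain <worker-node> --ignore-daemonsets --delete-emptydir-data --timeout=300s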
Post-Upgrade Validation#
Step 15: Verify the Complete Cluster#
kubectl get nodes -o wide # All nodes at new version and Ready
kubectl get pods -n kube-system # All system pods running
kubectl version # API server reports correct version

Run a smoke test: deploy nginx, expose it, curl the service, then clean up. Test DNS resolution with nslookup kubernetes.default.svc.cluster.local from a test pod.
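A minimal version of that smoke test (resource names are arbitrary):

# Deploy and expose a test workload
kubectl create deployment smoke-test --image=nginx
kubectl expose deployment smoke-test --port=80
# Curl the service from inside the cluster
kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- curl -s http://smoke-test
# Check cluster DNS
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default.svc.cluster.local
# Clean up
kubectl delete service smoke-test
kubectl delete deployment smoke-test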
Step 16: Verify Cluster Addons#
After the core cluster validation, verify that ingress controllers are serving traffic, cert-manager is issuing certificates (kubectl get certificates -A), and CSI drivers are functional (kubectl get csidrivers). Test PVC provisioning by creating a small test claim, confirming it binds, then deleting it.
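A sketch of the PVC test, assuming the cluster has a default StorageClass:

# Create a small test claim
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: upgrade-test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF
# Confirm it binds (a StorageClass with WaitForFirstConsumer binding
# stays Pending until a pod consumes the claim), then clean up
kubectl get pvc upgrade-test-pvc
kubectl delete pvc upgrade-test-pvc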
Rollback Procedures#
Rollback Scenario: Upgrade Failed Midway#
If kubeadm upgrade apply fails, check the control plane state with crictl ps -a (since kubectl may not work). Before applying changes, kubeadm backs up the static pod manifests and etcd data under /etc/kubernetes/tmp/ in timestamped kubeadm-backup-manifests-* and kubeadm-backup-etcd-* directories – restore the manifests from there if needed.
Rollback Scenario: Cluster Unstable After Full Upgrade#
Fix forward (preferred): Most post-upgrade issues are API deprecation problems or addon incompatibilities. Fix manifests and update addons.
Restore from etcd backup (last resort): Reverts the entire cluster state. You lose all changes made after the backup.
# Stop the kubelet so it does not restart the control plane static pods
systemctl stop kubelet
# Containers keep running under the runtime after kubelet stops;
# stop etcd before touching its data directory
crictl stop $(crictl ps -q --name etcd)
# On each etcd member, restore the snapshot
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-pre-upgrade-*.db \
--name=<member-name> \
--initial-cluster=<member-name>=https://<ip>:2380 \
--initial-cluster-token=etcd-cluster-restored \
--initial-advertise-peer-urls=https://<ip>:2380 \
--data-dir=/var/lib/etcd-restored
# Replace etcd data directory
mv /var/lib/etcd /var/lib/etcd-failed
mv /var/lib/etcd-restored /var/lib/etcd
# Downgrade kubeadm, kubelet, and kubectl to the previous version
apt-get install -y --allow-change-held-packages \
kubeadm=1.30.0-1.1 \
kubelet=1.30.0-1.1 \
kubectl=1.30.0-1.1
# Restore the original static pod manifests from your backup
# (kubeadm also keeps one under /etc/kubernetes/tmp/kubeadm-backup-manifests-*)
cp /var/backups/manifests-pre-upgrade/*.yaml /etc/kubernetes/manifests/
# Restart kubelet
systemctl daemon-reload
systemctl start kubelet

This must be done on every control plane node. Worker nodes must also be downgraded to the previous kubeadm and kubelet versions with kubeadm upgrade node and a kubelet restart.
Important: etcd restore rewrites cluster state. Any changes made after the backup was taken will be lost. This is why the backup should be taken immediately before the upgrade, and why fixing forward is almost always preferable to a full restore.
Automation Considerations#
For clusters with many worker nodes (10+), manual node-by-node upgrades are impractical. Consider Ansible playbooks with configurable parallelism, Cluster API (CAPI) with the kubeadm bootstrap provider for rolling machine replacements, or custom scripts with safety guards (node count checks, capacity verification, automatic stop on failure). Whatever automation you use, always upgrade and verify the first control plane node manually before proceeding.
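As one illustration, a bash sketch of a guarded worker rollout – the node list, SSH access, and timings are all assumptions, and this is a starting point rather than a finished tool:

#!/usr/bin/env bash
set -euo pipefail
WORKERS="worker-1 worker-2 worker-3"   # hypothetical inventory
for node in $WORKERS; do
  # Safety guard: stop if any node is unhealthy before touching the next one
  if kubectl get nodes --no-headers | grep -v ' Ready' | grep -q .; then
    echo "Cluster has unhealthy nodes; stopping." >&2
    exit 1
  fi
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  # Run the node upgrade over SSH (assumes root access to each worker)
  ssh "root@$node" 'apt-get update &&
    apt-get install -y --allow-change-held-packages kubeadm=1.31.0-1.1 &&
    kubeadm upgrade node &&
    apt-get install -y --allow-change-held-packages kubelet=1.31.0-1.1 &&
    systemctl daemon-reload && systemctl restart kubelet'
  kubectl uncordon "$node"
  # Verification pause: wait for Ready, then hold before the next node
  kubectl wait --for=condition=Ready "node/$node" --timeout=300s
  sleep 300
done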