Kubernetes Disaster Recovery Runbooks#
These runbooks cover the incidents you will encounter in production Kubernetes environments. Each follows the same structure: detection, diagnosis, recovery, and prevention. Print these out, bookmark them, put them in your on-call wiki. When the alert fires at 2 AM, you want a checklist, not a tutorial.
Incident Response Framework#
Every incident follows the same cycle:
- Detect – monitoring alert, user report, or kubectl showing unhealthy state
- Assess – determine scope and severity. Is it one pod, one node, or the entire cluster?
- Contain – stop the bleeding. Prevent the issue from spreading
- Recover – restore normal operation
- Post-mortem – document what happened, why, and how to prevent it
Runbook 1: Node Goes NotReady#
Detection: Node condition changes to Ready=False. Pods on the node are rescheduled (if using Deployments). Monitoring alerts on node status.
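A quick way to spot unhealthy nodes from the CLI (a minimal sketch; column positions assume the default kubectl output):
# List any node whose STATUS column is not plain Ready
kubectl get nodes --no-headers | awk '$2 != "Ready"'
# Or read the Ready condition directly
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'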
Diagnosis:
# Check node status and conditions
kubectl describe node <node-name>
Look at the Conditions section:
| Condition | Meaning |
|---|---|
| Ready=False | Kubelet is not healthy or not communicating |
| MemoryPressure=True | Node is running out of memory |
| DiskPressure=True | Node is running out of disk |
| PIDPressure=True | Too many processes on the node |
| NetworkUnavailable=True | Network plugin not configured |
If you have SSH access to the node:
# Check kubelet
systemctl status kubelet
journalctl -u kubelet --since "15 minutes ago" --no-pager
# Check if the node can reach the API server
curl -k https://<api-server>:6443/healthz
# Check disk space
df -h
# Check memory
free -m
Recovery:
# Restart kubelet if it is stuck
systemctl restart kubelet
# If the node is unrecoverable, drain it first (if it came back briefly)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Then remove it from the cluster
kubectl delete node <node-name>
On cloud providers with auto-scaling groups (ASG on AWS, VMSS on Azure), the unhealthy node will be replaced automatically. Verify by watching for a new node to join:
kubectl get nodes -w
Prevention: Set up node health monitoring. Use the node-problem-detector DaemonSet to surface kernel issues, container runtime problems, and hardware failures as node conditions.
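Once node-problem-detector is running, the extra conditions it adds (for example KernelDeadlock or ReadonlyFilesystem) appear next to the built-in ones. A quick way to review them:
# Dump every condition type and status reported for a node
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'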
Runbook 2: etcd Cluster Degraded or Quorum Lost#
Detection: API server returns errors (connection refused, etcdserver: leader changed). etcdctl endpoint health shows unhealthy members.
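If the API server is still answering some requests, it exposes its own etcd health check, which you can query without SSH to the control plane (a quick first check, assuming your kubeconfig still works):
# The apiserver's readiness endpoint includes an etcd check
kubectl get --raw='/readyz/etcd'
# Verbose view of every check, including etcd
kubectl get --raw='/readyz?verbose'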
Diagnosis:
# Check etcd health (run on a control plane node)
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health --cluster
# Check member list
etcdctl member list --write-out=table
Single member down (quorum maintained): A 3-member cluster survives one failure. The cluster continues to operate normally. Fix or replace the failed member:
# Remove the failed member
etcdctl member remove <member-id>
# Add a new member
etcdctl member add <new-name> --peer-urls=https://<new-ip>:2380
Quorum lost (2 of 3 down): etcd stops accepting writes, so the cluster cannot change state: no new pods, no updates, no configuration changes. This is a critical incident.
Recovery from snapshot:
# On a healthy etcd node, restore from snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--name=etcd-0 \
--initial-cluster=etcd-0=https://10.0.1.10:2380 \
--initial-cluster-token=etcd-cluster-restored \
--initial-advertise-peer-urls=https://10.0.1.10:2380 \
--data-dir=/var/lib/etcd-restored
# Stop etcd, replace data directory, restart
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd
systemctl start etcd
Prevention: Take regular etcd snapshots. For production, run 5 members (survives 2 failures). Automate snapshots:
# Cron job for etcd backup (a crontab entry must be a single line; % is escaped because cron treats it as a newline)
0 */6 * * * ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +\%Y\%m\%d-\%H\%M).db --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
Runbook 3: Control Plane Down (Managed Kubernetes)#
Detection: kubectl commands time out. New pods are not scheduled, but existing pods continue to serve traffic.
This is an important detail: Kubernetes is designed so that the data plane survives control plane outages. Pods keep running, services keep routing, containers keep serving. You just cannot make changes.
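You can confirm this from outside the cluster while the control plane is down. A sketch, with an illustrative application URL and assuming SSH access to a worker node for the second check:
# The load balancer or ingress in front of your apps should still answer
curl -s -o /dev/null -w '%{http_code}\n' https://app.example.com/healthz
# On a node, the container runtime still shows the workloads running
crictl ps | head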
Diagnosis:
# Check if the API server is reachable
kubectl cluster-info
# If this times out, the control plane is down
# Check cloud provider status
# AWS: https://health.aws.amazon.com
# Azure: https://status.azure.com
# GCP: https://status.cloud.google.com
Recovery: For managed Kubernetes (EKS, AKS, GKE), the cloud provider is responsible for control plane availability. Open a support ticket with high severity. There is little you can do except wait.
What to communicate to stakeholders: “Existing services continue to run normally. We cannot deploy new changes or scale workloads until the control plane recovers. Running applications are not affected.”
Prevention: For critical workloads, run multi-cluster with failover. Do not put all workloads in a single cluster with a single control plane.
Runbook 4: Certificate Expiry#
Detection: Kubelet stops communicating with the API server. kubectl returns x509: certificate has expired or is not yet valid. Nodes go NotReady.
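When you cannot get onto a node, you can still read the API server's serving certificate over the network. A minimal check (the API server address is a placeholder):
# Print the expiry of the certificate presented on port 6443
echo | openssl s_client -connect <api-server>:6443 2>/dev/null | openssl x509 -noout -enddate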
Diagnosis:
# Check certificate expiry dates
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
# For kubeadm clusters, check all certificates at once
kubeadm certs check-expiration
Output shows each certificate and its expiry date:
CERTIFICATE EXPIRES RESIDUAL TIME
admin.conf Feb 21, 2027 00:00 UTC 364d
apiserver Feb 21, 2027 00:00 UTC 364d
apiserver-etcd-client Feb 21, 2027 00:00 UTC 364d
apiserver-kubelet-client Feb 21, 2027 00:00 UTC 364d
Recovery:
# Renew all certificates (kubeadm clusters)
kubeadm certs renew all
# Restart control plane components to pick up new certs
systemctl restart kubelet
# If using static pods for control plane, move manifests out and back
mv /etc/kubernetes/manifests/*.yaml /tmp/
# Wait 10 seconds for pods to stop
mv /tmp/*.yaml /etc/kubernetes/manifests/
After renewal, update kubeconfig files on any machine that uses them:
# Regenerate admin.conf
kubeadm kubeconfig user --client-name=admin --org=system:masters > /etc/kubernetes/admin.conf
# Copy to your user's kubeconfig
cp /etc/kubernetes/admin.conf ~/.kube/config
Prevention: Set up monitoring that alerts 30 days before expiry. cert-manager can automate certificate rotation. Prometheus can scrape certificate expiry metrics:
# Prometheus alerting rule
- alert: KubernetesCertificateExpiringSoon
  expr: |
    apiserver_client_certificate_expiration_seconds_count > 0
    and histogram_quantile(0.01, rate(apiserver_client_certificate_expiration_seconds_bucket[5m])) < 2592000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Kubernetes certificate expiring in less than 30 days"
Runbook 5: PVC Data Loss#
Detection: Application reports data missing. PVC status shows Lost or the PV has been deleted.
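Before assuming the worst, confirm what still exists. A quick sketch (the custom columns are chosen for this runbook):
# Check the PVCs, and every PV with its reclaim policy and the claim it belonged to
kubectl get pvc -n <namespace>
kubectl get pv -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name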
Assessment: Check the reclaim policy of the storage class:
kubectl get storageclass -o custom-columns=NAME:.metadata.name,RECLAIM:.reclaimPolicy
If reclaimPolicy is Delete, deleting a PVC also deletes the PV and the underlying storage volume. The data is gone unless you have backups or storage-level snapshots.
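If the PV object is already gone, check the storage layer directly for the volume or a snapshot of it. An AWS-flavoured sketch (the volume ID is illustrative):
# Does the backing EBS volume still exist?
aws ec2 describe-volumes --volume-ids vol-0abc123def456789
# Are there snapshots of it to restore from?
aws ec2 describe-snapshots --owner-ids self --filters Name=volume-id,Values=vol-0abc123def456789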
Recovery:
# Option 1: Restore from Velero backup
velero restore create --from-backup <backup-name> \
--include-resources persistentvolumeclaims,persistentvolumes \
--include-namespaces <namespace>
# Option 2: Restore from cloud provider snapshot
# AWS example: create a new EBS volume from snapshot, then create a PV pointing to it
# Manual PV creation from existing volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restored-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  awsElasticBlockStore:
    volumeID: vol-0abc123def456789
    fsType: ext4
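A hand-made PV is not picked up automatically; the workload's PVC has to bind to it explicitly. A minimal sketch, with illustrative names and a size that must match the PV above:
# Create a PVC that binds to the restored PV by name
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
spec:
  storageClassName: ""        # empty string stops dynamic provisioning from creating a new volume
  volumeName: restored-pv     # bind explicitly to the manually created PV
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
EOF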
Prevention:
# Always use Retain for important data: set it in the StorageClass at creation time,
# and patch existing PVs directly (reclaimPolicy on an existing StorageClass cannot be changed)
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
# Schedule Velero backups
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h
Use VolumeSnapshots for point-in-time recovery:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: postgres-data
Runbook 6: Deployment Rollback#
Detection: New deployment is causing errors, increased latency, or crashes.
Immediate action:
# Undo the last rollout
kubectl rollout undo deployment/<name> -n <namespace>
# Verify the rollback is progressing
kubectl rollout status deployment/<name> -n <namespace>
If you need a specific revision:
# Check revision history
kubectl rollout history deployment/<name> -n <namespace>
# Rollback to a specific revision
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=3
# Inspect what a specific revision contained
kubectl rollout history deployment/<name> -n <namespace> --revision=3
For ArgoCD-managed deployments, rollback means reverting the git commit:
git revert <bad-commit>
git push
# ArgoCD will sync automatically (or trigger manual sync)
Prevention: Use progressive delivery (canary or blue-green) so bad deployments affect only a fraction of traffic before full rollout. Set maxUnavailable: 0 in the deployment strategy so the old pods are not removed until new pods pass readiness checks.
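To apply that strategy to an existing Deployment without editing the manifest, a patch along these lines works (the maxSurge value is a judgment call for your capacity):
# Keep all old pods until the new ones pass readiness
kubectl patch deployment <name> -n <namespace> \
  -p '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'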
Runbook 7: Namespace Stuck in Terminating#
Detection: kubectl delete namespace <ns> hangs. kubectl get namespace <ns> shows Terminating status indefinitely.
Diagnosis: Something in the namespace has a finalizer that cannot be processed.
# Find all remaining resources
kubectl api-resources --verbs=list --namespaced -o name | \
xargs -n1 -I{} sh -c 'echo "--- {}:" && kubectl get {} -n <ns> --no-headers 2>/dev/null'
Recovery: For each stuck resource, remove its finalizers:
# Find resources with finalizers
kubectl get <resource-type> <name> -n <ns> -o jsonpath='{.metadata.finalizers}'
# Patch to remove finalizers
kubectl patch <resource-type> <name> -n <ns> \
-p '{"metadata":{"finalizers":null}}' --type=merge
If no individual resources are visible but the namespace is still stuck, the namespace itself may have a finalizer blocking it. Patch it directly via the API:
kubectl get namespace <ns> -o json | \
jq '.spec.finalizers = []' | \
kubectl replace --raw "/api/v1/namespaces/<ns>/finalize" -f -
Prevention: Before deleting a namespace, ensure all CRD controllers that manage resources in that namespace are still running. The most common cause is deleting a CRD operator before deleting the namespace it managed.
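A pre-delete check along these lines can catch the problem early. A sketch that lists namespaced custom resources still present in the namespace, so you know which operators must keep running to process their finalizers:
# List custom resources remaining in the namespace, prefixed by their CRD
kubectl get crd -o jsonpath='{range .items[?(@.spec.scope=="Namespaced")]}{.spec.names.plural}.{.spec.group}{"\n"}{end}' | \
  xargs -n1 -I{} sh -c 'kubectl get {} -n <ns> --no-headers 2>/dev/null | sed "s|^|{} |"'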
Post-Incident#
After every incident, document:
- Timeline – when was it detected, when was it resolved, total impact duration
- Root cause – what actually broke and why
- Impact – which services were affected, how many users impacted
- Resolution – exact steps taken to recover
- Action items – what changes will prevent recurrence (with owners and deadlines)
Store runbooks alongside monitoring configuration. The alert that fires should link directly to the runbook that resolves it.