Kubernetes Disaster Recovery Runbooks#

These runbooks cover the incidents you will encounter in production Kubernetes environments. Each follows the same structure: detection, diagnosis, recovery, and prevention. Print these out, bookmark them, put them in your on-call wiki. When the alert fires at 2 AM, you want a checklist, not a tutorial.

Incident Response Framework#

Every incident follows the same cycle:

  1. Detect – monitoring alert, user report, or kubectl showing unhealthy state
  2. Assess – determine scope and severity. Is it one pod, one node, or the entire cluster? (see the quick triage commands after this list)
  3. Contain – stop the bleeding. Prevent the issue from spreading
  4. Recover – restore normal operation
  5. Post-mortem – document what happened, why, and how to prevent it
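
For step 2 (Assess), a quick triage sketch, assuming kubectl access to the cluster:

# Cluster-wide health at a glance: nodes, non-running pods, recent events
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running
kubectl get events -A --sort-by=.lastTimestamp | tail -20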

Runbook 1: Node Goes NotReady#

Detection: Node condition changes to Ready=False. Pods on the node are evicted and rescheduled onto other nodes after the eviction timeout (about five minutes by default), provided a Deployment or other controller manages them. Monitoring alerts on node status.
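
A quick way to see which nodes are affected (assuming the default kubectl output columns):

# Print only nodes whose STATUS is not exactly "Ready"
kubectl get nodes --no-headers | awk '$2 != "Ready"'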

Diagnosis:

# Check node status and conditions
kubectl describe node <node-name>

Look at the Conditions section:

Condition                  Meaning
Ready=False                Kubelet is not healthy or not communicating
MemoryPressure=True        Node is running out of memory
DiskPressure=True          Node is running out of disk
PIDPressure=True           Too many processes on the node
NetworkUnavailable=True    Network plugin not configured

If you have SSH access to the node:

# Check kubelet
systemctl status kubelet
journalctl -u kubelet --since "15 minutes ago" --no-pager

# Check if the node can reach the API server
curl -k https://<api-server>:6443/healthz

# Check disk space
df -h

# Check memory
free -m

Recovery:

# Restart kubelet if it is stuck
systemctl restart kubelet

# If the node is unrecoverable, drain it first so remaining pods are evicted
# (this only completes if the kubelet still responds; otherwise skip to deletion)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Then remove it from the cluster
kubectl delete node <node-name>

On cloud providers with auto-scaling groups (ASG on AWS, VMSS on Azure), the unhealthy node will be replaced automatically. Verify by watching for a new node to join:

kubectl get nodes -w

Prevention: Set up node health monitoring. Use the node-problem-detector daemonset to surface kernel issues, container runtime problems, and hardware failures as node conditions.
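
To check what node-problem-detector is reporting, dump a node's conditions; the extra condition types (for example KernelDeadlock or ReadonlyFilesystem) depend on how it is configured:

# Print every condition type and status the node currently reports
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'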

Runbook 2: etcd Cluster Degraded or Quorum Lost#

Detection: API server returns errors (connection refused, etcdserver: leader changed). etcdctl endpoint health shows unhealthy members.

Diagnosis:

# Check etcd health (run on a control plane node)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster

# Check member list
etcdctl member list --write-out=table

Single member down (quorum maintained): A 3-member cluster survives 1 failure. The cluster continues to operate normally. Fix or replace the failed member:

# Remove the failed member
etcdctl member remove <member-id>

# Add a new member
etcdctl member add <new-name> --peer-urls=https://<new-ip>:2380
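
The member remove and member add commands need the same endpoint and TLS flags as the health check above; they are omitted here for brevity. One way to avoid repeating them is to export the flags as environment variables, which etcdctl reads (assuming the standard kubeadm certificate paths):

# etcdctl picks these up in place of the corresponding command-line flags
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key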

Quorum lost (2 of 3 down): etcd cannot commit writes and the API server fails most requests. No new pods, no updates, no configuration changes. This is a critical incident.

Recovery from snapshot:

# On a healthy etcd node, restore from snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.1.10:2380 \
  --initial-cluster-token=etcd-cluster-restored \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --data-dir=/var/lib/etcd-restored

# Stop etcd, replace data directory, restart
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd
systemctl start etcd
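
If etcd runs as a kubeadm static pod rather than a systemd unit, the same swap looks roughly like this (assuming the default manifest and data-directory paths):

# kubelet stops the etcd pod once its manifest disappears
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd
# Moving the manifest back makes kubelet recreate the pod with the restored data
mv /tmp/etcd.yaml /etc/kubernetes/manifests/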

Prevention: Take regular etcd snapshots. For production, run 5 members (survives 2 failures). Automate snapshots:

# Cron job for etcd backup
0 */6 * * * ETCDCTL_API=3 etcdctl snapshot save \
  /backup/etcd-$(date +\%Y\%m\%d-\%H\%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
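
Verify that snapshots are actually restorable as part of the backup job; a quick integrity check (newer etcd releases move this under etcdutl):

# Show hash, revision, total keys, and size of a saved snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-<timestamp>.db --write-out=table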

Runbook 3: Control Plane Down (Managed Kubernetes)#

Detection: kubectl commands time out. New pods are not scheduled. Existing pods, however, continue to run and serve traffic.

This is an important detail: Kubernetes is designed so that the data plane survives control plane outages. Pods keep running, services keep routing, containers keep serving. You just cannot make changes.
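
One way to confirm the data plane is still healthy is to probe an application endpoint directly, bypassing the Kubernetes API entirely (the URL is a placeholder):

# Hits the app through its load balancer, not through kubectl
curl -s -o /dev/null -w '%{http_code}\n' https://app.example.com/healthz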

Diagnosis:

# Check if the API server is reachable
kubectl cluster-info
# If this times out, the control plane is down

# Check cloud provider status
# AWS: https://health.aws.amazon.com
# Azure: https://status.azure.com
# GCP: https://status.cloud.google.com

Recovery: For managed Kubernetes (EKS, AKS, GKE), the cloud provider is responsible for control plane availability. Open a support ticket with high severity. There is little you can do except wait.

What to communicate to stakeholders: “Existing services continue to run normally. We cannot deploy new changes or scale workloads until the control plane recovers. Running applications are not affected.”

Prevention: For critical workloads, run multi-cluster with failover. Do not put all workloads in a single cluster with a single control plane.

Runbook 4: Certificate Expiry#

Detection: Kubelet stops communicating with the API server. kubectl returns x509: certificate has expired or is not yet valid. Nodes go NotReady.

Diagnosis:

# Check certificate expiry dates
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# For kubeadm clusters, check all certificates at once
kubeadm certs check-expiration

Output shows each certificate and its expiry date:

CERTIFICATE                EXPIRES                  RESIDUAL TIME
admin.conf                 Feb 21, 2027 00:00 UTC   364d
apiserver                  Feb 21, 2027 00:00 UTC   364d
apiserver-etcd-client      Feb 21, 2027 00:00 UTC   364d
apiserver-kubelet-client   Feb 21, 2027 00:00 UTC   364d

Recovery:

# Renew all certificates (kubeadm clusters)
kubeadm certs renew all

# The renewed certificates are only picked up when the control plane components
# restart. In kubeadm clusters they run as static pods, so move the manifests
# to a dedicated directory and back to force a restart
mkdir -p /tmp/k8s-manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/
# Wait until the control plane containers have stopped (kubelet re-scans the
# manifest directory about every 20 seconds; check with crictl ps)
mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/

# Restarting the kubelet afterwards ensures it reconnects cleanly
systemctl restart kubelet

After renewal, update kubeconfig files on any machine that uses them:

# Regenerate admin.conf
kubeadm kubeconfig user --client-name=admin --org=system:masters > /etc/kubernetes/admin.conf

# Copy to your user's kubeconfig
cp /etc/kubernetes/admin.conf ~/.kube/config
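
A quick confirmation that the renewal worked:

# Expiry dates should now show roughly a year out
kubeadm certs check-expiration
# And kubectl should respond without x509 errors
kubectl get nodes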

Prevention: Set up monitoring that alerts at least 30 days before expiry. kubeadm renews all control plane certificates automatically when you upgrade the cluster, and cert-manager can automate rotation for the certificates it manages (webhooks, ingress, workload TLS). Prometheus can scrape certificate expiry metrics:

# Prometheus alerting rule
- alert: KubernetesCertificateExpiringSoon
  expr: |
    apiserver_client_certificate_expiration_seconds_count > 0
    and histogram_quantile(0.01, rate(apiserver_client_certificate_expiration_seconds_bucket[5m])) < 2592000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Kubernetes certificate expiring in less than 30 days"

Runbook 5: PVC Data Loss#

Detection: Application reports data missing. PVC status shows Lost or the PV has been deleted.

Assessment: Check the reclaim policy of the storage class:

kubectl get storageclass -o custom-columns=NAME:.metadata.name,RECLAIM:.reclaimPolicy

If reclaimPolicy is Delete, deleting a PVC also deletes the PV and the underlying storage volume. The data is gone unless you have backups or storage-level snapshots.
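
To see which volumes are exposed, list PVs with their reclaim policy and the claim they are bound to (a read-only check):

# PVs with a Delete policy disappear together with their PVC
kubectl get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name,NAMESPACE:.spec.claimRef.namespace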

Recovery:

# Option 1: Restore from Velero backup
velero restore create --from-backup <backup-name> \
  --include-resources persistentvolumeclaims,persistentvolumes \
  --include-namespaces <namespace>

# Option 2: Restore from cloud provider snapshot
# AWS example: create a new EBS volume from snapshot, then create a PV pointing to it
# Manual PV creation from existing volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restored-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  awsElasticBlockStore:
    volumeID: vol-0abc123def456789
    fsType: ext4
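
Once the PV exists, bind a PVC to it explicitly; a minimal sketch (names and size are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
  namespace: <namespace>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: ""      # empty string disables dynamic provisioning
  volumeName: restored-pv   # bind to the manually created PV above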

Prevention:

# Always use Retain for important data. The reclaim policy of an existing
# StorageClass is immutable, so set it when the class is created; for PVs that
# already exist, patch the PV itself
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Schedule Velero backups
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h

Use VolumeSnapshots for point-in-time recovery:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: postgres-data
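
Restoring from a snapshot means creating a new PVC that references it as a dataSource; a minimal sketch (the storage class is a placeholder and must be a CSI class matching the snapshot's driver):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
spec:
  storageClassName: <csi-storage-class>
  dataSource:
    name: postgres-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi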

Runbook 6: Deployment Rollback#

Detection: New deployment is causing errors, increased latency, or crashes.

Immediate action:

# Undo the last rollout
kubectl rollout undo deployment/<name> -n <namespace>

# Verify the rollback is progressing
kubectl rollout status deployment/<name> -n <namespace>

If you need a specific revision:

# Check revision history
kubectl rollout history deployment/<name> -n <namespace>

# Rollback to a specific revision
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=3

# Inspect what a specific revision contained
kubectl rollout history deployment/<name> -n <namespace> --revision=3

For ArgoCD-managed deployments, rollback means reverting the git commit:

git revert <bad-commit>
git push
# ArgoCD will sync automatically (or trigger manual sync)
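
To avoid waiting for ArgoCD's polling interval, trigger the sync manually with the argocd CLI (assuming it is installed and logged in; the app name is a placeholder):

# Force an immediate sync of the reverted commit
argocd app sync <app-name>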

Prevention: Use progressive delivery (canary or blue-green) so bad deployments affect only a fraction of traffic before full rollout. Set maxUnavailable: 0 in the deployment strategy so the old pods are not removed until new pods pass readiness checks.
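
A minimal sketch of that strategy in the Deployment spec (only the relevant fields shown):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove an old pod before its replacement is Ready
      maxSurge: 1         # bring up one extra pod at a time during the rollout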

Runbook 7: Namespace Stuck in Terminating#

Detection: kubectl delete namespace <ns> hangs. kubectl get namespace <ns> shows Terminating status indefinitely.

Diagnosis: Something in the namespace has a finalizer that cannot be processed.

# Find all remaining resources
kubectl api-resources --verbs=list --namespaced -o name | \
  xargs -n1 -I{} sh -c 'echo "--- {}:" && kubectl get {} -n <ns> --no-headers 2>/dev/null'
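
The namespace's own status conditions usually name what is blocking deletion, which is often faster than enumerating every resource type:

# Conditions such as NamespaceContentRemaining list the resources still present
kubectl get namespace <ns> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.message}{"\n"}{end}'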

Recovery: For each stuck resource, remove its finalizers:

# Find resources with finalizers
kubectl get <resource-type> <name> -n <ns> -o jsonpath='{.metadata.finalizers}'

# Patch to remove finalizers
kubectl patch <resource-type> <name> -n <ns> \
  -p '{"metadata":{"finalizers":null}}' --type=merge

If no individual resources are visible but the namespace is still stuck, the namespace itself may have a finalizer blocking it. Patch it directly via the API:

kubectl get namespace <ns> -o json | \
  jq '.spec.finalizers = []' | \
  kubectl replace --raw "/api/v1/namespaces/<ns>/finalize" -f -

Prevention: Before deleting a namespace, ensure all CRD controllers that manage resources in that namespace are still running. The most common cause is deleting a CRD operator before deleting the namespace it managed.

Post-Incident#

After every incident, document:

  1. Timeline – when was it detected, when was it resolved, total impact duration
  2. Root cause – what actually broke and why
  3. Impact – which services were affected, how many users impacted
  4. Resolution – exact steps taken to recover
  5. Action items – what changes will prevent recurrence (with owners and deadlines)

Store runbooks alongside monitoring configuration. The alert that fires should link directly to the runbook that resolves it.