Stateful Workload Disaster Recovery#
Stateless workloads are easy to recover – redeploy from Git and they are running again. Stateful workloads carry data that cannot be regenerated. Databases, message queues, object stores, and anything with a PersistentVolume need a deliberate DR strategy that goes beyond “we have Velero.”
The fundamental challenge: you must capture data at a point in time where the application state is consistent, replicate that data to a recovery site, and restore it in the correct order. Get any of these wrong and you recover corrupted data or a broken dependency chain.
PersistentVolume Snapshot Strategies#
CSI VolumeSnapshots#
CSI snapshots are the Kubernetes-native way to snapshot PVs. They work with any CSI driver that supports the snapshot feature (EBS CSI, GCE PD CSI, Azure Disk CSI, Longhorn, Ceph).
```yaml
# First, create a VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass
driver: ebs.csi.aws.com   # Match your CSI driver
deletionPolicy: Retain    # Keep the snapshot when the VolumeSnapshot object is deleted
---
# Take a snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap-20260222
  namespace: production
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-postgres-0
```

Restore from a snapshot by creating a new PVC that references it:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0-restored
  namespace: production
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgres-data-snap-20260222
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```

Cloud-Provider Snapshots#
Cloud snapshots happen at the block storage level. They are faster than file-level backups and can be copied cross-region for DR.
```shell
# AWS: snapshot an EBS volume and copy it to another region
SNAP_ID=$(aws ec2 create-snapshot --volume-id vol-0abc123 \
  --description "postgres DR" --query SnapshotId --output text)
# The copy fails while the snapshot is still pending, so wait for it first
aws ec2 wait snapshot-completed --snapshot-ids "$SNAP_ID"
aws ec2 copy-snapshot --source-region us-east-1 --source-snapshot-id "$SNAP_ID" \
  --destination-region eu-west-1
```

Automate cross-region snapshot copies with AWS DLM (Data Lifecycle Manager) or your cloud's equivalent. Without automation, cross-region copies are forgotten within weeks.
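As an illustrative sketch of what that automation looks like – the role ARN, tag, and retention values here are placeholders, so check the field names against the current EC2 DLM API before using it:

```shell
# Hypothetical DLM policy: daily snapshots of volumes tagged app=postgres,
# each copied to eu-west-1 and retained there for 7 days
aws dlm create-lifecycle-policy \
  --execution-role-arn arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole \
  --description "postgres DR snapshots" \
  --state ENABLED \
  --policy-details '{
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [{"Key": "app", "Value": "postgres"}],
    "Schedules": [{
      "Name": "daily-dr",
      "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
      "RetainRule": {"Count": 7},
      "CrossRegionCopyRules": [{
        "TargetRegion": "eu-west-1",
        "Encrypted": false,
        "RetainRule": {"Interval": 7, "IntervalUnit": "DAYS"}
      }]
    }]
  }'
```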
Application-Consistent vs Crash-Consistent Backups#
This is the distinction that matters most for databases.
Crash-consistent: A snapshot taken at an arbitrary point in time. The volume captures whatever was on disk at that instant, including half-written pages and uncommitted transactions. This is what you get from a raw CSI snapshot or cloud volume snapshot while the database is running.
Application-consistent: The application is quiesced before the snapshot. For databases, this means flushing dirty pages to disk, checkpointing the WAL, and ensuring the data directory is in a recoverable state.
A crash-consistent snapshot of PostgreSQL will usually recover – PostgreSQL replays the WAL on startup. But “usually” is not good enough for production DR. Some databases (MySQL with MyISAM tables, older MongoDB) can produce unrecoverable snapshots from crash-consistent backups.
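Whichever flavor you take, the only proof that a snapshot is recoverable is a test restore. A sketch (the PVC name, PGDATA path, and pod name are assumptions) that boots a throwaway PostgreSQL against a volume restored from a snapshot:

```shell
# Boot a scratch Postgres pod against a PVC restored from the snapshot;
# the claim name "data-postgres-0-restored" is an assumption
kubectl run pg-verify --image=postgres:16 --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"pg-verify","image":"postgres:16",
    "env":[{"name":"PGDATA","value":"/var/lib/postgresql/data/pgdata"}],
    "volumeMounts":[{"name":"data","mountPath":"/var/lib/postgresql/data"}]}],
    "volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"data-postgres-0-restored"}}]}}'
# If WAL replay succeeds, the server reaches "accepting connections"
kubectl exec pg-verify -- pg_isready
```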
PostgreSQL Application-Consistent Snapshot#
```shell
# Freeze writes, take the snapshot, thaw. Since PostgreSQL 15,
# pg_backup_start/stop must run in the SAME session, so hold one
# psql session open across the snapshot instead of two separate execs.
kubectl exec -i -n production postgres-0 -- psql <<'SQL' &
SELECT pg_backup_start('dr-snapshot');
SELECT pg_sleep(60);   -- window during which the snapshot is taken
SELECT pg_backup_stop();
SQL
# Take the CSI snapshot inside the window
kubectl apply -f volume-snapshot.yaml
wait
```

Velero Pre/Post Backup Hooks#
Velero supports hooks that run commands in pods before and after backup:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    pre.hook.backup.velero.io/container: postgres
    pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -c \"SELECT pg_backup_start(''velero'');\""]'
    post.hook.backup.velero.io/container: postgres
    post.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -c \"SELECT pg_backup_stop();\""]'
```

For MySQL:

```yaml
pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "mysql -u root -e \"FLUSH TABLES WITH READ LOCK;\""]'
post.hook.backup.velero.io/command: '["/bin/bash", "-c", "mysql -u root -e \"UNLOCK TABLES;\""]'
```

Caveat: each hook runs in its own exec session. On PostgreSQL 15+ the post hook's pg_backup_stop() fails because the backup was started in a different session, and MySQL releases FLUSH TABLES WITH READ LOCK the moment the pre-hook's client disconnects. For these databases, freeze at the filesystem level (fsfreeze on the data mount) or use an operator-native backup instead.

Storage Replication for Cross-Cluster DR#
Portworx#
Portworx supports synchronous and asynchronous replication between clusters. Asynchronous replication (PX-DR) sends incremental snapshots to a remote cluster on a schedule.
```shell
# Create a replication schedule
storkctl create migration-schedule postgres-dr \
  --cluster-pair remote-cluster \
  --namespaces production \
  --interval 15   # minutes
```

Portworx also supports synchronous replication (PX-Metro) for zero RPO, but this requires low-latency (<10 ms) connections between sites – essentially the same data center or metro area.
Longhorn#
Longhorn supports DR volumes that replicate to an S3-compatible backup target. The secondary cluster mounts the DR volume in standby mode.
```yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-data
spec:
  numberOfReplicas: 3
  recurringJobs:
    - name: backup-every-15m
      task: backup
      cron: "*/15 * * * *"
      retain: 10
      labels:
        type: dr
```

On the DR cluster, create a DR volume pointing to the same backup target:

```yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-data-dr
spec:
  fromBackup: "s3://longhorn-backups@us-east-1/backups/postgres-data"
  standby: true
```

To activate: set `standby: false` and attach the volume. Longhorn replays the latest backup and the volume becomes read-write.
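Activation can be scripted. A sketch with kubectl – Longhorn keeps its Volume CRs in the longhorn-system namespace; verify the resource name and field path against your Longhorn version:

```shell
# Flip the DR volume out of standby so it replays the latest backup
kubectl -n longhorn-system patch volume.longhorn.io postgres-data-dr \
  --type merge -p '{"spec":{"standby":false}}'
```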
Rook-Ceph Cross-Cluster#
Rook-Ceph supports RBD mirroring between two Ceph clusters. This provides block-level replication for PVs backed by Ceph.
```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-pool
spec:
  replicated:
    size: 3
  mirroring:
    enabled: true
    mode: image
    snapshotSchedules:
      - interval: 5m
```

Bootstrap the mirror peer between clusters, and Ceph replicates RBD images asynchronously. RPO depends on the snapshot schedule interval.
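Bootstrapping means exchanging a peer token between the clusters. Under Rook this looks roughly like the following sketch – the daemon name is illustrative, and the operator publishes the token in a `pool-peer-token-<pool>` secret that you import on the secondary cluster; check the Rook RBD mirroring guide for the exact flow:

```yaml
# On each cluster, run an rbd-mirror daemon
apiVersion: ceph.rook.io/v1
kind: CephRBDMirror
metadata:
  name: rbd-mirror
  namespace: rook-ceph
spec:
  count: 1
---
# On the secondary, the pool references the imported bootstrap secret as a peer
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-pool
  namespace: rook-ceph
spec:
  replicated:
    size: 3
  mirroring:
    enabled: true
    mode: image
    peers:
      secretNames:
        - pool-peer-token-replicated-pool
```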
Database Operator DR#
Modern database operators handle DR natively. Use them instead of building custom snapshot pipelines.
CloudNativePG (PostgreSQL)#
CloudNativePG supports continuous backup to object storage and point-in-time recovery (PITR):
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: production-pg
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://pg-backups/production
      s3Credentials:
        accessKeyID:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"
---
# Scheduled backups are a separate CRD, not a field of the Cluster spec
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: daily
spec:
  schedule: "0 0 2 * * *"   # six fields – seconds come first
  backupOwnerReference: self
  cluster:
    name: production-pg
```

Restore to a DR cluster by creating a new Cluster resource pointing to the backup location:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: production-pg-restored
spec:
  instances: 3
  bootstrap:
    recovery:
      source: production-pg
      recoveryTarget:
        targetTime: "2026-02-22T06:00:00Z"   # Point-in-time
  externalClusters:
    - name: production-pg
      barmanObjectStore:
        destinationPath: s3://pg-backups/production
        s3Credentials:
          accessKeyID: { name: aws-creds, key: ACCESS_KEY_ID }
          secretAccessKey: { name: aws-creds, key: SECRET_ACCESS_KEY }
```

Percona Operator (MySQL/MongoDB)#
The Percona XtraDB Cluster Operator supports on-demand and scheduled backups to S3. An on-demand backup is its own resource (schedules live in the PerconaXtraDBCluster spec under `spec.backup.schedule`):

```yaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  name: daily-backup
spec:
  pxcCluster: production-mysql
  storageName: s3-backup
```

MongoDB Community Operator#
The MongoDB Community Operator does not include built-in backup CRDs. Use Percona Backup for MongoDB (PBM) as a sidecar or external tool for consistent backups with PITR support.
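As a sketch of what PBM usage looks like once the `pbm` CLI is configured with a MongoDB connection URI and a storage config (the restore timestamp is a placeholder):

```shell
pbm config --set pitr.enabled=true        # slice the oplog continuously for PITR
pbm backup                                # take a base snapshot backup
pbm list                                  # show backups and the valid PITR window
pbm restore --time="2026-02-22T06:00:00"  # roll forward to a point in time
```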
Message Queue DR#
Kafka MirrorMaker 2#
MirrorMaker 2 replicates topics between Kafka clusters. Deploy it as a KafkaConnect resource with Strimzi:
```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: dr-mirror
spec:
  version: 3.7.0
  replicas: 3
  connectCluster: dr-cluster   # must match the target cluster's alias
  clusters:
    - alias: primary
      bootstrapServers: primary-kafka-bootstrap:9092
    - alias: dr-cluster
      bootstrapServers: dr-kafka-bootstrap:9092
  mirrors:
    - sourceCluster: primary
      targetCluster: dr-cluster
      sourceConnector: {}       # replicates topic data and configuration
      checkpointConnector: {}   # translates consumer group offsets
      topicsPattern: ".*"
      groupsPattern: ".*"
```

MirrorMaker 2 replicates topic data, consumer group offsets, and ACLs. RPO depends on replication lag, typically seconds.
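One consequence worth spelling out: with MirrorMaker 2's default replication policy, mirrored topics are renamed with the source cluster alias as a prefix, so failover consumers on the DR cluster must subscribe to the prefixed name:

```shell
# On the DR cluster, the primary's "orders" topic appears as "primary.orders"
bin/kafka-console-consumer.sh \
  --bootstrap-server dr-kafka-bootstrap:9092 \
  --topic primary.orders --from-beginning
```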
RabbitMQ Shovel#
Shovel moves messages from a queue on one broker to a queue on another. Configure a dynamic shovel as a runtime parameter for DR:

```shell
rabbitmqctl set_parameter shovel dr-orders \
  '{"src-protocol": "amqp091", "src-uri": "amqp://primary:5672", "src-queue": "orders",
    "dest-protocol": "amqp091", "dest-uri": "amqp://dr-site:5672", "dest-queue": "orders"}'
```

Shovel is point-to-point. For full cluster replication, use Federation or configure a Shovel for each critical queue.
The Ordering Problem#
This is where most DR recoveries fail in practice. Kubernetes resources have dependencies, and restoring them in the wrong order produces errors that cascade.
The correct restore order:
1. Namespaces and RBAC – everything depends on namespaces existing
2. CRDs – operators need their CRDs before they can reconcile
3. Operators – install and wait for them to be ready
4. Storage – create PVCs, restore PV snapshots, wait for volumes to bind
5. Databases – restore data, wait for them to become ready
6. Message queues – restore data, wait for cluster formation
7. Application workloads – deploy services that depend on databases and queues
8. Ingress and DNS – only route traffic once everything is healthy
Velero restores in a defined priority order (CRDs first, then namespaces and other cluster-scoped resources, then namespaced resources), but it does not wait for readiness between steps. A database pod may be “restored” (the Pod object exists) but not yet accepting connections when the application pods start trying to connect.
Handle this with init containers that check dependencies:
```yaml
initContainers:
  - name: wait-for-postgres
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z postgres.production.svc 5432; do echo waiting; sleep 5; done']
```

Or use Kubernetes startup probes with generous timeouts for applications that connect to databases on startup. The application will crashloop until the database is ready, and Kubernetes will keep restarting it – this is ugly but functional.
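A startup probe along these lines (the port and thresholds are placeholders for your application) gives the app up to 10 minutes to come up before liveness checks take over:

```yaml
# Tolerate a slow-starting dependency: 60 attempts x 10s = 10 minutes
startupProbe:
  tcpSocket:
    port: 8080        # the application's own port, assumed
  failureThreshold: 60
  periodSeconds: 10
```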
The better approach: restore infrastructure (storage, databases, queues) first, validate health, then restore application workloads in a second pass. Two-phase restore is more work to automate but significantly more reliable than hoping everything comes up in the right order.
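A two-phase restore can be driven with standard tooling. A sketch – the backup name, namespace, and `tier` labels are assumptions, and it presumes your infrastructure and application objects are labeled accordingly:

```shell
# Phase 1: infrastructure – storage, databases, queues
velero restore create phase1-infra --from-backup nightly \
  --include-namespaces production \
  --selector tier=infra
# Validate health before proceeding
kubectl -n production rollout status statefulset/postgres --timeout=10m
kubectl -n production rollout status statefulset/kafka --timeout=10m

# Phase 2: application workloads, only once dependencies are healthy
velero restore create phase2-apps --from-backup nightly \
  --include-namespaces production \
  --selector tier=app
```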