Stateful Workload Disaster Recovery#

Stateless workloads are easy to recover – redeploy from Git and they are running. Stateful workloads carry data that cannot be regenerated. Databases, message queues, object stores, and anything with a PersistentVolume needs a deliberate DR strategy that goes beyond “we have Velero.”

The fundamental challenge: you must capture data at a point in time where the application state is consistent, replicate that data to a recovery site, and restore it in the correct order. Get any of these wrong and you recover corrupted data or a broken dependency chain.

PersistentVolume Snapshot Strategies#

CSI VolumeSnapshots#

CSI snapshots are the Kubernetes-native way to snapshot PVs. They work with any CSI driver that supports the snapshot feature (EBS CSI, GCE PD CSI, Azure Disk CSI, Longhorn, Ceph), provided the snapshot CRDs and the snapshot controller are installed in the cluster.

# First, create a VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass
driver: ebs.csi.aws.com    # Match your CSI driver
deletionPolicy: Retain       # Keep snapshot when VolumeSnapshot object is deleted
---
# Take a snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap-20260222
  namespace: production
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-postgres-0
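
Creating the VolumeSnapshot object only requests a snapshot; it is usable for restore once its status.readyToUse field reports true. The following is a minimal sketch of a polling helper, assuming kubectl is configured for the cluster. The function name wait_snapshot_ready and its defaults are illustrative, not part of any tool.

```shell
#!/usr/bin/env bash
# Poll a VolumeSnapshot until status.readyToUse is "true" or the timeout expires.
# Usage: wait_snapshot_ready <namespace> <name> [timeout_seconds] [poll_seconds]
# (hypothetical helper; the name and defaults are assumptions)
wait_snapshot_ready() {
  local ns=$1 name=$2 timeout=${3:-600} poll=${4:-5}
  local deadline=$(( SECONDS + timeout )) ready
  while (( SECONDS < deadline )); do
    # readyToUse is empty until the snapshot controller has populated status
    ready=$(kubectl get volumesnapshot "$name" -n "$ns" \
      -o jsonpath='{.status.readyToUse}' 2>/dev/null)
    if [[ $ready == "true" ]]; then
      return 0
    fi
    sleep "$poll"
  done
  echo "snapshot $ns/$name not ready after ${timeout}s" >&2
  return 1
}
```

For example, `wait_snapshot_ready production postgres-data-snap-20260222 900` would block for up to 15 minutes before a restore step is allowed to proceed.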

Restore from a snapshot by creating a new PVC that references it:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0-restored
  namespace: production
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgres-data-snap-20260222
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io

Cloud-Provider Snapshots#

Cloud snapshots happen at the block storage level. They are faster than file-level backups and can be copied cross-region for DR.

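# AWS: snapshot an EBS volume and copy it to another region
SNAP_ID=$(aws ec2 create-snapshot --region us-east-1 --volume-id vol-0abc123 --description "postgres DR" --query SnapshotId --output text)
# A snapshot can only be copied once it has completed
aws ec2 wait snapshot-completed --region us-east-1 --snapshot-ids "$SNAP_ID"
# The copy request is sent to the destination region, selected with --region
aws ec2 copy-snapshot --region eu-west-1 --source-region us-east-1 --source-snapshot-id "$SNAP_ID" --description "postgres DR copy"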

Automate cross-region snapshot copies with AWS DLM (Data Lifecycle Manager) or equivalent. Without automation, cross-region copies are forgotten within weeks.
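
Automation can also fail silently, so it helps to alert when the newest copy in the DR region is older than the RPO allows. The sketch below assumes GNU date and that the copies carry a tag your automation applies; the function names, the tag, and the threshold are all illustrative.

```shell
#!/usr/bin/env bash
# Return 0 if the given ISO-8601 timestamp is at most max_age seconds old,
# 1 if it is older, 2 if the timestamp cannot be parsed (GNU date required).
# Usage: snapshot_is_fresh <start_time> <max_age_seconds> [now_epoch]
snapshot_is_fresh() {
  local start_time=$1 max_age=$2 now=${3:-$(date +%s)}
  local started
  started=$(date -u -d "$start_time" +%s) || return 2
  (( now - started <= max_age ))
}

# Print the StartTime of the newest completed snapshot in a region that
# carries the given tag. Prints "None" when nothing matches.
# (the tag key/value are assumptions: use whatever your copy job applies)
latest_dr_snapshot_time() {
  local region=$1 tag_key=$2 tag_value=$3
  aws ec2 describe-snapshots --region "$region" --owner-ids self \
    --filters "Name=tag:${tag_key},Values=${tag_value}" "Name=status,Values=completed" \
    --query 'sort_by(Snapshots, &StartTime)[-1].StartTime' --output text
}
```

A scheduled job could then run `snapshot_is_fresh "$(latest_dr_snapshot_time eu-west-1 app postgres)" 86400 || echo "DR snapshot copy is stale"`, where the `app=postgres` tag is hypothetical.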

Application-Consistent vs Crash-Consistent Backups#

This is the distinction that matters most for databases.

Crash-consistent: A snapshot taken at an arbitrary point in time. The volume captures whatever was on disk at that instant, including half-written pages and uncommitted transactions. This is what you get from a raw CSI snapshot or cloud volume snapshot while the database is running.

Application-consistent: The application is quiesced before the snapshot. For databases, this means flushing dirty pages to disk, checkpointing the WAL, and ensuring the data directory is in a recoverable state.

A crash-consistent snapshot of PostgreSQL will usually recover – PostgreSQL replays the WAL on startup, provided the snapshot is atomic and includes the WAL alongside the data directory. But “usually” is not good enough for production DR. Some databases (MySQL with MyISAM tables, MongoDB running without journaling) can be left unrecoverable by a crash-consistent snapshot.

PostgreSQL Application-Consistent Snapshot#

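# pg_backup_start does not freeze writes; it forces a checkpoint and marks the
# backup start. In PostgreSQL 15+ it must share a session with pg_backup_stop,
# so hold one psql session open in a bash coprocess.
coproc PSQL { kubectl exec -i -n production postgres-0 -- psql -At; }
echo "SELECT pg_backup_start('dr-snapshot', true);" >&"${PSQL[1]}"
read -r START_LSN <&"${PSQL[0]}"   # returns once the checkpoint has finished

# Take the CSI snapshot while the backup is in progress and wait for it
kubectl apply -f volume-snapshot.yaml
kubectl wait -n production volumesnapshot/postgres-data-snap-20260222 --for=jsonpath='{.status.readyToUse}'=true --timeout=10m

# End the backup in the same session, then close it
echo "SELECT lsn FROM pg_backup_stop();" >&"${PSQL[1]}"
read -r STOP_LSN <&"${PSQL[0]}"
echo '\q' >&"${PSQL[1]}"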

Velero Pre/Post Backup Hooks#

Velero supports hooks that run commands in pods before and after backup:

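apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Each hook runs as a separate command and database session, so
    # pg_backup_start/pg_backup_stop (which must share one session) cannot be
    # split across a pre and a post hook. Force a checkpoint before the
    # snapshot instead, and let WAL replay cover what follows.
    pre.hook.backup.velero.io/container: postgres
    pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -c \"CHECKPOINT;\""]'
    pre.hook.backup.velero.io/timeout: 2m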

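For MySQL, note that FLUSH TABLES WITH READ LOCK is released as soon as the client session that took it exits. The pre hook below therefore flushes tables to disk but does not block writes for the duration of the snapshot; holding the lock requires a session that stays open until the snapshot is taken: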

pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "mysql -u root -e \"FLUSH TABLES WITH READ LOCK;\""]'
post.hook.backup.velero.io/command: '["/bin/bash", "-c", "mysql -u root -e \"UNLOCK TABLES;\""]'

Storage Replication for Cross-Cluster DR#

Portworx#

Portworx supports synchronous and asynchronous replication between clusters. Asynchronous replication (PX-DR) sends incremental snapshots to a remote cluster on a schedule.

# Create a replication schedule
storkctl create migrationschedule postgres-dr \
  --cluster-pair remote-cluster \
  --namespaces production \
  --interval 15  # minutes

Portworx also supports synchronous replication (PX-Metro) for zero RPO, but this requires low-latency (<10ms) connections between sites – essentially the same data center or metro area.

Longhorn#

Longhorn supports DR volumes that replicate to an S3-compatible backup target. The secondary cluster mounts the DR volume in standby mode.

apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-data
spec:
  numberOfReplicas: 3
  recurringJobs:
  - name: backup-every-15m
    task: backup
    cron: "*/15 * * * *"
    retain: 10
    labels:
      type: dr

On the DR cluster, create a DR volume pointing to the same backup target:

apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-data-dr
spec:
  fromBackup: "s3://longhorn-backups@us-east-1/backups/postgres-data"
  standby: true

To activate: set standby: false and attach the volume. Longhorn replays the latest backup and the volume becomes read-write.

Rook-Ceph Cross-Cluster#

Rook-Ceph supports RBD mirroring between two Ceph clusters. This provides block-level replication for PVs backed by Ceph.

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-pool
spec:
  replicated:
    size: 3
  mirroring:
    enabled: true
    mode: image
    snapshotSchedules:
    - interval: 5m

Bootstrap the mirror peer between clusters, and Ceph replicates RBD images asynchronously. RPO depends on the snapshot schedule interval.

Database Operator DR#

Modern database operators handle DR natively. Use them instead of building custom snapshot pipelines.

CloudNativePG (PostgreSQL)#

CloudNativePG supports continuous backup to object storage and point-in-time recovery (PITR):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: production-pg
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://pg-backups/production
      s3Credentials:
        accessKeyID:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"
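---
# Scheduled backups are a separate resource, not a field of the Cluster
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: daily
spec:
  schedule: "0 0 2 * * *"    # Six fields: the first is seconds
  backupOwnerReference: self
  cluster:
    name: production-pg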

Restore to a DR cluster by creating a new Cluster resource pointing to the backup location:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: production-pg-restored
spec:
  instances: 3
  bootstrap:
    recovery:
      source: production-pg
      recoveryTarget:
        targetTime: "2026-02-22T06:00:00Z"   # Point-in-time
  externalClusters:
  - name: production-pg
    barmanObjectStore:
      destinationPath: s3://pg-backups/production
      s3Credentials:
        accessKeyID: { name: aws-creds, key: ACCESS_KEY_ID }
        secretAccessKey: { name: aws-creds, key: SECRET_ACCESS_KEY }

Percona Operator (MySQL/MongoDB)#

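Percona XtraDB Cluster Operator supports on-demand and scheduled backups to S3. Scheduled backups are configured under backup.schedule in the cluster resource; an on-demand backup is requested with its own resource: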

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  name: daily-backup
spec:
  pxcCluster: production-mysql
  storageName: s3-backup

MongoDB Community Operator#

The MongoDB Community Operator does not include built-in backup CRDs. Use Percona Backup for MongoDB (PBM) as a sidecar or external tool for consistent backups with PITR support.

Message Queue DR#

Kafka MirrorMaker 2#

MirrorMaker 2 replicates topics between Kafka clusters. Deploy it with Strimzi as a KafkaMirrorMaker2 resource, which runs the MirrorMaker 2 connectors on a Kafka Connect cluster:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: dr-mirror
spec:
  version: 3.7.0
  replicas: 3
  connectCluster: dr-cluster
  clusters:
  - alias: primary
    bootstrapServers: primary-kafka-bootstrap:9092
  - alias: dr-cluster
    bootstrapServers: dr-kafka-bootstrap:9092
  mirrors:
  - sourceCluster: primary
    targetCluster: dr-cluster
    topicsPattern: ".*"
    groupsPattern: ".*"

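MirrorMaker 2 replicates topic data and topic configuration, can sync topic ACLs, and can translate consumer group offsets for the groups matched by groupsPattern when its checkpoint connector is configured. With the default replication policy, mirrored topics are prefixed with the source alias on the target (orders becomes primary.orders), so consumers that fail over must subscribe to the prefixed names. RPO depends on replication lag, typically seconds.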

RabbitMQ Shovel#

Shovel moves messages from a queue on one broker to a queue on another. For DR, configure a dynamic shovel as a runtime parameter:

rabbitmqctl set_parameter shovel dr-orders \
  '{"src-protocol": "amqp091", "src-uri": "amqp://primary:5672", "src-queue": "orders",
    "dest-protocol": "amqp091", "dest-uri": "amqp://dr-site:5672", "dest-queue": "orders"}'

Shovel is point-to-point. For full cluster replication, use Federation or configure Shovel for each critical queue.
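
Configuring a shovel for each critical queue is easy to script. The sketch below generates the same definition as the command above for a list of queues; the function names and the dr-<queue> naming scheme are assumptions, and it assumes queue names and URIs contain no characters that need JSON escaping.

```shell
#!/usr/bin/env bash
# Print the shovel definition for one queue, using the same keys as the
# single-queue example: same queue name on source and destination.
shovel_definition() {
  local queue=$1 src_uri=$2 dest_uri=$3
  printf '{"src-protocol": "amqp091", "src-uri": "%s", "src-queue": "%s", "dest-protocol": "amqp091", "dest-uri": "%s", "dest-queue": "%s"}\n' \
    "$src_uri" "$queue" "$dest_uri" "$queue"
}

# Declare one shovel named dr-<queue> for each queue given.
# Usage: create_dr_shovels <src_uri> <dest_uri> <queue>...
create_dr_shovels() {
  local src_uri=$1 dest_uri=$2 queue
  shift 2
  for queue in "$@"; do
    rabbitmqctl set_parameter shovel "dr-${queue}" \
      "$(shovel_definition "$queue" "$src_uri" "$dest_uri")" || return 1
  done
}
```

For example, `create_dr_shovels amqp://primary:5672 amqp://dr-site:5672 orders payments` would declare shovels dr-orders and dr-payments (payments is a hypothetical second queue).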

The Ordering Problem#

This is where most DR recoveries fail in practice. Kubernetes resources have dependencies, and restoring them in the wrong order produces errors that cascade.

The correct restore order:

  1. Namespaces and RBAC – everything depends on namespaces existing
  2. CRDs – operators need their CRDs before they can reconcile
  3. Operators – install and wait for them to be ready
  4. Storage – PVCs, restore PV snapshots, wait for volumes to bind
  5. Databases – restore data, wait for them to become ready
  6. Message queues – restore data, wait for cluster formation
  7. Application workloads – deploy services that depend on databases and queues
  8. Ingress and DNS – only route traffic once everything is healthy

Velero restores resource types in a fixed priority order (CRDs and namespaces first, then storage classes, volume snapshots, PVs, PVCs, secrets, config maps, and service accounts, all ahead of pods and workload controllers), but it does not wait for readiness between steps. A database pod may be “restored” (the Pod object exists) but not yet accepting connections when the application pods start trying to connect.

Handle this with init containers that check dependencies:

initContainers:
- name: wait-for-postgres
  image: busybox:1.36
  command: ['sh', '-c', 'until nc -z postgres.production.svc 5432; do echo waiting; sleep 5; done']
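
The loop above waits forever, which hides a database that never comes back. A minimal sketch of the same check with a deadline follows, written for plain sh so it also runs in a busybox image; the function name wait_for is illustrative. The readiness command is passed in, so nc -z can be swapped for a stronger check such as pg_isready where the image provides it.

```shell
#!/bin/sh
# Retry a readiness command until it succeeds or a deadline passes.
# Usage: wait_for <timeout_seconds> <interval_seconds> <command> [args...]
# Elapsed time is counted in sleep intervals, so use an interval > 0.
wait_for() {
  timeout=$1 interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "gave up after ${elapsed}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

In the init container this would be called as `wait_for 300 5 nc -z postgres.production.svc 5432`. When the deadline passes, the init container fails visibly and Kubernetes restarts it, instead of the pod sitting in Init indefinitely.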

Or rely on restarts: an application that exits when its database connection fails will crashloop until the database is ready, and Kubernetes keeps restarting it with increasing backoff. Give such applications startup probes with generous timeouts so that a slow first connection is not killed prematurely. This is ugly but functional.

The better approach: restore infrastructure (storage, databases, queues) first, validate health, then restore application workloads in a second pass. Two-phase restore is more work to automate but significantly more reliable than hoping everything comes up in the right order.
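
A two-phase restore can be sketched with two Velero restores from the same backup and a health gate in between. This sketch assumes that every data-tier object (StatefulSets, PVCs, Secrets, Services) carries a common label such as dr-tier=data, which is a hypothetical convention, and it relies on Velero's default behavior of skipping objects that already exist in the cluster. The function name and restore names are illustrative.

```shell
#!/usr/bin/env bash
# Two-phase restore sketch: data tier first, health gate, then everything else.
# Usage: two_phase_restore <backup> <namespace> <data_selector> [gate_timeout_seconds]
two_phase_restore() {
  local backup=$1 ns=$2 data_selector=$3 gate_timeout=${4:-900}

  # Phase 1: restore only objects matching the data-tier label selector.
  velero restore create "${backup}-data" --from-backup "$backup" \
    --include-namespaces "$ns" --selector "$data_selector" --wait || return 1

  # Gate: retry until the data-tier pods exist and report Ready.
  # kubectl wait fails immediately when no pods match yet, hence the loop.
  local deadline=$(( SECONDS + gate_timeout ))
  until kubectl wait pod -n "$ns" -l "$data_selector" \
      --for=condition=Ready --timeout=30s; do
    (( SECONDS < deadline )) || return 1
    sleep 5
  done

  # Phase 2: restore the rest of the namespace. Objects restored in phase 1
  # already exist, so Velero leaves them untouched.
  velero restore create "${backup}-apps" --from-backup "$backup" \
    --include-namespaces "$ns" --wait
}
```

For example, `two_phase_restore nightly production dr-tier=data` restores the labeled databases and queues, waits for them to become Ready, and only then brings back the application workloads. Pod readiness is a coarse gate; a real runbook would likely add application-level checks (a test query, a queue health check) before phase 2.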