Choosing a Kubernetes Backup Strategy#
Kubernetes clusters contain two fundamentally different types of state: cluster state (the Kubernetes objects themselves – Deployments, Services, ConfigMaps, Secrets, CRDs) and application data (the contents of Persistent Volumes). A complete backup strategy must address both. Most backup failures happen because teams back up one but not the other, or because they never test the restore process.
What Needs Backing Up#
Before choosing tools, inventory what your cluster contains:
Cluster state includes all Kubernetes API objects: Deployments, StatefulSets, Services, Ingresses, ConfigMaps, Secrets, CRDs, Custom Resources, RBAC roles and bindings, NetworkPolicies, and PersistentVolumeClaims. This state is stored in etcd. Losing it means losing the entire cluster configuration.
Application data lives on Persistent Volumes. This includes database files (PostgreSQL data directory, MySQL data directory), message queue storage (Kafka logs, RabbitMQ data), application uploads and caches, and any data written to PVCs. Losing this means losing user data.
External dependencies include DNS records, TLS certificates (if not managed by cert-manager), cloud load balancers, IAM roles, and cloud provider resources. These are typically managed by IaC tools (Terraform, Crossplane) and are outside the scope of Kubernetes backup tools.
Backup Layers#
Cluster Level: etcd Snapshots#
etcd is the backing store for all Kubernetes API objects. An etcd snapshot captures the complete cluster state at a point in time.
# Create an etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Choose etcd snapshots for:
- Disaster recovery baseline for self-managed clusters (kubeadm, k3s, RKE)
- Complete cluster rebuild after catastrophic failure
- Compliance requirements that mandate cluster state backups
Limitations: etcd snapshots are all-or-nothing. You cannot restore a single namespace or a single Deployment from an etcd snapshot – you restore the entire cluster state. This makes etcd snapshots unsuitable for application-level recovery. On managed Kubernetes (EKS, GKE, AKS), the provider manages etcd and typically handles snapshots automatically – you do not have direct etcd access.
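For reference, restoring a snapshot on a self-managed control plane looks roughly like the sketch below. The paths and single-member setup follow kubeadm defaults and are assumptions; multi-member clusters also need the --initial-cluster flags, and the exact procedure varies by distribution.
# Stop the control plane first (kubeadm: move the static pod manifests out of
# /etc/kubernetes/manifests so kube-apiserver and etcd shut down)
# Restore the chosen snapshot file into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20240101.db \
  --data-dir /var/lib/etcd-restored
# Point the etcd static pod manifest (or /var/lib/etcd itself) at the restored
# data directory, then put the manifests back so the control plane starts again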
Resource Level: Velero#
Velero is the most widely adopted Kubernetes backup tool. It backs up Kubernetes resources (API objects) and optionally creates PV snapshots or uses file-level backup (via Restic or Kopia) for Persistent Volume data. Backups can be scoped by namespace, label selector, or resource type.
# Back up an entire namespace
velero backup create my-app-backup --include-namespaces my-app
# Back up specific resources by label
velero backup create db-backup --selector app=postgresql
# Schedule recurring backups with retention
velero schedule create daily-backup --schedule="0 2 * * *" \
--include-namespaces my-app --ttl 720h
Choose Velero when:
- You need namespace-level or label-based backup scope – back up specific applications without touching others
- Cluster migration is a use case – Velero can back up from one cluster and restore to another
- You want scheduled backups with retention policies (automatic deletion of old backups)
- You need both resource backup and PV data backup in a coordinated operation
- You use cloud provider storage for backup destination (S3, GCS, Azure Blob) with cross-region durability
- You want pre/post backup hooks to run commands before or after backup (for example, flushing a database’s write-ahead log before snapshotting)
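The hook mechanism is driven by annotations on the pod template. A minimal sketch for a PostgreSQL StatefulSet follows; the names and command are illustrative, and a CHECKPOINT only flushes dirty pages before the snapshot, so the result is still crash-consistent rather than a replacement for a logical dump.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
spec:
  serviceName: postgresql
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
      annotations:
        # Velero runs this command in the named container before snapshotting the pod's volumes
        pre.hook.backup.velero.io/container: postgres
        pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -U postgres -c CHECKPOINT"]'
        pre.hook.backup.velero.io/timeout: 3m
    spec:
      containers:
        - name: postgres
          image: postgres:16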
Limitations: PV snapshots taken by Velero are crash-consistent, not application-consistent. This means the snapshot captures the disk state at the moment of the snapshot, similar to pulling the power cord. For databases, this may result in data that requires crash recovery on restore. Velero’s Restic/Kopia file-level backup is slower than volume snapshots but works with any storage provider. Restore order can be tricky with CRDs – if a Custom Resource is restored before its CRD definition, the restore fails for that resource. Velero’s resource backup captures the state at backup time, which may include stale data (pods in CrashLoopBackOff, completed Jobs).
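Restores use the same CLI. The backup and namespace names below are the illustrative ones from the commands above; restoring into a mapped namespace is a convenient way to run drills without touching the live application.
# Restore everything from a backup
velero restore create my-app-restore --from-backup my-app-backup
# Restore into a different namespace for a drill
velero restore create my-app-drill --from-backup my-app-backup \
  --namespace-mappings my-app:my-app-drill
# Inspect warnings and errors (for example, resources skipped because a CRD was missing)
velero restore describe my-app-drill --details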
Volume Level: CSI VolumeSnapshots#
The VolumeSnapshot API is a Kubernetes-native mechanism for creating point-in-time copies of PersistentVolumes. The storage provider’s CSI driver creates the snapshot at the block level.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: postgres-data
Choose CSI VolumeSnapshots when:
- You need fast, block-level point-in-time copies of specific PVs
- You combine VolumeSnapshots with Velero for coordinated resource + data backup
- You want to clone PVs for testing (restore a snapshot to a new PVC, run tests against production data)
- Your CSI driver supports snapshots (EBS CSI, GCE PD CSI, Azure Disk CSI, Ceph CSI, Longhorn)
Limitations: VolumeSnapshots are scoped to the same cluster and storage class. They are not inherently cross-cluster portable (though some CSI drivers support cross-region snapshot copy). Like Velero PV snapshots, they are crash-consistent, not application-consistent. VolumeSnapshots do not back up Kubernetes resources – only the PV data. Not all CSI drivers support snapshots, and behavior varies across providers.
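Restoring (or cloning) a snapshot means creating a new PVC whose dataSource points at the VolumeSnapshot. A sketch, assuming the snapshot created above and a storage class backed by the same CSI driver:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
spec:
  storageClassName: csi-hostpath-sc   # assumption: must be provisioned by the same CSI driver as the snapshot
  dataSource:
    name: postgres-data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                    # must be at least the size of the source volume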
Application Level: Database Dumps and Exports#
Application-level backups use the application’s own export mechanisms: pg_dump for PostgreSQL, mysqldump for MySQL, mongodump for MongoDB, or application-specific export APIs.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:16
              command:
                - /bin/bash
                - -c
                - |
                  pg_dump -h postgres-primary -U backup_user -Fc mydb \
                    | gzip > /backups/mydb-$(date +%Y%m%d-%H%M%S).dump.gz
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          restartPolicy: OnFailure
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc
Choose application-level backups when:
- You need application-consistent backups – pg_dump produces a logically consistent snapshot that does not require crash recovery
- Point-in-time recovery (PITR) is a requirement – WAL archiving for PostgreSQL or binary log shipping for MySQL enables recovery to any point in time
- You need logical backups that are portable across database versions (a pg_dump from PostgreSQL 15 can be restored on PostgreSQL 16)
- You want granular recovery – restore a single table or a subset of data
- Cross-platform portability matters – a SQL dump restores on any PostgreSQL instance regardless of the underlying storage
Limitations: Application-level backups are slower than snapshots, especially for large databases. They require per-application configuration – each database, message queue, or stateful application needs its own backup CronJob. Monitoring backup success requires checking Job status and validating backup file integrity. Large databases (hundreds of GBs) may take hours to dump; pg_dump runs inside a single transaction, so the output is consistent as of the moment the dump started, but it is already hours stale by the time it finishes, and the long-running transaction can delay vacuum on a busy primary (add --serializable-deferrable if the dump must also be consistent with concurrent serializable transactions).
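For completeness, the restore side of the dump above might look like the following; the host, file name, and the orders table are stand-ins, and custom-format archives are read back with pg_restore.
# Full restore into a scratch instance
gunzip -k /backups/mydb-20240101-000000.dump.gz
pg_restore -h postgres-restore-test -U backup_user -d mydb \
  --clean --if-exists --no-owner /backups/mydb-20240101-000000.dump
# Granular recovery: restore a single table from the same archive
pg_restore -h postgres-restore-test -U backup_user -d mydb \
  --table=orders /backups/mydb-20240101-000000.dump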
Comparison Table#
| Criteria | etcd Snapshot | Velero | CSI VolumeSnapshot | Application-Level |
|---|---|---|---|---|
| Scope | Entire cluster state | Namespace/label/resource | Single PV | Single application |
| Consistency | Point-in-time (cluster state) | Crash-consistent (PV data) | Crash-consistent | Application-consistent |
| Restore granularity | All-or-nothing | Namespace, resource type, label | Single PV | Table, database, or custom |
| Speed (backup) | Fast (seconds) | Medium (minutes) | Fast (seconds to minutes) | Slow (minutes to hours) |
| Speed (restore) | Fast | Medium | Fast | Slow |
| Cross-cluster portable | No (cluster-specific) | Yes | Depends on CSI driver | Yes |
| Backs up K8s resources | Yes (all) | Yes (selectable) | No | No |
| Backs up PV data | No | Yes (snapshot or file-level) | Yes | Yes (application data) |
| PITR support | No | No | No | Yes (with WAL/binlog archiving) |
| Operational complexity | Low | Medium | Low | High (per-application) |
The Layered Strategy#
No single backup mechanism covers all requirements. The recommended approach combines multiple layers, each serving a specific recovery scenario.
Layer 1: etcd snapshots (for self-managed clusters). Schedule etcd snapshots every 6-12 hours. Store them off-cluster (S3, GCS, NFS). This is your cluster-level disaster recovery baseline. If the entire cluster is lost, you can rebuild from the etcd snapshot.
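A minimal way to automate this layer is a script run from cron on each control-plane node. The sketch below assumes the kubeadm certificate paths from earlier; the bucket name, local path, and retention are illustrative.
#!/bin/bash
# /usr/local/bin/etcd-backup.sh - snapshot etcd and copy it off-cluster
# (invoke from cron, e.g. "0 */6 * * * root /usr/local/bin/etcd-backup.sh")
set -euo pipefail
SNAPSHOT=/backup/etcd-$(date +%Y%m%d-%H%M).db
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Off-cluster copy; bucket name is illustrative
aws s3 cp "$SNAPSHOT" s3://my-cluster-backups/etcd/
# Prune local snapshots older than 14 days
find /backup -name 'etcd-*.db' -mtime +14 -delete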
Layer 2: Velero for namespace-level backup. Schedule Velero backups for each application namespace daily. Include resource backup and PV snapshots. Store backups in a different region than the cluster. This gives you namespace-level recovery – if an application’s namespace is accidentally deleted or corrupted, Velero restores it.
Layer 3: Application-level dumps for databases. Schedule pg_dump, mysqldump, or equivalent for every database. Run every 6 hours for production data. Enable WAL archiving (PostgreSQL) or binary log shipping (MySQL) for point-in-time recovery between dumps. Store dumps in object storage with lifecycle policies for retention.
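For the PostgreSQL case, the WAL-archiving half of this layer can be switched on with ALTER SYSTEM. The archive directory below is a placeholder to show the mechanism; production setups usually hand archiving to a dedicated tool such as pgBackRest or WAL-G that ships segments to object storage.
# Enable WAL archiving on the primary; archive_mode needs a restart, archive_command only a reload
psql -U postgres -c "ALTER SYSTEM SET archive_mode = on;"
psql -U postgres -c "ALTER SYSTEM SET archive_command = 'test ! -f /wal-archive/%f && cp %p /wal-archive/%f';"
psql -U postgres -c "SELECT pg_reload_conf();"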
This layered approach means: cluster-level disaster is covered by etcd snapshots, application-level disaster is covered by Velero, and data-level recovery (restore to a specific point in time, recover a single table) is covered by application-level backups.
Testing: The Backup That Is Not Tested Does Not Exist#
The most common backup failure mode is discovering during a real incident that the backup is corrupted, incomplete, or that the restore process has never been validated. Schedule regular restore drills:
- Monthly: Restore a Velero backup to a separate namespace or test cluster. Verify that applications start and data is accessible.
- Quarterly: Restore an application-level database backup to a test instance. Run application integration tests against it to verify data integrity.
- Annually (or after major changes): For self-managed clusters, practice a full cluster rebuild from etcd snapshot. Time it. Document the steps that were not automated.
Automate what you can. A CronJob that restores the latest pg_dump to a test database and runs a verification query is better than a manual process that gets skipped when the team is busy.
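A sketch of what that might look like, assuming a scratch PostgreSQL instance reachable at postgres-restore-test, the backup PVC from the earlier CronJob, and a stand-in orders table for the verification query (credential handling via PGPASSWORD or .pgpass is omitted):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restore-drill
spec:
  schedule: "0 5 * * 0"   # weekly, after the nightly backups have run
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: verify
              image: postgres:16
              command:
                - /bin/bash
                - -c
                - |
                  set -euo pipefail
                  latest=$(ls -t /backups/mydb-*.dump.gz | head -n 1)
                  echo "Verifying restore of $latest"
                  # --create --clean drops and recreates the database named in the dump
                  gunzip -c "$latest" | pg_restore -h postgres-restore-test -U backup_user \
                    --clean --if-exists --create -d postgres
                  # Hypothetical sanity check: fail the Job if a key table is empty
                  rows=$(psql -h postgres-restore-test -U backup_user -d mydb -tAc \
                    "SELECT count(*) FROM orders")
                  test "$rows" -gt 0
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
                  readOnly: true
          restartPolicy: Never
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc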
Common Mistakes#
Backing up PVs but not Kubernetes resources. A restored Persistent Volume is useless without the PersistentVolumeClaim that binds it, the Deployment that mounts it, the Secret that holds the database credentials, and the Service that routes traffic to the application. Always back up resources and data together.
Relying solely on etcd snapshots. etcd snapshots restore everything or nothing. You cannot recover a single namespace or undo an accidental kubectl delete from an etcd snapshot without restoring the entire cluster state, which overwrites everything else.
Assuming crash-consistent snapshots are sufficient for databases. A crash-consistent snapshot of a PostgreSQL data directory requires PostgreSQL’s crash recovery to run on startup. This usually works, but “usually” is not a backup guarantee. Application-level backups (pg_dump) produce logically consistent backups that do not require crash recovery.
Not backing up Secrets and ConfigMaps. If Secrets are not included in your backup scope (or if they are encrypted at rest and the encryption key is lost), restoring application deployments is useless – the applications cannot start without their configuration and credentials.
Ignoring backup storage durability. Storing backups on the same storage system as the data they protect defeats the purpose. If the storage cluster fails, you lose both the data and the backup. Use cross-region object storage (S3 with cross-region replication, GCS multi-region buckets) for backup destinations.