Choosing Kubernetes Storage#
Storage decisions in Kubernetes are harder to change than almost any other architectural choice. Migrating data between storage backends in production involves downtime, risk, and careful planning. Understand the tradeoffs before provisioning your first PersistentVolumeClaim.
The decision comes down to five criteria: performance (IOPS and latency), durability (can you survive node failure), portability (can you move the workload), cost, and access mode (single pod or shared).
Storage Categories#
Block Storage (ReadWriteOnce)#
Block storage provides a raw disk attached to a single node. A ReadWriteOnce volume can be mounted by only one node at a time (pods co-located on that node can share it; ReadWriteOncePod restricts access to a single pod). This is the most common storage type for databases, caches, and any workload that needs fast, consistent disk I/O.
Cloud Block Storage#
Cloud providers offer managed block devices that attach to VMs over the network. They are durable (replicated across availability zones), snapshottable, and resizable.
| Provider | Service | CSI Driver | Typical baseline IOPS (general-purpose tier) | Max IOPS | Latency |
|---|---|---|---|---|---|
| AWS | EBS | ebs.csi.aws.com | 3,000 (gp3) | 16,000 (gp3), 256,000 (io2) | Sub-ms to low ms |
| Azure | Azure Disk | disk.csi.azure.com | 3,000 (Premium SSD v2 base) | 80,000 (Premium SSD v2) | Sub-ms to low ms |
| GCP | Persistent Disk | pd.csi.storage.gke.io | 3,000 (pd-ssd) | 100,000 (pd-ssd, large disks) | Sub-ms to low ms |
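To make the consumption pattern concrete, here is a minimal sketch of a PersistentVolumeClaim requesting cloud block storage; the fast-ssd StorageClass name is an assumption that matches the class defined later in this section.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteOnce              # block storage: one node at a time
  storageClassName: fast-ssd   # assumed class backed by gp3 / Premium SSD / pd-ssd
  resources:
    requests:
      storage: 100Gi
```

With volumeBindingMode: WaitForFirstConsumer on the class, the volume is created in the availability zone where the consuming pod lands, avoiding cross-zone attach failures.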
Choose cloud block storage when:
- General-purpose database storage (PostgreSQL, MySQL, MongoDB)
- Single-pod workloads that need durable storage surviving node failure
- You need snapshots for backup and point-in-time recovery
- Standard performance requirements (not latency-critical microsecond workloads)
Important: Cloud block IOPS often scales with disk size. A 100GB gp3 volume delivers 3,000 IOPS; you must explicitly provision more if needed. A 1TB pd-ssd delivers more baseline IOPS than a 100GB pd-ssd. Over-provision disk size when you need more IOPS, or use provisioned IOPS tiers (io2, Ultra Disk, Hyperdisk Extreme).
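For the snapshot use case above, backups are typically taken through the VolumeSnapshot API (it requires the external-snapshotter CRDs and controller, which managed clusters usually ship alongside the CSI driver). A minimal sketch, with assumed names ebs-snapshots and postgres-backup:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshots          # assumed name
driver: ebs.csi.aws.com        # swap for your provider's CSI driver
deletionPolicy: Retain         # keep the backing snapshot even if this object is deleted
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-backup        # assumed name
spec:
  volumeSnapshotClassName: ebs-snapshots
  source:
    persistentVolumeClaimName: data-postgres-0   # PVC to snapshot
```

Restore by creating a new PVC whose spec.dataSource references the VolumeSnapshot.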
Local SSD / Local Volumes#
Local volumes use disks physically attached to the node (NVMe SSDs, instance storage). They offer the lowest latency and highest IOPS but provide zero durability guarantee – if the node dies, the data is gone.
Choose local storage when:
- Maximum performance is critical: real-time caches (Redis with persistence), high-throughput databases (ScyllaDB, Cassandra where replication handles durability)
- Temporary high-IOPS scratch space for data processing pipelines
- You can tolerate data loss on node failure because the application handles replication (Cassandra, Elasticsearch, CockroachDB)
Avoid local storage when:
- Your workload is a single-instance database (PostgreSQL, MySQL) without application-level replication
- You cannot tolerate any data loss on node failure
- Pods must be rescheduled to different nodes during maintenance
Local volumes bind pods to specific nodes via node affinity. A pod using local storage cannot be rescheduled to a different node. This significantly impacts maintenance operations: draining a node with local-volume pods requires manual data migration or application-level rebalancing.
Use the local volume type with a StorageClass that sets volumeBindingMode: WaitForFirstConsumer to delay binding until the pod is scheduled:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```

TopoLVM provides dynamic provisioning for local volumes, which is preferable to manually creating PersistentVolume objects for each local disk.
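With the no-provisioner class above, each local disk still needs a PersistentVolume created for it (unless you use a dynamic provisioner such as TopoLVM). A minimal sketch of a statically created local PV; the device path and node name are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-worker-1
spec:
  capacity:
    storage: 500Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-ssd
  local:
    path: /mnt/disks/nvme0    # assumed mount point of the local NVMe disk
  nodeAffinity:               # pins any consuming pod to this node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-1          # assumed node name
```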
File Storage (ReadWriteMany)#
File storage provides a POSIX-compatible filesystem that multiple pods across multiple nodes can mount simultaneously (ReadWriteMany). This is essential for shared data scenarios.
Cloud File Storage#
| Provider | Service | CSI Driver | Throughput | Latency | Cost Model |
|---|---|---|---|---|---|
| AWS | EFS | efs.csi.aws.com | Scales with size (burst) | 1-10 ms | Per GB stored + throughput |
| Azure | Azure Files | file.csi.azure.com | Tier-dependent | 1-5 ms | Per GB provisioned |
| GCP | Filestore | filestore.csi.storage.gke.io | Tier-dependent | Low ms (Basic HDD) to sub-ms (Enterprise) | Per GB provisioned |
Choose cloud file storage when:
- Multiple pods need to read and write the same files (CMS content, shared configuration, machine learning training data)
- You need ReadWriteMany access mode without managing your own file server
- Moderate performance requirements (not database-level IOPS)
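A ReadWriteMany claim is requested the same way as any other PVC; a minimal sketch, assuming the shared-filesystem StorageClass defined later in this section:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cms-content
spec:
  accessModes:
  - ReadWriteMany                  # multiple pods on multiple nodes
  storageClassName: shared-filesystem
  resources:
    requests:
      storage: 50Gi                # elastic backends like EFS ignore this value, but the field is required
```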
Critical warning: Do not use file storage (EFS, Azure Files, Filestore) for database workloads. The latency penalty compared to block storage is severe. A PostgreSQL instance on EFS will perform 10-100x worse than on EBS for write-heavy workloads. File storage is designed for throughput-oriented shared access, not random I/O.
NFS#
Traditional Network File System. Works on any infrastructure, well-understood, widely supported.
Choose NFS when:
- On-premises infrastructure with existing NFS servers
- Simple shared filesystem needs without cloud-specific dependencies
- You need ReadWriteMany without the cost of distributed storage like Ceph
Tradeoffs: NFS is a single point of failure unless you run an HA NFS setup (Pacemaker/Corosync, DRBD). Performance depends heavily on network bandwidth and the NFS server’s disk subsystem. NFSv4 with Kerberos adds authentication but increases complexity.
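A minimal sketch of statically provisioned NFS storage; the server address and export path are assumptions. For dynamic provisioning against an existing NFS server, the nfs-subdir-external-provisioner or csi-driver-nfs projects are common choices.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-shared
spec:
  capacity:
    storage: 200Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
  - nfsvers=4.1
  nfs:
    server: 10.0.0.20        # assumed NFS server address
    path: /exports/k8s       # assumed export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-shared
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""       # empty string: bind to the static PV, skip dynamic provisioning
  volumeName: nfs-shared
  resources:
    requests:
      storage: 200Gi
```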
Rook-Ceph (CephFS + RBD)#
Rook deploys and manages Ceph on Kubernetes. Ceph provides block storage (RBD), file storage (CephFS), and object storage (RGW) from a single distributed system. Rook is a CNCF graduated project.
Choose Rook-Ceph when:
- On-premises or bare-metal environments needing both block and file storage
- You want software-defined storage that scales horizontally
- You need storage replication and self-healing without a cloud provider
- Your cluster has at least three nodes with dedicated disks for Ceph OSDs
Tradeoffs: Ceph is operationally complex. It requires dedicated disks (not shared with the OS), network bandwidth for replication, and monitoring for OSD health, PG states, and cluster balance. Minimum recommended deployment is three nodes with three OSDs each. Do not run Ceph on clusters with fewer than three nodes.
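For block volumes backed by Ceph RBD, the StorageClass follows the pattern in Rook's example manifests; the rook-ceph namespace, replicapool pool name, and secret names below are the defaults from those examples and may differ in your deployment:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com       # <operator-namespace>.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph                        # namespace of the Rook/Ceph cluster
  pool: replicapool                           # a CephBlockPool, e.g. with 3x replication
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Retain
allowVolumeExpansion: true
```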
Object Storage#
Object storage (S3, GCS, Azure Blob) is not traditional block or file storage. It is accessed via HTTP APIs and is designed for large, unstructured data.
Choose object storage when:
- Storing backups, logs, and archives
- Machine learning training datasets
- Application assets (images, videos, documents)
- Any workload where objects are written once and read many times
Access object storage through application SDKs (AWS SDK, Google Cloud Client Libraries) rather than CSI drivers. CSI-based object mounts (Mountpoint for S3, GCS FUSE) provide POSIX-like access but with significant performance caveats: no random write support, high first-byte latency, and metadata operations (ls, stat) are slow.
Storage Class Design#
Create StorageClasses that map to use cases rather than implementation details. Application teams should request storage by what they need, not by which backend provides it:
```yaml
# Fast SSD for databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
allowVolumeExpansion: true
---
# Standard storage for general workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
allowVolumeExpansion: true
---
# Shared filesystem for multi-pod access
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: shared-filesystem
provisioner: efs.csi.aws.com
# Dynamic EFS provisioning additionally requires parameters such as
# provisioningMode: efs-ap, fileSystemId, and directoryPerms.
```

Reclaim Policies#
Delete: PersistentVolume and its underlying storage are deleted when the PVC is deleted. Use for ephemeral or reproducible data (caches, build artifacts, test environments).
Retain: PersistentVolume is kept when the PVC is deleted. The data is preserved but the PV must be manually cleaned up or re-bound. Always use Retain for production databases. Losing a production database because someone deleted a PVC is a preventable disaster.
Set reclaim policy at the StorageClass level. Do not rely on application teams to set it correctly on individual PVCs.
StatefulSet Storage Patterns#
StatefulSets use volumeClaimTemplates to create one PVC per replica. Each replica gets its own stable, persistent storage:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
```

This creates PVCs named data-postgres-0, data-postgres-1, data-postgres-2. Scaling down the StatefulSet does not delete the PVCs – they persist for when you scale back up. Deleting the StatefulSet also does not delete PVCs. This is a safety feature, not a bug.
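One operational consequence worth knowing: volumeClaimTemplates are immutable on an existing StatefulSet, so growing storage means expanding each generated PVC directly (the fast-ssd class above sets allowVolumeExpansion: true). A hedged sketch of the desired end state for one replica's claim:

```yaml
# Resize each PVC in place, for example:
#   kubectl patch pvc data-postgres-0 \
#     -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
# The claim after expansion:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 200Gi    # raised from 100Gi; most CSI drivers grow the filesystem online
```

Repeat for each ordinal; the template itself only changes if you recreate the StatefulSet (for example, delete it with --cascade=orphan and re-apply with the new size).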
Decision Summary#
| Workload | Storage Type | Access Mode | Key Consideration |
|---|---|---|---|
| PostgreSQL / MySQL (single) | Cloud block (gp3/Premium SSD/pd-ssd) | ReadWriteOnce | Use Retain reclaim, snapshot backups |
| Redis / cache with persistence | Local SSD or cloud block | ReadWriteOnce | Local for performance, block for durability |
| Cassandra / ScyllaDB / CockroachDB | Local SSD preferred | ReadWriteOnce | App-level replication handles durability |
| Elasticsearch | Local SSD or cloud block | ReadWriteOnce | Local for hot nodes, block for warm/cold |
| Shared CMS content | Cloud file (EFS/Azure Files) | ReadWriteMany | Latency acceptable for content serving |
| ML training data | Object storage (S3/GCS) | N/A (SDK access) | Use FUSE mount only if POSIX required |
| Kafka / event streaming | Cloud block or local SSD | ReadWriteOnce | Replication handles durability, IOPS matters |
| On-prem, block + file needed | Rook-Ceph | Both | Minimum 3 nodes, dedicated disks |
| On-prem, simple shared storage | NFS | ReadWriteMany | Single point of failure without HA setup |
Common Mistakes#
Using EFS/Azure Files for databases. File storage latency is 10-100x higher than block storage for random I/O. Databases need block storage. This is one of the most frequent Kubernetes storage anti-patterns.
Using block storage when shared access is needed. If multiple pods must write to the same filesystem, block storage (ReadWriteOnce) will not work. You need file storage (ReadWriteMany) or an application-level approach (object storage with SDK access).
Forgetting reclaim policy for production data. The default reclaim policy for dynamically provisioned volumes is Delete. One accidental kubectl delete pvc away from data loss. Set Retain on all production StorageClasses.
Under-provisioning cloud block storage IOPS. A 10GB gp3 volume delivers 3,000 IOPS. If your database needs 10,000 IOPS, either provision IOPS explicitly or provision a larger disk. Disk size and IOPS are coupled in most cloud block storage tiers.
Running Ceph on undersized clusters. Ceph needs dedicated disks, adequate memory for OSDs (typically 4GB or more per OSD with BlueStore), and network bandwidth for replication. Running Ceph on three nodes with shared OS/data disks and 8GB RAM each will result in poor performance and data safety concerns.