Choosing Kubernetes Storage#
Storage decisions in Kubernetes are harder to change than almost any other architectural choice. Migrating data between storage backends in production involves downtime, risk, and careful planning. Understand the tradeoffs before provisioning your first PersistentVolumeClaim.
The decision comes down to five criteria: performance (IOPS and latency), durability (can you survive node failure), portability (can you move the workload), cost, and access mode (single pod or shared).
Storage Categories#
Block Storage (ReadWriteOnce)#
Block storage provides a raw disk attached to a single node. A ReadWriteOnce volume can be mounted by only one node at a time (pods co-located on that node can share it; ReadWriteOncePod restricts access to a single pod). This is the most common storage type for databases, caches, and any workload that needs fast, consistent disk I/O.
Cloud Block Storage#
Cloud providers offer managed block devices that attach to VMs over the network. They are durable (replicated across availability zones), snapshottable, and resizable.
| Provider | Service | CSI Driver | Typical baseline IOPS (general-purpose tier) | Max IOPS | Latency |
|---|---|---|---|---|---|
| AWS | EBS | ebs.csi.aws.com | 3,000 (gp3) | 16,000 (gp3), 256,000 (io2) | Sub-ms to low ms |
| Azure | Azure Disk | disk.csi.azure.com | 3,000 (Premium SSD v2 base) | 80,000 (Premium SSD v2) | Sub-ms to low ms |
| GCP | Persistent Disk | pd.csi.storage.gke.io | 3,000 (pd-ssd) | 100,000 (pd-ssd, large disks) | Sub-ms to low ms |
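To make the consumption pattern concrete, here is a minimal sketch of a PersistentVolumeClaim requesting cloud block storage; the fast-ssd StorageClass name is an assumption that matches the class defined later in this section.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteOnce              # block storage: one node at a time
  storageClassName: fast-ssd   # assumed class backed by gp3 / Premium SSD / pd-ssd
  resources:
    requests:
      storage: 100Gi
```

With volumeBindingMode: WaitForFirstConsumer on the class, the volume is created in the availability zone where the consuming pod lands, avoiding cross-zone attach failures.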
Choose cloud block storage when:
- General-purpose database storage (PostgreSQL, MySQL, MongoDB)
- Single-pod workloads that need durable storage surviving node failure
- You need snapshots for backup and point-in-time recovery
- Standard performance requirements (not latency-critical microsecond workloads)
Important: Cloud block IOPS often scales with disk size. A 100GB gp3 volume delivers 3,000 IOPS; you must explicitly provision more if needed. A 1TB pd-ssd delivers more baseline IOPS than a 100GB pd-ssd. Over-provision disk size when you need more IOPS, or use provisioned IOPS tiers (io2, Ultra Disk, Hyperdisk Extreme).
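For the snapshot use case above, backups are typically taken through the VolumeSnapshot API (it requires the external-snapshotter CRDs and controller, which managed clusters usually ship alongside the CSI driver). A minimal sketch, with assumed names ebs-snapshots and postgres-backup:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshots          # assumed name
driver: ebs.csi.aws.com        # swap for your provider's CSI driver
deletionPolicy: Retain         # keep the backing snapshot even if this object is deleted
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-backup        # assumed name
spec:
  volumeSnapshotClassName: ebs-snapshots
  source:
    persistentVolumeClaimName: data-postgres-0   # PVC to snapshot
```

Restore by creating a new PVC whose spec.dataSource references the VolumeSnapshot.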
Local SSD / Local Volumes#
Local volumes use disks physically attached to the node (NVMe SSDs, instance storage). They offer the lowest latency and highest IOPS but provide zero durability guarantee – if the node dies, the data is gone.
Choose local storage when:
- Maximum performance is critical: real-time caches (Redis with persistence), high-throughput databases (ScyllaDB, Cassandra where replication handles durability)
- Temporary high-IOPS scratch space for data processing pipelines
- You can tolerate data loss on node failure because the application handles replication (Cassandra, Elasticsearch, CockroachDB)
Avoid local storage when:
- Your workload is a single-instance database (PostgreSQL, MySQL) without application-level replication
- You cannot tolerate any data loss on node failure
- Pods must be rescheduled to different nodes during maintenance
Local volumes bind pods to specific nodes via node affinity. A pod using local storage cannot be rescheduled to a different node. This significantly impacts maintenance operations: draining a node with local-volume pods requires manual data migration or application-level rebalancing.
Use the local volume type with a StorageClass that sets volumeBindingMode: WaitForFirstConsumer to delay binding until the pod is scheduled:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```

TopoLVM provides dynamic provisioning for local volumes, which is preferable to manually creating PersistentVolume objects for each local disk.
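With the no-provisioner class above, each local disk still needs a PersistentVolume created for it (unless you use a dynamic provisioner such as TopoLVM). A minimal sketch of a statically created local PV; the device path and node name are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-worker-1
spec:
  capacity:
    storage: 500Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-ssd
  local:
    path: /mnt/disks/nvme0    # assumed mount point of the local NVMe disk
  nodeAffinity:               # pins any consuming pod to this node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-1          # assumed node name
```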
File Storage (ReadWriteMany)#
File storage provides a POSIX-compatible filesystem that multiple pods across multiple nodes can mount simultaneously (ReadWriteMany). This is essential for shared data scenarios.
Cloud File Storage#
| Provider | Service | CSI Driver | Throughput | Latency | Cost Model |
|---|---|---|---|---|---|
| AWS | EFS | efs.csi.aws.com | Scales with size (burst) | 1-10 ms | Per GB stored + throughput |
| Azure | Azure Files | file.csi.azure.com | Tier-dependent | 1-5 ms | Per GB provisioned |
| GCP | Filestore | filestore.csi.storage.gke.io | Tier-dependent | Low ms (Basic HDD) to sub-ms (Enterprise) | Per GB provisioned |
Choose cloud file storage when:
- Multiple pods need to read and write the same files (CMS content, shared configuration, machine learning training data)
- You need ReadWriteMany access mode without managing your own file server
- Moderate performance requirements (not database-level IOPS)
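A ReadWriteMany claim is requested the same way as any other PVC; a minimal sketch, assuming the shared-filesystem StorageClass defined later in this section:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cms-content
spec:
  accessModes:
  - ReadWriteMany                  # multiple pods on multiple nodes
  storageClassName: shared-filesystem
  resources:
    requests:
      storage: 50Gi                # elastic backends like EFS ignore this value, but the field is required
```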
Critical warning: Do not use file storage (EFS, Azure Files, Filestore) for database workloads. The latency penalty compared to block storage is severe. A PostgreSQL instance on EFS will perform 10-100x worse than on EBS for write-heavy workloads. File storage is designed for throughput-oriented shared access, not random I/O.
NFS#
Traditional Network File System. Works on any infrastructure, well-understood, widely supported.
Choose NFS when:
- On-premises infrastructure with existing NFS servers
- Simple shared filesystem needs without cloud-specific dependencies
- You need ReadWriteMany without the cost of distributed storage like Ceph
Tradeoffs: NFS is a single point of failure unless you run an HA NFS setup (Pacemaker/Corosync, DRBD). Performance depends heavily on network bandwidth and the NFS server’s disk subsystem. NFSv4 with Kerberos adds authentication but increases complexity.
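A minimal sketch of statically provisioned NFS storage; the server address and export path are assumptions. For dynamic provisioning against an existing NFS server, the nfs-subdir-external-provisioner or csi-driver-nfs projects are common choices.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-shared
spec:
  capacity:
    storage: 200Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
  - nfsvers=4.1
  nfs:
    server: 10.0.0.20        # assumed NFS server address
    path: /exports/k8s       # assumed export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-shared
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""       # empty string: bind to the static PV, skip dynamic provisioning
  volumeName: nfs-shared
  resources:
    requests:
      storage: 200Gi
```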
Rook-Ceph (CephFS + RBD)#
Rook deploys and manages Ceph on Kubernetes. Ceph provides block storage (RBD), file storage (CephFS), and object storage (RGW) from a single distributed system. Rook is a CNCF graduated project.
Choose Rook-Ceph when:
- On-premises or bare-metal environments needing both block and file storage
- You want software-defined storage that scales horizontally
- You need storage replication and self-healing without a cloud provider
- Your cluster has at least three nodes with dedicated disks for Ceph OSDs
Tradeoffs: Ceph is operationally complex. It requires dedicated disks (not shared with the OS), network bandwidth for replication, and monitoring for OSD health, PG states, and cluster balance. Minimum recommended deployment is three nodes with three OSDs each. Do not run Ceph on clusters with fewer than three nodes.
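For block volumes backed by Ceph RBD, the StorageClass follows the pattern in Rook's example manifests; the rook-ceph namespace, replicapool pool name, and secret names below are the defaults from those examples and may differ in your deployment:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com       # <operator-namespace>.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph                        # namespace of the Rook/Ceph cluster
  pool: replicapool                           # a CephBlockPool, e.g. with 3x replication
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Retain
allowVolumeExpansion: true
```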
Object Storage#
Object storage (S3, GCS, Azure Blob) is not traditional block or file storage. It is accessed via HTTP APIs and is designed for large, unstructured data.
Choose object storage when:
- Storing backups, logs, and archives
- Machine learning training datasets
- Application assets (images, videos, documents)
- Any workload where objects are written once and read many times
Access object storage through application SDKs (AWS SDK, Google Cloud Client Libraries) rather than CSI drivers. CSI-based object mounts (Mountpoint for S3, GCS FUSE) provide POSIX-like access but with significant performance caveats: no random write support, high first-byte latency, and metadata operations (ls, stat) are slow.
Storage Class Design#
Create StorageClasses that map to use cases rather than implementation details. Application teams should request storage by what they need, not by which backend provides it:
```yaml
# Fast SSD for databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
allowVolumeExpansion: true
---
# Standard storage for general workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
allowVolumeExpansion: true
---
# Shared filesystem for multi-pod access
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: shared-filesystem
provisioner: efs.csi.aws.com
# Dynamic EFS provisioning additionally requires parameters such as
# provisioningMode: efs-ap, fileSystemId, and directoryPerms.
```

Reclaim Policies#
Delete: PersistentVolume and its underlying storage are deleted when the PVC is deleted. Use for ephemeral or reproducible data (caches, build artifacts, test environments).
Retain: PersistentVolume is kept when the PVC is deleted. The data is preserved but the PV must be manually cleaned up or re-bound. Always use Retain for production databases. Losing a production database because someone deleted a PVC is a preventable disaster.
Set reclaim policy at the StorageClass level. Do not rely on application teams to set it correctly on individual PVCs.
StatefulSet Storage Patterns#
StatefulSets use volumeClaimTemplates to create one PVC per replica. Each replica gets its own stable, persistent storage:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
```

This creates PVCs named data-postgres-0, data-postgres-1, data-postgres-2. Scaling down the StatefulSet does not delete the PVCs – they persist for when you scale back up. Deleting the StatefulSet also does not delete PVCs. This is a safety feature, not a bug.
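One operational consequence worth knowing: volumeClaimTemplates are immutable on an existing StatefulSet, so growing storage means expanding each generated PVC directly (the fast-ssd class above sets allowVolumeExpansion: true). A hedged sketch of the desired end state for one replica's claim:

```yaml
# Resize each PVC in place, for example:
#   kubectl patch pvc data-postgres-0 \
#     -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
# The claim after expansion:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 200Gi    # raised from 100Gi; most CSI drivers grow the filesystem online
```

Repeat for each ordinal; the template itself only changes if you recreate the StatefulSet (for example, delete it with --cascade=orphan and re-apply with the new size).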
Decision Summary#
| Workload | Storage Type | Access Mode | Key Consideration |
|---|---|---|---|
| PostgreSQL / MySQL (single) | Cloud block (gp3/Premium SSD/pd-ssd) | ReadWriteOnce | Use Retain reclaim, snapshot backups |
| Redis / cache with persistence | Local SSD or cloud block | ReadWriteOnce | Local for performance, block for durability |
| Cassandra / ScyllaDB / CockroachDB | Local SSD preferred | ReadWriteOnce | App-level replication handles durability |
| Elasticsearch | Local SSD or cloud block | ReadWriteOnce | Local for hot nodes, block for warm/cold |
| Shared CMS content | Cloud file (EFS/Azure Files) | ReadWriteMany | Latency acceptable for content serving |
| ML training data | Object storage (S3/GCS) | N/A (SDK access) | Use FUSE mount only if POSIX required |
| Kafka / event streaming | Cloud block or local SSD | ReadWriteOnce | Replication handles durability, IOPS matters |
| On-prem, block + file needed | Rook-Ceph | Both | Minimum 3 nodes, dedicated disks |
| On-prem, simple shared storage | NFS | ReadWriteMany | Single point of failure without HA setup |
Common Mistakes#
Using EFS/Azure Files for databases. File storage latency is 10-100x higher than block storage for random I/O. Databases need block storage. This is one of the most frequent Kubernetes storage anti-patterns.
Using block storage when shared access is needed. If multiple pods must write to the same filesystem, block storage (ReadWriteOnce) will not work. You need file storage (ReadWriteMany) or an application-level approach (object storage with SDK access).
Forgetting reclaim policy for production data. The default reclaim policy for dynamically provisioned volumes is Delete. One accidental kubectl delete pvc away from data loss. Set Retain on all production StorageClasses.
Under-provisioning cloud block storage IOPS. A 10GB gp3 volume delivers 3,000 IOPS. If your database needs 10,000 IOPS, either provision IOPS explicitly or provision a larger disk. Disk size and IOPS are coupled in most cloud block storage tiers.
Running Ceph on undersized clusters. Ceph needs dedicated disks, adequate memory for OSDs (typically 4GB or more per OSD with BlueStore), and network bandwidth for replication. Running Ceph on three nodes with shared OS/data disks and 8GB RAM each will result in poor performance and data safety concerns.