# Choosing a Database Strategy
Every Kubernetes-based platform eventually faces two questions: should the database run inside the cluster or as a managed service, and which database engine fits the workload? These decisions are difficult to reverse. A database migration is one of the highest-risk operations in production. Getting the initial decision roughly right saves months of future pain.
## Where to Run: Kubernetes vs Managed Service
This is not a technology question. It is an organizational question about who owns database operations and what tradeoffs the team will accept.
### Databases on Kubernetes (Operator-Managed)
Running databases inside Kubernetes means deploying StatefulSets (usually via an operator) with persistent volumes, handling failover through the operator’s automation, and managing backups with the operator’s built-in tooling or a cluster-level tool like Velero.
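As an illustration of the backup piece, a Velero `Schedule` that snapshots database volumes nightly might look like the following sketch (the namespace and retention period are hypothetical):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: db-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"      # cron: every night at 02:00
  template:
    includedNamespaces:
      - databases            # hypothetical namespace holding the DB pods
    snapshotVolumes: true    # snapshot the persistent volumes via the CSI/provider plugin
    ttl: 720h                # retain each backup for 30 days
```

Note that volume snapshots alone do not verify restorability; backup verification remains a manual responsibility either way.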
Choose databases on Kubernetes when:
- You run on-premises or hybrid infrastructure where managed services are unavailable
- You want consistent tooling – the same GitOps workflow that deploys your application also deploys and configures the database
- Portability matters – you want to avoid cloud provider lock-in for the data layer
- You have Kubernetes expertise on the team and are comfortable operating StatefulSets
- Cost is a primary concern – managed database services carry a significant premium (often 2-5x the compute cost of equivalent self-managed instances)
- You need database instances for development, staging, and testing environments where managed service costs add up quickly
Limitations: You own the operational burden. Upgrades, patching, failover testing, backup verification, and performance tuning are your responsibility. Storage performance depends on the underlying CSI driver and cloud provider disk types. The operator itself is a dependency – if it has bugs in failover logic, your database suffers.
### Managed Services (RDS, Cloud SQL, Azure Database)
Managed services delegate operations to the cloud provider. You get automated backups, point-in-time recovery, automated minor version patching, high availability with provider-managed failover, and SLA guarantees.
Choose managed services when:
- Your team lacks dedicated DBA expertise and cannot invest in learning database operations
- You need SLA guarantees backed by the provider’s contractual commitments
- Compliance requirements mandate managed services (some regulatory frameworks require provider-managed encryption, patching, and audit logging)
- Operational simplicity is more valuable than cost savings – the team’s time is better spent on application features
- You are single-cloud and lock-in is an acceptable tradeoff for reduced operational burden
- Performance requirements are well-served by the provider’s instance types (major providers offer instance types with hundreds of gigabytes of RAM and provisioned-IOPS storage)
Limitations: Higher cost (the management premium is real). Less control over configuration – some parameters are not exposed. Version upgrades happen on the provider’s schedule, which may lag behind upstream releases. Network latency between your Kubernetes pods and the managed database depends on VPC peering and network configuration.
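Even with a managed database, it helps to give pods a stable in-cluster name for it so application config stays portable. One common pattern is an `ExternalName` Service; the sketch below uses a placeholder RDS endpoint:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres           # pods connect to postgres.app.svc regardless of where the DB runs
  namespace: app
spec:
  type: ExternalName
  # Hypothetical endpoint -- substitute your managed instance's DNS name
  externalName: mydb.example.us-east-1.rds.amazonaws.com
```

If you later move the database in-cluster, only this Service changes, not the application’s connection string.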
## Decision Factors Summary
| Factor | Kubernetes (Self-Managed) | Managed Service |
|---|---|---|
| Operational burden | High – you manage everything | Low – provider manages HA, backups, patching |
| Cost | Lower compute cost, higher ops cost | Higher compute cost, lower ops cost |
| Team expertise required | Kubernetes + DBA skills | Basic database usage |
| Portability | High – runs anywhere with K8s | Low – provider-specific |
| SLA | You define and enforce | Provider-backed contractual SLA |
| Customization | Full control over configuration | Limited to exposed parameters |
| Dev/staging environments | Cheap to spin up in-cluster | Expensive – each instance costs money |
| Compliance | You prove compliance | Provider handles many controls |
## Operators for Databases on Kubernetes
If you choose the Kubernetes path, the operator you select matters as much as the database engine.
CloudNativePG is a PostgreSQL operator originally created by EDB. It manages each PostgreSQL cluster through a single Cluster custom resource. Features include automated failover (a replica is promoted to primary), continuous WAL archiving to object storage, declarative backup schedules, and rolling updates. It is the most actively developed PostgreSQL operator in the CNCF ecosystem (a CNCF Sandbox project).
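A minimal sketch of a CloudNativePG Cluster resource, assuming a hypothetical cluster name, S3 bucket, and credentials Secret:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3               # one primary, two replicas; failover is automatic
  storage:
    size: 20Gi
  backup:
    barmanObjectStore:       # continuous WAL archiving to object storage
      destinationPath: s3://my-backups/app-db   # hypothetical bucket
      s3Credentials:
        accessKeyId:
          name: backup-creds # hypothetical Secret holding S3 credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
```

The operator derives the StatefulSet-like pod topology, replication configuration, and services (read-write and read-only endpoints) from this single resource.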
Zalando Postgres Operator uses Patroni under the hood for HA management. It is mature and battle-tested at Zalando’s scale. The postgresql CRD defines clusters with replicas, volume sizes, and PostgreSQL parameters. It supports logical backups, connection pooling via PgBouncer sidecars, and clone-from-S3 for restoring clusters.
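For comparison, a hedged sketch of the Zalando operator’s `postgresql` resource (the team name and sizes are hypothetical; the operator expects the cluster name to be prefixed with the team ID):

```yaml
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: acid-app-db          # "<teamId>-<name>" naming convention
spec:
  teamId: acid
  numberOfInstances: 2       # primary + one Patroni-managed replica
  volume:
    size: 20Gi
  postgresql:
    version: "16"
  enableConnectionPooler: true   # deploys a PgBouncer pooler in front of the cluster
```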
Percona Operators cover PostgreSQL, MySQL, and MongoDB with a consistent operational model across all three. If your platform runs multiple database engines, Percona’s unified approach to backups, monitoring, and TLS simplifies operations.
CockroachDB Operator is built by Cockroach Labs specifically for their database. It handles node provisioning, rolling upgrades, and decommissioning. Since CockroachDB’s distributed architecture is more complex than single-primary databases, the operator handles topology-aware scheduling and region configuration.
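A minimal sketch of the official operator’s `CrdbCluster` resource, under the assumption of a three-node cluster with default scheduling (names and sizes are hypothetical):

```yaml
apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
  name: cockroachdb
spec:
  nodes: 3                   # Raft consensus needs a majority; 3 is the practical minimum
  dataStore:
    pvc:
      spec:                  # a standard PersistentVolumeClaim spec per node
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 60Gi
```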
The tradeoff with any operator is that you gain automation but add a dependency layer. If the operator has a bug in its failover logic or upgrade procedure, debugging requires understanding both the database and the operator’s state machine.
## Which Database Engine: PostgreSQL vs MySQL vs CockroachDB
### PostgreSQL
PostgreSQL is the most versatile open-source relational database. Its feature set spans advanced SQL (CTEs, window functions, lateral joins), native JSON/JSONB support with indexing, a rich extension ecosystem (PostGIS for geospatial, pgvector for vector search, pg_partman for partitioning), and excellent standards compliance.
Choose PostgreSQL when:
- Your application has complex data models with relationships, constraints, and advanced query patterns
- You need JSON document storage alongside relational data (JSONB with GIN indexes)
- Extensions matter – PostGIS for location data, pgvector for AI/ML embedding search, TimescaleDB for time-series
- You want the broadest community and ecosystem support
- Your workload is a mix of reads and writes with complex queries
- You plan to use database features like row-level security, generated columns, or partitioning
### MySQL
MySQL prioritizes simplicity and read performance. Its InnoDB engine provides ACID compliance, and its replication model is straightforward to set up and operate. MySQL dominates in web application stacks, particularly PHP/WordPress ecosystems.
Choose MySQL when:
- Your workload is read-heavy with relatively simple queries (web applications, content management)
- You have existing MySQL expertise on the team
- Replication simplicity matters – MySQL’s binary log replication is well-understood and easy to monitor
- You are running a WordPress, Drupal, or other PHP-based application that assumes MySQL
- You want simpler operations – MySQL’s configuration surface area is smaller than PostgreSQL’s
- Your application uses a straightforward relational model without needing advanced SQL features
### CockroachDB
CockroachDB is a distributed SQL database designed for horizontal scaling and multi-region deployments. It provides strong consistency (serializable isolation) across a distributed cluster, automatic sharding, and survival of node/zone/region failures without manual intervention.
Choose CockroachDB when:
- You need horizontal write scaling beyond what a single PostgreSQL or MySQL primary can handle
- Multi-region active-active deployment is a requirement (data served from the closest region, writes replicated across regions)
- You need strong consistency guarantees across a distributed system (not eventual consistency)
- Survivability requirements demand automatic handling of node, zone, or region failures
- Your application can use PostgreSQL-compatible SQL (CockroachDB supports most PostgreSQL syntax and wire protocol)
Tradeoffs: CockroachDB requires a minimum of 3 nodes for production (consensus requires a majority). Resource requirements are significantly higher than single-node databases – each node needs substantial CPU and memory. Query performance for single-node workloads is slower than PostgreSQL because every operation has distributed overhead (consensus, range lookups). Not all PostgreSQL features are supported – some extensions, stored procedures, and data types are missing or behave differently.
## Engine Comparison Table
| Criteria | PostgreSQL | MySQL | CockroachDB |
|---|---|---|---|
| ACID compliance | Full | Full (InnoDB) | Full (serializable) |
| Horizontal write scaling | No (read replicas only) | No (read replicas only) | Yes (automatic sharding) |
| Replication model | Streaming replication (async/sync) | Binary log replication | Raft consensus (built-in) |
| JSON support | Excellent (JSONB, indexable) | Basic (JSON type, limited indexing) | Good (JSONB, PostgreSQL-compatible) |
| Extension ecosystem | Massive (PostGIS, pgvector, 100+) | Limited | None (built-in features only) |
| Community size | Very large | Very large | Growing |
| K8s operator maturity | High (CloudNativePG, Zalando) | Medium (Percona, Oracle) | High (official operator) |
| Managed service options | RDS, Cloud SQL, Azure, Neon, Supabase | RDS, Cloud SQL, Azure, PlanetScale | Cockroach Cloud |
| Minimum production nodes | 1 primary + 1 replica | 1 primary + 1 replica | 3 nodes |
| Resource efficiency | High | High | Lower (distributed overhead) |
| Wire protocol | PostgreSQL | MySQL | PostgreSQL-compatible |
## Sizing Guidance
Small workloads (under 100 GB data, under 1000 queries per second): PostgreSQL or MySQL on a single node with a streaming replica. At this scale, CockroachDB’s distributed overhead is pure waste. A managed service like RDS with a db.t3.medium or equivalent is often the most cost-effective choice. If running on Kubernetes, CloudNativePG or Percona operator with a 2-node cluster (primary + replica) handles this easily.
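For the small in-cluster case, such a 2-node cluster can be expressed declaratively. A hedged CloudNativePG sketch, with resource requests chosen to roughly match a db.t3.medium (2 vCPU, 4 GiB) — all names and sizes hypothetical:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: small-app-db
spec:
  instances: 2              # primary + one streaming replica
  storage:
    size: 50Gi
  resources:                # keep requests == limits for predictable DB performance
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi
```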
Medium workloads (100 GB to 1 TB, 1000-10000 QPS): PostgreSQL with connection pooling (PgBouncer), read replicas for read scaling, and partitioning for large tables. MySQL performs well here for read-heavy patterns. CockroachDB starts to make sense if write scaling is the bottleneck and read replicas alone are not sufficient. Managed services at this tier are cost-effective because operational complexity increases.
Large workloads (over 1 TB, over 10000 QPS): If writes are the bottleneck, CockroachDB’s horizontal scaling becomes its primary advantage. If reads are the bottleneck, PostgreSQL with many read replicas and aggressive caching (Redis) often works. At this scale, the managed service premium is dwarfed by the complexity savings – unless you have a dedicated database team.
## Common Mistakes
Running CockroachDB for a small application. A 3-node CockroachDB cluster for a simple CRUD application consumes 3x the resources of a PostgreSQL primary-replica pair and adds distributed query latency for no benefit. CockroachDB’s value proposition starts at the scale where single-node databases cannot handle the write throughput.
Running PostgreSQL when MySQL’s simpler replication would suffice. If your application does straightforward CRUD with no complex queries, no JSON, and no need for extensions, MySQL’s simpler operational model and replication are advantages, not limitations.
Choosing self-managed databases on Kubernetes without the team to support them. An operator automates the common case. The uncommon case – corrupted replication, split-brain recovery, performance degradation from storage issues – requires database expertise that the operator does not provide. If your team cannot debug PostgreSQL replication lag or a CockroachDB range imbalance, a managed service is the safer choice.
Ignoring connection pooling. Both PostgreSQL and MySQL have connection limits. Kubernetes workloads with many pods (especially during rolling updates with surge) can exhaust connections. PgBouncer for PostgreSQL and ProxySQL for MySQL should be part of the architecture from the start, not added after connection limit errors appear in production.
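If you run CloudNativePG, pooling can be declared alongside the cluster rather than bolted on later. A hedged sketch of its `Pooler` resource (the cluster name and parameter values are hypothetical starting points, not tuned recommendations):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler
spec:
  cluster:
    name: app-db            # hypothetical CloudNativePG Cluster to pool for
  instances: 2              # pooler replicas survive pod churn during rolling updates
  type: rw                  # pool read-write traffic toward the primary
  pgbouncer:
    poolMode: transaction   # connections are shared per transaction, not per session
    parameters:
      max_client_conn: "1000"   # many app pods can connect...
      default_pool_size: "20"   # ...while PostgreSQL sees only a small, stable set
```

The point of the sketch is the ratio: a thousand client connections collapse onto a few dozen server connections, which is what keeps surge deployments from exhausting `max_connections`.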