Multi-Cluster Kubernetes#
A single Kubernetes cluster is a single blast radius. A bad deployment, a control plane failure, a misconfigured admission webhook – any of these can take down everything. Multi-cluster is not about complexity for its own sake. It is about isolation, resilience, and operating workloads that span regions, regulations, or teams.
Why Multi-Cluster#
Blast radius isolation. A cluster-wide failure (etcd corruption, bad admission webhook, API server overload) only affects one cluster. Critical workloads in another cluster are untouched.
Regulatory and geographic requirements. Data residency laws require workloads to run in specific regions. A cluster per region keeps data local while still sharing a management plane.
Team autonomy. Large organizations give teams their own clusters to avoid noisy-neighbor problems, RBAC complexity, and resource contention. Each team has full control over their cluster’s lifecycle.
Hybrid cloud. Some workloads must run on-premise (data gravity, hardware requirements) while others run in the cloud. Multi-cluster bridges both environments.
Architecture Patterns#
Hub and Spoke#
A management cluster (hub) controls workload clusters (spokes). The hub runs the management plane – ArgoCD, monitoring, policy engines. The spokes run application workloads.
          +------------------+
          |   Hub Cluster    |
          |  ArgoCD, Prom,   |
          |  Policy Engine   |
          +------------------+
            /      |       \
+----------+  +----------+  +----------+
| Spoke A  |  | Spoke B  |  | Spoke C  |
| us-east  |  | eu-west  |  | ap-south |
+----------+  +----------+  +----------+
The hub is the single pane of glass. It deploys applications to spokes, collects metrics from spokes, and enforces policy on spokes. If the hub goes down, the spokes continue running their workloads – they just cannot receive new deployments.
Tools for hub and spoke: ArgoCD (ApplicationSets with cluster generator), Rancher (full cluster lifecycle management), or Cluster API (cluster provisioning as CRDs).
Active-Active#
The same workloads deploy to multiple clusters. A global load balancer distributes traffic across all of them. If one cluster fails, the others absorb the traffic.
     +-------------------+
     |  Global LB / DNS  |
     +---------+---------+
        /      |      \
+--------+ +--------+ +--------+
|  CL-1  | |  CL-2  | |  CL-3  |
|  App   | |  App   | |  App   |
+--------+ +--------+ +--------+
This requires: identical deployments across clusters, shared or replicated data stores, and health checks at the global load balancer level. Cloud providers offer this natively (AWS Global Accelerator, Azure Front Door, GCP Global LB).
Active-Passive#
A primary cluster handles all traffic. A standby cluster receives replicated data and can take over if the primary fails. Simpler than active-active but has a recovery time gap during failover.
Mesh (Peer-to-Peer)#
All clusters are peers. Services in any cluster can discover and communicate with services in any other cluster. No central management plane. Tools like Liqo and Admiralty enable this pattern, effectively creating a virtual cluster that spans multiple physical clusters.
Cluster API (CAPI)#
Cluster API treats clusters as Kubernetes resources. You define a cluster in YAML, apply it, and CAPI creates the infrastructure and bootstraps Kubernetes on it. Clusters become cattle, not pets.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: workload-us-east
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
    services:
      cidrBlocks: ["10.96.0.0/12"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: workload-us-east-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: workload-us-east
CAPI components:
- Bootstrap providers (kubeadm, EKS, Talos): how nodes are initialized
- Infrastructure providers (AWS, Azure, vSphere, Metal3): where nodes run
- Control plane providers: how the control plane is managed
Machine, MachineDeployment, and MachineSet CRDs mirror the Deployment/ReplicaSet/Pod hierarchy but for nodes. Rolling updates to node pools work the same way as rolling updates to pods.
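For example, a node pool for the cluster above is expressed as a MachineDeployment. This is a minimal sketch modeled on the CAPI quickstart; the referenced bootstrap and machine template names are assumed to exist alongside it:
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workload-us-east-md-0
  namespace: clusters
spec:
  clusterName: workload-us-east
  replicas: 3                     # scale the node pool like a Deployment
  selector:
    matchLabels: null             # defaulted by the CAPI webhook
  template:
    spec:
      clusterName: workload-us-east
      version: v1.29.0            # Kubernetes version for these nodes
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: workload-us-east-md-0   # assumed template
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: workload-us-east-md-0     # assumed template
Bumping the version or the referenced machine template triggers a rolling replacement of nodes, just as changing a pod template rolls a Deployment.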
# Install Cluster API with the AWS provider
clusterctl init --infrastructure aws
# Create a workload cluster
kubectl apply -f workload-cluster.yaml
# Get the kubeconfig for the new cluster
clusterctl get kubeconfig workload-us-east > workload-us-east.kubeconfig
Multi-Cluster Networking#
The fundamental challenge: pods in Cluster A need to reach pods in Cluster B.
Flat Networking#
Ensure pod CIDRs do not overlap across clusters, then connect the underlying networks:
- VPC Peering (AWS, GCP): direct network link between VPCs. Low latency, no encryption overhead.
- Transit Gateway (AWS): hub-and-spoke VPC connectivity. Simpler than full mesh peering.
- VPN tunnels: for hybrid or cross-cloud connectivity.
Pods can address each other by IP, but there is no service discovery. You need another layer for that.
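Non-overlap is easiest to guarantee if every cluster gets its own slice of a planned address space at bootstrap time. A sketch using kubeadm ClusterConfiguration (the ranges are illustrative, not a recommendation):
# cluster-a bootstrap config
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 10.16.0.0/16
  serviceSubnet: 10.96.0.0/16
---
# cluster-b bootstrap config (applied to a different cluster, shown for contrast)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 10.17.0.0/16
  serviceSubnet: 10.97.0.0/16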
Service Mesh Federation#
Istio and Linkerd both support multi-cluster, extending service discovery and mTLS across cluster boundaries.
Istio multi-cluster has two models:
# Primary-Remote: one cluster runs the Istio control plane, others connect to it
istioctl install --set values.global.meshID=production \
  --set values.global.multiCluster.clusterName=cluster-1 \
  --set values.global.network=network-1
# Multi-Primary: each cluster runs its own control plane, synchronized via shared CA
Both models give you transparent cross-cluster service routing. A service in Cluster A can call a service in Cluster B using the same <service>.<namespace>.svc.cluster.local DNS name.
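In both models, a control plane also needs read access to the other clusters' API servers so it can discover their endpoints. Istio does this with remote secrets; a sketch with assumed kubeconfig context names:
# Let cluster-2's control plane watch services and endpoints in cluster-1
istioctl create-remote-secret --context=cluster-1 --name=cluster-1 \
  | kubectl apply -f - --context=cluster-2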
Submariner#
Submariner is a CNCF project that provides Layer 3 connectivity across clusters. It creates encrypted tunnels between clusters and extends service discovery.
# Install the Submariner broker on the hub cluster
subctl deploy-broker --kubeconfig hub.kubeconfig
# Join workload clusters to the broker
subctl join --kubeconfig cluster-a.kubeconfig broker-info.subm
subctl join --kubeconfig cluster-b.kubeconfig broker-info.subm
After joining, pods in Cluster A can reach services in Cluster B via <service>.<namespace>.svc.clusterset.local.
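A service only appears on the clusterset domain once it has been exported. With Submariner's service discovery that is a single command (namespace and service name reuse the example from the next section):
# In Cluster B: publish the service to the rest of the cluster set
subctl export service --namespace payments payments-api
Under the hood this creates a ServiceExport, the same Multi-Cluster Services API object described next.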
Kubernetes MCS API#
The Multi-Cluster Services API is the Kubernetes-native approach to cross-cluster service discovery:
# In the exporting cluster: make a service available to other clusters
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: payments-api
  namespace: payments
---
# In the importing cluster: consume the exported service
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: payments-api
  namespace: payments
spec:
  type: ClusterSetIP
  ports:
    - port: 8080
      protocol: TCP
The imported service is reachable at payments-api.payments.svc.clusterset.local.
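A quick way to confirm the import is to call the clusterset name from a throwaway pod in the consuming cluster (the /health path is a placeholder):
# One-off pod in the importing cluster, calling the imported service
kubectl run mcs-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sS http://payments-api.payments.svc.clusterset.local:8080/health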
Configuration Management with GitOps#
ArgoCD ApplicationSets#
ApplicationSets generate ArgoCD Applications dynamically – one per cluster, one per environment, or any combination.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production
  template:
    metadata:
      name: 'platform-{{name}}'
    spec:
      project: platform
      source:
        repoURL: https://github.com/org/platform-config
        targetRevision: main
        path: 'clusters/{{metadata.labels.region}}/platform'
      destination:
        server: '{{server}}'
        namespace: platform
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
This generates an Application for every cluster labeled environment: production, pulling configuration from a path based on the cluster’s region label. Add a new cluster with the right label, and it gets the platform services automatically.
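“Adding a cluster with the right label” means registering it with ArgoCD as a cluster Secret, which the cluster generator then matches. A sketch of the declarative form, with the server URL and credentials as placeholders:
apiVersion: v1
kind: Secret
metadata:
  name: spoke-us-east
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # marks this Secret as an ArgoCD cluster
    environment: production                   # matched by the cluster generator above
    region: us-east                           # consumed by the template's path
type: Opaque
stringData:
  name: spoke-us-east
  server: https://spoke-us-east.example.com:6443   # placeholder API server URL
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": { "caData": "<base64-encoded-ca>" }
    }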
Flux Multi-Cluster#
Flux uses Kustomization resources to target different clusters. A common pattern is a monorepo with overlays per cluster:
clusters/
  us-east/
    kustomization.yaml   # References base + us-east patches
  eu-west/
    kustomization.yaml   # References base + eu-west patches
base/
  platform/
    cert-manager.yaml
    ingress-nginx.yaml
    monitoring.yaml
Each cluster’s Flux instance watches its own directory and applies only its configuration.
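The per-cluster wiring is a Flux Kustomization pointing at that cluster's directory. A minimal sketch for us-east, assuming the standard flux-system GitRepository created by bootstrap:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/us-east        # this cluster's overlay only
  prune: true                     # remove resources that disappear from Git
  sourceRef:
    kind: GitRepository
    name: flux-system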
Config Drift#
The silent killer of multi-cluster setups is drift. Clusters that started identical slowly diverge: someone applied a manual change, a Helm upgrade failed on one cluster but succeeded on others, or different clusters upgraded at different times.
Prevention:
- GitOps with selfHeal/prune ensures manual changes are reverted
- Policy engines (Kyverno, OPA Gatekeeper) enforce consistency across clusters (see the sketch after this list)
- Periodic audits: compare running state across clusters and alert on differences
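As an example of the policy-engine approach, a single Kyverno ClusterPolicy delivered to every cluster through GitOps enforces the same rule everywhere. This hypothetical policy requires an owner label on Deployments:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner-label
spec:
  validationFailureAction: Enforce   # reject non-compliant resources
  rules:
    - name: check-owner-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Every Deployment must carry an 'owner' label."
        pattern:
          metadata:
            labels:
              owner: "?*"            # any non-empty value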
Observability Across Clusters#
Metrics#
Each cluster runs Prometheus. Metrics are forwarded to a central store using one of:
- Remote write to Thanos, Mimir, or VictoriaMetrics
- Prometheus federation (pull model, simpler but less scalable)
Add a cluster label to all metrics so you can filter and aggregate by cluster in Grafana:
# In Prometheus config or via kube-prometheus-stack values
prometheus:
  prometheusSpec:
    externalLabels:
      cluster: us-east-prod
    remoteWrite:
      - url: https://mimir.monitoring.svc:9090/api/v1/push
In Grafana, create a cluster template variable sourced from label_values(up, cluster) so dashboards work across all clusters with a single selector.
Logging#
Forward logs from all clusters to a central system (Elasticsearch, Loki, Splunk) with a cluster label. Use Fluentd, Fluent Bit, or the OpenTelemetry Collector with a k8s.cluster.name attribute on every log line.
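With Fluent Bit, for instance, a record_modifier filter can stamp the cluster name onto every record before it ships (classic-mode config; the cluster name is assumed):
[FILTER]
    Name    record_modifier
    Match   *
    Record  k8s.cluster.name us-east-prod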
Tracing#
Distributed traces that span clusters need propagation context (W3C TraceContext headers) to flow across cluster boundaries. The trace collector in each cluster forwards to a central backend (Jaeger, Tempo) with cluster attribution.
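A per-cluster OpenTelemetry Collector can add that cluster attribution with a resource processor before exporting; a sketch with an assumed central Tempo endpoint:
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  resource:
    attributes:
      - key: k8s.cluster.name
        value: us-east-prod
        action: upsert            # tag every span with the originating cluster
exporters:
  otlp:
    endpoint: tempo.monitoring.example.com:4317   # assumed central backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource]
      exporters: [otlp]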
Common Gotchas#
Certificate management: Each cluster has its own CA. Services that do mTLS across clusters need a shared root CA or mutual trust configuration. Istio handles this with a shared root CA across the mesh. Without a service mesh, you need to manage cross-cluster TLS manually.
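With Istio, the shared root is supplied by creating the documented cacerts secret in each cluster before installing the mesh; the file paths are assumed to hold intermediate certificates issued from the common root:
# Run against every cluster in the mesh, before installing Istio
kubectl create secret generic cacerts -n istio-system \
  --from-file=ca-cert.pem \
  --from-file=ca-key.pem \
  --from-file=root-cert.pem \
  --from-file=cert-chain.pem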
CIDR overlap: If two clusters use the same pod CIDR (default 10.244.0.0/16 for many CNI plugins), direct networking between them is impossible. Plan CIDRs before creating clusters.
GitOps drift: Without strict enforcement (automated sync, selfHeal, prune), clusters diverge. One emergency kubectl apply on one cluster creates a difference that compounds over time. Treat manual changes as incidents.
Cost: Multiple clusters mean multiple control planes. On managed Kubernetes, each control plane has a cost (EKS charges per cluster). Cross-cluster networking (VPC peering, transit gateway, NAT) adds data transfer costs. Budget for the operational overhead of managing more clusters.
Operational complexity: Multi-cluster is not free. You need tooling to manage cluster lifecycle, deploy across clusters, aggregate observability, and handle failover. Start with two clusters and grow incrementally. Do not build a five-cluster mesh on day one.