Multi-Cloud vs Single-Cloud Strategy#

Multi-cloud is one of the most oversold strategies in infrastructure. Vendors, consultants, and conference speakers promote it as the default approach, but the reality is that most organizations are better served by a single cloud provider used well. This framework helps you determine whether multi-cloud is actually worth the cost for your situation.

The Default Answer Is Single-Cloud#

Start with single-cloud unless you have a specific, concrete reason to go multi-cloud. Here is why.

Operational complexity multiplies. Every cloud provider has different networking models, IAM systems, storage APIs, monitoring integrations, and failure modes. Running production workloads across two clouds means your team needs deep expertise in both. This is not merely twice the work – it is often three to four times the work because of the integration layer between them.

Cost optimization requires specialization. Reserved instances, savings plans, spot pricing, committed use discounts – each provider has different mechanisms. Getting the best pricing requires dedicated effort per provider. Splitting workloads across two clouds means you cannot commit to volume discounts on either, often resulting in higher total cost than a single-provider strategy.
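The commitment-splitting effect can be made concrete with back-of-envelope arithmetic. All figures below (spend, coverage, discount rates) are hypothetical illustrations, not published pricing:

```python
# Illustrative sketch: committed-use discounts reward concentrating spend
# with one provider. All numbers here are hypothetical assumptions.
def effective_cost(on_demand_spend: float, commit_fraction: float,
                   commit_discount: float) -> float:
    """Blended monthly cost when commit_fraction of spend is covered by a
    commitment earning commit_discount off on-demand rates."""
    committed = on_demand_spend * commit_fraction * (1 - commit_discount)
    uncommitted = on_demand_spend * (1 - commit_fraction)
    return committed + uncommitted

# Single cloud: $100k/month, 80% covered by a commitment at 40% off.
single = effective_cost(100_000, 0.80, 0.40)
# Split across two clouds: $50k/month each, but the lower per-provider
# volume supports only 50% coverage at a 30% discount.
multi = 2 * effective_cost(50_000, 0.50, 0.30)
print(single, multi)  # the split strategy costs more for identical usage
```

The asymmetry compounds over time: commitment tiers and negotiated enterprise discounts both scale with per-provider volume.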

Managed services are the value. The real value of cloud providers is not VMs – it is managed databases, queuing systems, ML platforms, and operational tooling. Going multi-cloud means either using the lowest common denominator (self-managed everything) or maintaining two implementations of every service dependency.

When Multi-Cloud Is Actually Worth It#

There are legitimate reasons to run workloads across multiple clouds. Each of these is a concrete business requirement, not an abstract architectural preference.

Regulatory or Data Sovereignty Requirements#

Some industries and geographies require data to reside in specific regions where your primary cloud provider does not operate, or where a specific provider has certification that others lack. Government workloads may require GovCloud (AWS) while commercial workloads run on standard Azure. Healthcare data in certain countries may need to stay on a provider with specific compliance certifications.

Decision test: Do you have a written regulatory requirement that a single cloud provider cannot satisfy? If the answer is “no” or “we are not sure,” this is not your reason.

Acquisition or Merger Integration#

When companies merge, they often bring different cloud providers. Migrating everything to one provider is a multi-year project. Running both is a pragmatic choice during the transition, but it should be treated as a temporary state with a migration plan, not a permanent architecture.

Specific Best-of-Breed Services#

Some providers genuinely lead in specific areas. GCP has superior ML/AI tooling (Vertex AI, BigQuery ML). AWS has the broadest service catalog and best spot market. Azure has the tightest Microsoft ecosystem integration (Active Directory, Office 365). If your core product depends on a capability that is clearly superior on a different provider than your primary one, running that specific workload on the other provider can make sense.

Decision test: Is the capability difference significant enough to justify the operational overhead? A 10% improvement in ML training speed probably is not. Native Active Directory integration for 50,000 employees probably is.

Disaster Recovery Across Provider Failures#

True multi-cloud DR – where your application fails over from one cloud to another – is extremely expensive and rarely tested well enough to actually work. Full-provider outages (not regional, but the entire provider) are exceedingly rare; even the most disruptive AWS incidents on record have been regional or service-scoped, not global. You are far more likely to experience a regional or AZ failure, which is handled by multi-region architecture within a single provider.

Decision test: Have you already implemented and tested multi-region failover within your primary provider? If not, do that first. It delivers 95% of the DR benefit at 20% of the multi-cloud cost.

Vendor Lock-In Assessment#

Vendor lock-in is the most commonly cited reason for multi-cloud, and the most commonly misunderstood. Assess lock-in by categorizing your dependencies.

Low Lock-In (Easy to Migrate)#

  • Compute (VMs, containers): VMs and container workloads move between providers with moderate effort. Kubernetes workloads are particularly portable.
  • Block and object storage: S3 API compatibility is widespread. Object storage is straightforward to migrate (time-consuming, not complex).
  • DNS and CDN: Standard protocols, easy to switch.
  • Container registries: OCI images are provider-agnostic.

Medium Lock-In (Significant Migration Effort)#

  • Managed Kubernetes (EKS, GKE, AKS): The Kubernetes API is portable, but networking (VPC CNI, load balancers), storage classes, IAM integration (IRSA, Workload Identity), and ingress controllers are provider-specific.
  • Managed databases (RDS, Cloud SQL): The database engine is portable (PostgreSQL is PostgreSQL), but connection handling, backup systems, read replica configuration, and extensions differ.
  • Networking (VPCs, firewalls, load balancers): Concepts are similar but implementations and APIs differ completely. Terraform helps but does not eliminate the migration effort.

High Lock-In (Difficult and Expensive to Migrate)#

  • Serverless compute (Lambda, Cloud Functions): Tight integration with event sources, IAM, and provider-specific APIs. Rewriting is required, not just reconfiguration.
  • Proprietary databases (DynamoDB, Cosmos DB, Spanner): Data model and query patterns are provider-specific. Migration means redesigning your data layer.
  • ML/AI services (SageMaker, Vertex AI): Training pipelines, model registries, and inference endpoints are deeply integrated.
  • Event and messaging services (EventBridge, Pub/Sub): Event schemas, routing rules, and integrations are non-portable.

The Lock-In Decision#

Lock-in is acceptable when the managed service saves significant engineering time. Running self-managed Kafka to avoid Kinesis lock-in typically costs more in engineer hours than any future migration would. The question is not “will we ever need to migrate?” but “does the probability and cost of future migration exceed the current savings from using the managed service?”

For most teams: use managed services freely. If you later need to migrate, the migration cost is a known, bounded project. The ongoing cost of self-managing everything to avoid hypothetical lock-in is unbounded.
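That probability-weighted comparison can be sketched directly. The dollar figures and the 20% migration probability below are hypothetical placeholders for a team's own estimates:

```python
# Hedged sketch: compare the ongoing cost of self-managing a service
# against the expected (probability-weighted) cost of a future migration.
# Every number here is an assumed input, not a benchmark.
def expected_migration_cost(migration_cost: float, probability: float) -> float:
    return migration_cost * probability

self_managed_annual = 300_000   # e.g. two engineers running Kafka
managed_annual = 120_000        # e.g. the managed-service bill
years = 3

savings = (self_managed_annual - managed_annual) * years
migration_risk = expected_migration_cost(400_000, 0.20)

use_managed = savings > migration_risk
print(use_managed)  # True under these assumptions: take the managed service
```

The structure of the comparison matters more than the inputs: the self-management cost recurs every year, while the migration cost is paid at most once.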

Kubernetes as an Abstraction Layer#

Kubernetes is often pitched as the multi-cloud abstraction layer. This is partially true.

What Kubernetes abstracts well:

  • Container scheduling and orchestration
  • Service discovery and internal networking
  • Configuration management (ConfigMaps, Secrets)
  • Deployment strategies (rolling updates, canaries)
  • Horizontal pod autoscaling

What Kubernetes does not abstract:

  • External load balancers and ingress (provider-specific annotations and controllers)
  • Persistent storage (storage classes, CSI drivers differ per provider)
  • IAM and access control (IRSA vs GKE Workload Identity vs Azure AD Workload Identity)
  • Networking (VPC CNI, Calico, Azure CNI all behave differently)
  • Cluster provisioning and management (EKS, GKE, AKS APIs are completely different)

If you use Kubernetes, you can move your application workloads between providers with moderate effort. You cannot move your infrastructure-as-code, networking, or operational tooling. Plan for rewriting 30-50% of your infrastructure code during a migration, even with Kubernetes.

Making Kubernetes More Portable#

If portability is genuinely important:

  • Use Terraform or Crossplane for cluster provisioning with provider-specific modules. You will rewrite the provider modules, but the pattern stays the same.
  • Avoid provider-specific annotations on Kubernetes resources when possible. Use the Gateway API instead of provider-specific ingress annotations.
  • Abstract storage behind a CSI driver interface. Use generic storage class names (fast-ssd, standard) and map them to provider-specific classes.
  • Use external-secrets-operator instead of provider-specific secret injection. It supports AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, and HashiCorp Vault through a single Kubernetes API.
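As one concrete illustration of the external-secrets pattern, the manifest below sketches an `ExternalSecret` that syncs a value into a plain Kubernetes `Secret`; the store name and secret path are hypothetical, and only the referenced `SecretStore` changes when you move providers:

```yaml
# Sketch: the application consumes an ordinary Kubernetes Secret; the
# provider-specific part is isolated in the referenced SecretStore.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # swap for an Azure Key Vault or GCP store
    kind: ClusterSecretStore
  target:
    name: db-credentials        # the Kubernetes Secret the app mounts
  data:
    - secretKey: password
      remoteRef:
        key: prod/db/password   # hypothetical path in the backing store
```

Workload manifests reference `db-credentials` as a normal Secret, so they need no changes during a provider migration.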

Cost Analysis Framework#

When evaluating multi-cloud costs, include these often-overlooked factors:

| Cost category | Single-cloud | Multi-cloud |
| --- | --- | --- |
| Compute and storage | Optimized with commitments | Higher (split commitments) |
| Data transfer | Intra-cloud (cheap) | Cross-cloud egress ($$$) |
| Engineering headcount | One platform team | 1.5-2x platform team |
| Training and certification | One provider | Multiple providers |
| Tooling and automation | One set | Duplicated per provider |
| Incident response | One runbook set | Multiple runbook sets |
| Compliance auditing | One provider’s tools | Multiple audit frameworks |

Cross-cloud data transfer is the hidden killer. Cloud providers charge $0.01-0.09/GB for egress. If services on AWS need to talk to services on GCP, you are paying egress on every request. For data-intensive workloads, this alone can make multi-cloud uneconomical.
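The egress math is worth running before committing to any cross-cloud call path. The rate below sits at the top of the $0.01-0.09/GB range quoted above; actual pricing varies by provider, region, and interconnect, so treat it as an assumption:

```python
# Back-of-envelope cross-cloud egress cost. The per-GB rate is an assumed
# value from the range cited above; check current provider pricing.
def monthly_egress_cost(gb_per_day: float, rate_per_gb: float = 0.09) -> float:
    return gb_per_day * 30 * rate_per_gb

# A service streaming 2 TB/day from one cloud to another at $0.09/GB:
print(round(monthly_egress_cost(2_000), 2))  # per month, per direction
```

Note the cost applies per direction: a chatty request/response pattern between clouds pays egress on both sides.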

Decision Flowchart#

Follow this sequence to determine your strategy:

  1. Do you have a regulatory requirement that mandates multiple providers? If yes, multi-cloud is required. Focus on minimizing the cross-cloud surface area.

  2. Are you integrating acquired companies on different clouds? If yes, run multi-cloud temporarily with a migration plan. Set a timeline (12-24 months) to consolidate.

  3. Do you need a specific best-of-breed service on another provider? If yes, run that specific workload on the other provider. Keep everything else on your primary provider. This is not “multi-cloud strategy” – it is pragmatic use of a specific service.

  4. Are you concerned about vendor lock-in? Assess your actual lock-in level using the categories above. If most of your dependencies are low or medium lock-in, the migration cost is manageable and does not justify the ongoing multi-cloud overhead.

  5. Are you concerned about provider-wide outages? Implement multi-region within your primary provider first. If you have done that and still need more resilience, multi-cloud DR may be justified – but only if you are willing to invest in regular cross-cloud failover testing.

  6. None of the above? Single-cloud. Invest the engineering effort you would have spent on multi-cloud into better observability, security, and cost optimization on your primary provider.
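The six questions above can be sketched as a single decision function. The boolean inputs are answers the reader supplies for their own organization; the returned strings paraphrase the recommendations in this section:

```python
# The decision flowchart above, expressed as a function. Inputs are
# booleans answered per organization; outputs paraphrase the section.
def cloud_strategy(regulatory_mandate: bool,
                   merger_integration: bool,
                   best_of_breed_need: bool,
                   lockin_concern: bool,
                   provider_outage_concern: bool,
                   multi_region_done: bool = False) -> str:
    if regulatory_mandate:
        return "multi-cloud (required; minimize cross-cloud surface area)"
    if merger_integration:
        return "multi-cloud (temporary; 12-24 month consolidation plan)"
    if best_of_breed_need:
        return "single-cloud primary + one specific workload elsewhere"
    if lockin_concern:
        return "single-cloud (migration cost is bounded; assess lock-in)"
    if provider_outage_concern and not multi_region_done:
        return "single-cloud (implement multi-region first)"
    if provider_outage_concern:
        return "consider multi-cloud DR (only with regular failover testing)"
    return "single-cloud"

print(cloud_strategy(False, False, False, False, False))  # "single-cloud"
```

The ordering is deliberate: hard requirements (regulation, mergers) are checked before preferences (lock-in, outage fears), mirroring the sequence above.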

Practical Recommendation for Agents#

When an agent is evaluating infrastructure architecture or responding to multi-cloud questions:

  • Default to recommending single-cloud unless a specific requirement from the list above is identified.
  • If multi-cloud is required, recommend minimizing the cross-cloud surface area: keep services that communicate frequently on the same provider.
  • Always flag cross-cloud data transfer costs in architecture reviews.
  • Recommend Kubernetes for workload portability, but be explicit that infrastructure code will still need rework during any migration.
  • Prefer Terraform with provider-specific modules over custom abstraction layers. Abstractions that try to hide cloud differences usually leak and add complexity without real portability.