Why the Operating Model Matters#

The platform team’s operating model determines whether the platform becomes a force multiplier or a bottleneck. A ticket-driven, gatekeeper-oriented team produces a platform developers route around. A product-oriented, self-service team produces a platform developers adopt voluntarily. Organizational structure shapes developer experience more than technology choices.

Team Topologies and Interaction Modes#

The Team Topologies framework (Skelton & Pais) defines four team types relevant to platform engineering:

Stream-aligned teams build and operate business capabilities. They are the platform team’s customers. Their cognitive load determines what the platform must abstract. If they spend 30% of their time on infrastructure plumbing, the platform is not doing its job.

Platform teams provide self-service capabilities that reduce cognitive load. The key constraint: platform teams enable, not block. If a stream-aligned team needs platform team involvement to ship, the interaction model is wrong.

Enabling teams temporarily help stream-aligned teams adopt new capabilities, then step back once the capability is self-service.

Complicated-subsystem teams own components requiring deep specialist knowledge — ML inference platforms, custom database engines, compliance frameworks — and expose capabilities through platform-managed interfaces.

Three interaction modes:

  • X-as-a-Service: Primary mode. Platform provides self-service via APIs and templates. No coordination needed.
  • Collaboration: Temporary. Platform and stream-aligned teams co-design new abstractions.
  • Facilitating: Platform teaches teams to use existing capabilities, then disengages.

Thin Platform vs Thick Platform#

A thin platform wraps existing tools (Kubernetes, Terraform, CI/CD) with opinionated defaults and golden paths. The platform team curates and integrates — it does not build infrastructure software. Team size: 3-8 engineers. Capabilities ship in weeks. Custom code is limited to integration glue and developer-facing UIs.

A thick platform builds custom control planes, proprietary deployment systems, or bespoke provisioning engines. Team size: 15-50+ engineers across sub-teams. Multi-quarter roadmap. Appropriate only at large scale (thousands of developers) where off-the-shelf tools genuinely cannot meet requirements.

Most organizations build thick when thin would suffice. A 200-person engineering org does not need a custom deployment platform. Kubernetes + ArgoCD + Backstage with a 5-person platform team covers 95% of needs.

API-First Platform Design#

Every platform capability should be exposed as an API before it gets a UI. This means:

  • Infrastructure provisioning is a Crossplane Claim or Terraform module — not a form that triggers a human workflow.
  • Service registration is a YAML file in a repository — not a manual catalog entry.
  • Permissions and access are declarative policies — not ticket-driven RBAC changes.

The API-first approach ensures that every platform capability is automatable. Backstage provides a UI layer, but the underlying operations must work without it. If the only way to create a database is through a Backstage form, you have built a portal, not a platform.

Concrete example of a platform API surface:

Platform API Surface
---------------------
Compute:    Kubernetes namespace provisioning via GitOps (Namespace + ResourceQuota + NetworkPolicy)
Storage:    Crossplane Claims for RDS, ElastiCache, S3
Networking: Ingress definitions via Helm values, DNS via ExternalDNS
CI/CD:      Reusable GitHub Actions workflows, ArgoCD ApplicationSets
Secrets:    ExternalSecrets operator, Vault policies via Terraform
Observability: Pre-configured Grafana dashboards via Jsonnet, alerting rules via PrometheusRule CRDs

Each of these is a declarative API. Developers interact by committing YAML or invoking a template. No tickets, no approvals for standard operations.

Platform as a Product#

Treating the platform as an internal product means:

Product management: Someone owns the roadmap and prioritizes based on developer feedback. Not always a formal PM — a senior platform engineer can fill this role — but someone must own it.

User research: Regular conversations with stream-aligned teams. Shadow developers during onboarding. Watch where they struggle.

Roadmap transparency: Publish what is being worked on and what is coming. A GitHub project board showing the platform roadmap reduces “when will you support X?” questions by 80%.

Deprecation policy: Give teams migration timelines with concrete support. “We are deprecating Jenkins. ArgoCD is the replacement. Migration guide is here. Jenkins decommissions March 1. We will pair with any team needing help during February.” Abrupt shutdowns destroy trust.

Internal SLOs for Platform Services#

Platform teams should publish SLOs for their services, just like production SLOs for customer-facing systems:

Platform Service SLO Measurement
CI pipeline (GitHub Actions) 95% of builds complete in < 10 min P95 build duration over 28 days
ArgoCD sync 99.5% of syncs succeed on first attempt Sync success rate over 7 days
New service scaffolding Available 99.9% of the time Backstage scaffolder uptime
Namespace provisioning Provisioned within 5 minutes of merge Time from PR merge to namespace ready
Secret injection 99.9% availability ExternalSecrets sync success rate

SLOs hold the platform team accountable and give stream-aligned teams confidence. “Will my deploy work?” is answered by the SLO dashboard, not by pinging the platform team on Slack. Measure with the same tools you use for production: Prometheus, Grafana, burn rate alerting.

Team Sizing#

Rules of thumb for platform team sizing:

  • 50-150 developers: 3-5 platform engineers. Focus on golden paths, CI/CD, and basic self-service. Thin platform only.
  • 150-500 developers: 5-12 platform engineers. Add dedicated work on observability platform, security guardrails, and developer portal. Still mostly thin.
  • 500-2000 developers: 12-30 platform engineers, split into sub-teams (compute, data, developer experience, security). Some custom tooling warranted.
  • 2000+ developers: 30+ platform engineers across multiple teams. Thick platform investments may be justified.

The signal your team is too small: stream-aligned teams build their own platform capabilities because yours cannot keep up. The signal it is too large: you are building custom solutions for problems open-source tools solve well.