Designing Internal Developer Platforms

What an Internal Developer Platform Actually Is

An Internal Developer Platform (IDP) is the set of tools, workflows, and self-service capabilities that a platform team builds and maintains so application developers can ship code without filing tickets or waiting on other teams. It is not a single product. It is a curated layer on top of your existing infrastructure that abstracts complexity while preserving the ability to go deeper when needed.

The goal is not to hide everything. It is to hide the right things. A developer creating a new microservice should not need to know which Terraform module provisions the load balancer, but the team’s senior engineer should still be able to inspect and override that configuration when troubleshooting.

Core Capabilities

Every IDP addresses the same set of developer needs, though the implementation varies dramatically by organization size and maturity.

Service Catalog. A registry of every service, its ownership, dependencies, API contracts, and operational metadata. Without a catalog, developers discover services through tribal knowledge and Slack messages. The catalog answers: what services exist, who owns them, what do they depend on, and are they healthy? Backstage’s software catalog, Port, and Cortex all provide this. You can also build a minimal version with a Git repository of YAML service definitions and a simple web UI.
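As a rough illustration of the minimal Git-backed version, the sketch below models catalog entries in memory and answers the ownership and dependency questions. The `Service` fields, the `Catalog` class, and the sample entries are all hypothetical; in practice the entries would be parsed from the YAML files in the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One catalog entry; field names are illustrative, not a standard schema."""
    name: str
    owner: str
    depends_on: tuple[str, ...] = ()


class Catalog:
    def __init__(self, services):
        self._by_name = {s.name: s for s in services}

    def owner_of(self, name: str) -> str:
        return self._by_name[name].owner

    def dependents_of(self, name: str) -> list[str]:
        """Services that list `name` as a direct dependency."""
        return sorted(s.name for s in self._by_name.values() if name in s.depends_on)

    def missing_dependencies(self) -> dict[str, list[str]]:
        """Dependencies that are referenced but not registered in the catalog."""
        missing = {}
        for s in self._by_name.values():
            unknown = [d for d in s.depends_on if d not in self._by_name]
            if unknown:
                missing[s.name] = unknown
        return missing


# Hypothetical entries, standing in for parsed YAML service definitions.
catalog = Catalog([
    Service("checkout", owner="team-payments", depends_on=("orders", "billing")),
    Service("orders", owner="team-commerce", depends_on=("inventory",)),
    Service("billing", owner="team-payments"),
])
```

A check like `missing_dependencies()` is cheap to run in CI on the catalog repository, which helps keep the registry honest as services are added.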

Environment Management. Developers need environments to test in — ephemeral preview environments for pull requests, shared staging environments for integration testing, and production. The platform should make creating and destroying environments self-service. Common implementations: Kubernetes namespaces provisioned per PR via CI/CD pipelines, Terraform workspaces triggered by a chatbot, or Crossplane claims that provision entire stacks.
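One way the namespace-per-PR approach might look is sketched below: a CI job derives a namespace name from the repository and pull request number and emits a Namespace manifest. The function names and the `platform.example.com/...` label keys are assumptions made for illustration; only the `v1` Namespace shape and the 63-character name limit come from Kubernetes itself.

```python
import re


def preview_namespace_name(repo: str, pr_number: int) -> str:
    """Build a DNS-label-safe namespace name such as 'pr-42-checkout-api'."""
    slug = re.sub(r"[^a-z0-9]+", "-", repo.lower()).strip("-")
    name = f"pr-{pr_number}-{slug}"
    # Kubernetes namespace names are DNS labels: at most 63 characters.
    return name[:63].rstrip("-")


def preview_namespace_manifest(repo: str, pr_number: int) -> dict:
    """Return a Namespace manifest as a plain dict, ready to serialize."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": preview_namespace_name(repo, pr_number),
            # Hypothetical labels that a cleanup job could select on.
            "labels": {
                "platform.example.com/ephemeral": "true",
                "platform.example.com/pr": str(pr_number),
            },
        },
    }
```

Labeling ephemeral namespaces at creation time makes the destroy half of self-service simpler, since a scheduled job can find and delete namespaces whose pull request has closed.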

CI/CD Integration. The platform does not replace your CI/CD system — it standardizes how teams use it. Golden pipelines (reusable pipeline templates) ensure every service builds, tests, scans, and deploys the same way. Developers opt in to the golden pipeline and override only what they need. This is where most platform value concentrates early on.
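The opt-in-and-override idea can be sketched independently of any particular CI system. In the example below, the stage names, default values, and the choice to lock the `scan` stage are all assumptions; the point is only that team overrides are merged per stage on top of shared defaults.

```python
GOLDEN_PIPELINE = {
    "build": {"image": "builder:stable", "command": "make build"},
    "test": {"command": "make test"},
    "scan": {"severity_threshold": "critical"},
    "deploy": {"strategy": "rolling"},
}

# Stages a team may not change; which stages are locked is an assumption here.
LOCKED_STAGES = {"scan"}


def resolve_pipeline(overrides: dict) -> dict:
    """Merge team overrides into the golden pipeline, stage by stage."""
    unknown = set(overrides) - set(GOLDEN_PIPELINE)
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")
    locked = set(overrides) & LOCKED_STAGES
    if locked:
        raise ValueError(f"stages cannot be overridden: {sorted(locked)}")
    return {
        stage: {**defaults, **overrides.get(stage, {})}
        for stage, defaults in GOLDEN_PIPELINE.items()
    }
```

A team that only needs a different test command would call `resolve_pipeline({"test": {"command": "pytest -q"}})` and inherit every other stage unchanged.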

Observability. Centralized dashboards, logging, and alerting configured automatically when a service is deployed. If a developer deploys a new service and has to manually set up Grafana dashboards, Prometheus scrape targets, and PagerDuty routing, the platform is not doing its job. The platform should inject sidecar containers, configure scrape annotations, create default dashboards, and wire up alerts — all from the service catalog metadata.
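To make "from the service catalog metadata" concrete, here is a small sketch that derives scrape annotations and a default alert stub from a catalog entry. The metadata keys (`metrics_port`, `metrics_path`) and the alert structure are hypothetical. The `prometheus.io/*` annotations are a common convention rather than built-in Prometheus behavior, so they only take effect if the scrape configuration is written to honor them.

```python
def observability_config(service: dict) -> dict:
    """Derive scrape annotations and a default alert stub from catalog metadata."""
    annotations = {
        # Convention only: the Prometheus scrape config must be set up
        # to relabel on these annotations for them to have any effect.
        "prometheus.io/scrape": "true",
        "prometheus.io/port": str(service["metrics_port"]),
        "prometheus.io/path": service.get("metrics_path", "/metrics"),
    }
    alert = {
        # Hypothetical alert stub; routing uses the owner from the catalog.
        "name": f"{service['name']}-high-error-rate",
        "labels": {"team": service["owner"], "severity": "page"},
    }
    return {"annotations": annotations, "default_alerts": [alert]}
```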

Security and Compliance. Policy enforcement that runs automatically: container image scanning in the pipeline, network policies applied by namespace, secrets injected from Vault, mTLS via service mesh. Developers should not need to configure these. They should not be able to skip them, either.

Documentation and Onboarding. A new developer should go from zero to deploying a change in their first week. The platform provides scaffolded project templates, getting-started guides, and TechDocs co-located with the service code.

Build vs Buy Decision#

The market for developer platform products has grown rapidly. Backstage (open source, CNCF), Port, Cortex, OpsLevel, and Configure8 all offer some combination of service catalog, scorecards, and workflow automation. The build vs buy decision hinges on three factors.

Customization depth. If your infrastructure is standard (cloud-hosted Kubernetes, GitHub, Terraform), a commercial product can deliver value in weeks. If your infrastructure is bespoke (on-prem, custom deployment tooling, unusual compliance requirements), you will spend as much time configuring the product as you would building your own. Backstage sits in the middle: it is open source and extensible, but writing or adapting plugins requires TypeScript development effort.

Team size and skill. Building a platform requires dedicated engineers. A team of two can maintain a Backstage instance. Building custom UIs and integrations from scratch requires four or more full-time engineers. If you do not have that capacity, buy.

Integration breadth. Count the tools your platform integrates with: Git providers, CI, cloud providers, monitoring, secret managers, registries, ticketing. Commercial products offer pre-built integrations. Building each from scratch is weeks of work.

Decision heuristic:

Under 50 developers, standard stack: commercial product or managed Backstage.
50-200 developers, some custom tooling: deploy Backstage, invest in custom plugins.
200+ developers, significant custom infrastructure: build a custom platform layer, potentially with Backstage as the UI.
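The heuristic above can be written down as a small lookup. The sketch below is one possible encoding, not a rule from any product. The tier labels, the handling of the exact boundaries (50 and 200), and the decision to return `None` when team size and infrastructure point to different tiers are all assumptions, since the heuristic only covers the three matching combinations.

```python
from typing import Optional

RECOMMENDATIONS = {
    0: "commercial product or managed Backstage",
    1: "deploy Backstage, invest in custom plugins",
    2: "build a custom platform layer, potentially with Backstage as the UI",
}

# Hypothetical labels for the three infrastructure descriptions.
INFRA_TIERS = {"standard": 0, "some-custom": 1, "significant-custom": 2}


def size_tier(developers: int) -> int:
    # Boundary handling (exactly 50 or 200 developers) is an assumption.
    if developers < 50:
        return 0
    if developers <= 200:
        return 1
    return 2


def recommend(developers: int, infra: str) -> Optional[str]:
    """Return the heuristic's answer, or None when the two signals disagree."""
    tier = size_tier(developers)
    return RECOMMENDATIONS[tier] if tier == INFRA_TIERS[infra] else None
```

A `None` result is a signal that the decision needs the three factors weighed by hand rather than read off the list.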

Platform Maturity Model

Platform capabilities develop in stages. Trying to build everything at once fails. This model describes a practical progression.

Level 0 — Tribal Knowledge. No platform. Developers ask in Slack how to deploy. Runbooks live in someone’s head. Every team’s CI/CD pipeline is unique. Onboarding takes weeks.

Level 1 — Standardized Tooling. The platform team selects and maintains a standard toolchain: one CI system, one container registry, one deployment target. Documentation exists. Pipelines are still per-team but follow a recommended pattern. This level is achievable in a quarter with two engineers.

Level 2 — Golden Paths. Reusable templates for the most common workflows: create a new service, deploy to production, provision a database. Developers follow the golden path for 80% of cases. The remaining 20% still require manual work, and that is acceptable. A service catalog exists, even if it is a spreadsheet.

Level 3 — Self-Service Platform. Developers provision infrastructure, create services, and manage environments through a portal or CLI without platform team involvement. The service catalog is live and authoritative. Guardrails enforce policy automatically. The platform team shifts from doing work to maintaining the system that does the work.

Level 4: Measured and Optimized. The platform team tracks developer experience metrics (DORA metrics, onboarding time, developer satisfaction). Capabilities are added or improved based on data, not intuition. Feedback loops exist: developers report friction, and the platform team addresses it in prioritized sprints.
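Two of the DORA metrics, deployment frequency and lead time for changes, can be computed from deploy records alone. The sketch below assumes timestamps are already available as `datetime` values; the function names and the choice of a median over a trailing window are illustrative, and real data would come from the CI/CD system.

```python
from datetime import datetime, timedelta
from statistics import median


def deployment_frequency(deploy_times: list[datetime], window_days: int) -> float:
    """Average deployments per day over the window ending at the latest deploy."""
    if not deploy_times:
        return 0.0
    cutoff = max(deploy_times) - timedelta(days=window_days)
    recent = [t for t in deploy_times if t > cutoff]
    return len(recent) / window_days


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median time from commit to production deploy.

    Each pair is (committed_at, deployed_at). Raises StatisticsError if empty.
    """
    return median(deployed - committed for committed, deployed in changes)
```

Tracking the trend of these numbers per team is usually more informative than comparing absolute values across teams with very different services.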

Most organizations should target Level 2 within six months and Level 3 within eighteen months. Level 4 is an ongoing practice, not a destination.

Golden Paths

A golden path is the opinionated, supported, and well-documented way to accomplish a common task. It is not the only way — developers can diverge — but diverging means taking on the maintenance burden themselves.

Effective golden paths share these properties:

Discoverable. Developers find them without asking. They are in the service catalog, the docs site, or the CLI help output.
Complete. A golden path for creating a service includes the repo scaffold, CI pipeline, deployment config, monitoring setup, and documentation template. A partial golden path that handles CI but not deployment is not a golden path — it is a fragment.
Escapable. Developers can eject from any part of the golden path without breaking the rest.
Maintained. A golden path created eighteen months ago and never updated is a trap. Budget time for maintenance.
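The "complete" and "maintained" properties lend themselves to automated checks. The following sketch assumes each golden path template declares which components it ships and when it was last reviewed; the component identifiers and the 180-day threshold are hypothetical choices, not values from the text.

```python
from datetime import date

# Components listed above for a service-creation golden path,
# written as hypothetical identifiers.
REQUIRED_COMPONENTS = {
    "repo_scaffold",
    "ci_pipeline",
    "deployment_config",
    "monitoring_setup",
    "documentation_template",
}


def missing_components(template_components: set[str]) -> list[str]:
    """Return the required components a template does not provide."""
    return sorted(REQUIRED_COMPONENTS - template_components)


def is_stale(last_reviewed: date, today: date, max_age_days: int = 180) -> bool:
    """Flag a golden path that has gone unreviewed for too long."""
    return (today - last_reviewed).days > max_age_days
```

A template that only ships `repo_scaffold` and `ci_pipeline` would report three missing components, which matches the "fragment" case described above.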

Start with the three golden paths that cover the most developer time: creating a new service from a template, deploying a change to production, and debugging a production incident.

Guardrails, Not Gates

Guardrails prevent unsafe actions automatically without requiring approval workflows. Gates require someone to approve before work can proceed. Guardrails scale; gates do not.

Examples of guardrails:

OPA/Gatekeeper policies that reject Kubernetes manifests whose containers run as root.
CI pipeline steps that fail the build if container images have critical CVEs.
HashiCorp Sentinel policies that prevent Terraform from provisioning resources outside approved regions.
Network policies that deny all ingress by default and require explicit allowlisting.
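To show the shape of the first guardrail, here is the run-as-root check written as a plain Python function over a pod spec dict. This is a stand-in for illustration only: a real Gatekeeper policy would be written in Rego and enforced at admission time. The function name and the strictness rule (a container passes only with an explicit non-zero `runAsUser` or `runAsNonRoot: true`) are assumptions.

```python
def root_violations(pod_spec: dict) -> list[str]:
    """Return names of containers that are not guaranteed to run as non-root."""
    pod_ctx = pod_spec.get("securityContext", {})
    violations = []
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext", {})
        # Container-level settings take precedence over pod-level ones.
        run_as_user = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        if run_as_user == 0:
            violations.append(container["name"])
        elif run_as_user is None and not non_root:
            # Nothing rules out root, so treat the container as unsafe.
            violations.append(container["name"])
    return violations
```

Because the check returns the offending container names, the rejection message can tell the developer exactly what to fix, which is part of what makes a guardrail tolerable.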

Examples of gates (use sparingly):

Production deployment approval for regulated environments.
Budget approval for resources exceeding a cost threshold.
Security review for services handling PII.

The platform team’s job is to convert gates into guardrails wherever possible. If every deployment requires manual security review, automate the security checks and only flag the exceptions for human review.
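Converting a gate into a guardrail often comes down to a triage step: automated checks decide most releases, and only the exceptions reach a person. The sketch below assumes a hypothetical `Release` record populated by earlier pipeline stages; the fields and the three outcomes are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Release:
    service: str
    critical_cves: int
    handles_pii: bool
    policy_checks_passed: bool


def triage(release: Release) -> str:
    """Return 'blocked', 'needs-review', or 'auto-approved'."""
    if release.critical_cves > 0 or not release.policy_checks_passed:
        # Guardrail: fails automatically, with no reviewer involved.
        return "blocked"
    if release.handles_pii:
        # Remaining gate, reserved for the exceptional case.
        return "needs-review"
    return "auto-approved"
```

With this split, reviewers see only the PII-handling releases that already passed every automated check, rather than every deployment.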

When to Invest in Each Capability

Not every organization needs every capability immediately. Prioritize based on where developers lose the most time.

Developer Pain Point	Platform Capability	Priority
“I don’t know what services exist”	Service catalog	High — foundational
“Setting up CI/CD takes days”	Golden pipelines	High — immediate ROI
“I can’t get a test environment”	Environment management	High if env bottleneck exists
“I don’t know how to deploy”	Documentation, templates	High — low effort
“Security review blocks every release”	Automated guardrails	Medium — needs policy buy-in
“I can’t find the right dashboard”	Auto-configured observability	Medium — depends on scale
“Provisioning a database takes a week”	Self-service infrastructure	Medium — high effort to build

Start with the service catalog and golden pipelines. These two capabilities provide the highest return for the lowest investment and create the foundation that other capabilities build on.