Why a Maturity Model#

Platform engineering investments fail when organizations skip levels. A team that cannot maintain shared Terraform modules reliably has no business building a self-service portal. The maturity model provides an honest assessment of where you are and what must be true before advancing.

This is not a five-year roadmap. Some organizations reach Level 2 and stay there — it serves their needs. The model helps you identify what level you need, what level you are at, and what is blocking progress.

Level 0: Ad-Hoc Scripts#

Every team provisions its own infrastructure. Bash scripts, manual console clicks, copy-pasted Terraform from Stack Overflow, hand-edited Kubernetes manifests. No standardization, no shared tooling.

Indicators:

  • Infrastructure provisioning requires SSH access or cloud console clicks.
  • Each team has its own CI/CD setup, understood only by whoever configured it originally.
  • No central inventory of running services — discovery happens through tribal knowledge.
  • Deployments involve runbooks with 20+ manual steps.
  • Time-to-deploy for a new service is measured in weeks.

Assessment criteria:

  • Count the number of distinct CI/CD configurations across teams. If every team has a unique pipeline structure, you are at Level 0.
  • Ask three developers how to provision a database. If you get three different answers, you are at Level 0.

What triggers progression: A production outage caused by inconsistent infrastructure, or a new hire taking three weeks to deploy their first change. Pain is the trigger — executive mandate alone does not work.

Level 1: Shared Tooling#

The platform team (often just one or two people) creates reusable components: Terraform modules, Helm chart libraries, CI/CD pipeline templates, container base images. Teams consume these voluntarily.

Indicators:

  • Centralized Terraform module registry with versioned modules for common resources.
  • Shared CI/CD pipeline templates (GitHub Actions reusable workflows, Jenkins shared libraries).
  • Base container images maintained by the platform team.
  • A wiki page or README listing available shared tooling.

Example shared module structure:

platform-modules/
  terraform/
    rds-instance/        # Versioned, opinionated RDS module
    s3-bucket/           # Standard bucket with encryption defaults
    vpc/                 # Networking with sensible defaults
  helm-charts/
    service-template/    # Standard microservice chart
  github-actions/
    ci-pipeline/         # Reusable build-test-scan-deploy workflow
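
Consuming the shared pipeline should take only a few lines in each service repository. A minimal sketch of a caller workflow, assuming the platform team publishes the reusable workflow at example-org/platform-modules/.github/workflows/ci-pipeline.yml and tags releases; the org name, path, and input names here are assumptions:

# .github/workflows/ci.yml in a consuming service repository
name: service-ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  ci:
    # Pin to a tagged release of the shared pipeline so upgrades are deliberate
    uses: example-org/platform-modules/.github/workflows/ci-pipeline.yml@v2
    with:
      service-name: payments-api
    secrets: inherit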

Assessment criteria:

  • What percentage of teams use the shared modules? Below 40% means adoption is effectively optional, whatever the official policy says.
  • Are shared modules versioned and documented? If developers cannot find or understand them, they do not exist.

Common stall: The platform team ships modules but never markets them. Adoption stays at 30% because developers do not know the modules exist or do not trust them. The fix is embedding with one team, proving the value, and letting word spread.

What triggers progression: Ticket volume for infrastructure requests overwhelms the platform team. The shared modules reduce complexity but the request model (file a ticket, wait for someone) does not scale.

Level 2: Self-Service Portal#

Developers provision infrastructure and create services without filing tickets. A portal or CLI backed by automation handles provisioning within guardrails the platform team defines.

Indicators:

  • Backstage, Port, or a custom portal where developers create services from templates.
  • Crossplane claims or Terraform modules triggered by a Git commit or portal action.
  • Automated environment provisioning — PR environments spin up without human intervention.
  • Service catalog with ownership, dependencies, and operational metadata.
  • Policy enforcement via OPA, Kyverno, or Sentinel — not manual review.
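
Templates are what make the first indicator concrete. A minimal sketch of a Backstage software template for the database flow described next; the parameter names, skeleton path, and repository details are assumptions, not a reference implementation:

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: new-postgresql-database
  title: New PostgreSQL Database
spec:
  owner: group:platform
  type: resource
  parameters:
    - title: Database details
      required: [name, size, environment]
      properties:
        name:
          type: string
        size:
          type: string
          enum: [small, medium, large]
        environment:
          type: string
          enum: [staging, production]
  steps:
    # Render a Crossplane claim from a skeleton checked into the template repo
    - id: render
      name: Render claim
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          size: ${{ parameters.size }}
          environment: ${{ parameters.environment }}
    # Open a PR against the infrastructure repo rather than applying directly
    - id: pr
      name: Open pull request
      action: publish:github:pull-request
      input:
        repoUrl: github.com?owner=example-org&repo=infrastructure
        branchName: add-${{ parameters.name }}-db
        title: Add ${{ parameters.name }} database claim
        description: Generated by the New PostgreSQL Database template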

Example self-service flow:

Developer selects "New PostgreSQL Database" in portal
  -> Fills in: name, size (small/medium/large), environment
  -> Portal creates a PR with Crossplane claim YAML
  -> CI validates against OPA policies
  -> Merge triggers ArgoCD sync
  -> Database provisioned, credentials injected into Vault
  -> Developer receives connection string in 10 minutes
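
The claim the portal commits can stay small because the platform's Composition hides the provider details. A sketch of what that PR might contain, assuming a hypothetical XRD published by the platform team; the API group, kind, and parameter names are assumptions:

apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    size: small            # small/medium/large map to instance class and storage
    environment: staging
  compositionSelector:
    matchLabels:
      provider: aws
  writeConnectionSecretToRef:
    name: orders-db-conn   # connection details later synced into Vault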

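The flow above names OPA, but the same guardrail can be expressed in Kyverno (also listed in the indicators), which keeps the policy in plain YAML. A minimal sketch that rejects claims requesting an unapproved size:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-database-size
spec:
  validationFailureAction: Enforce
  rules:
    - name: allowed-sizes-only
      match:
        any:
          - resources:
              kinds:
                - PostgreSQLInstance
      validate:
        message: "size must be one of small, medium, or large"
        pattern:
          spec:
            parameters:
              size: "small | medium | large"
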
Assessment criteria:

  • What percentage of infrastructure requests go through self-service vs tickets? Target: above 80%.
  • What is the median provisioning time? If self-service takes longer than a ticket, developers will not use it.
  • How many guardrail violations are caught automatically vs discovered in production?

Common stall: The portal covers only the easy cases. Databases and caches are self-service, but networking changes, IAM policies, and cross-account access still require tickets. Developers learn which requests are fast and which are slow, and their experience is inconsistent. The fix is prioritizing the long-tail requests that still require human intervention.

What triggers progression: The platform team realizes governance is reactive. Half the services turn out to be non-compliant with security standards, and nobody notices until audit time.

Level 3: Automated Governance#

Compliance and operational standards are enforced continuously, not checked periodically. The platform tracks service health, security posture, and architectural compliance in real time.

Indicators:

  • Scorecards in Port or Backstage that rate every service on production readiness: has runbook, has alerts, has up-to-date dependencies, passes security scan.
  • Automated remediation: a drift detection system that opens PRs to fix non-compliant configurations.
  • Cost allocation and showback — every team sees what their services cost.
  • SLO-based alerting configured automatically from service catalog metadata.
  • Dependency graph visualization — the platform knows what talks to what.

Example scorecard definition (Port):

{
  "title": "Production Readiness",
  "rules": [
    {"title": "Has on-call rotation", "query": ".properties.pagerduty_service != null"},
    {"title": "Has runbook", "query": ".properties.runbook_url != null"},
    {"title": "Container image < 30 days old", "query": ".properties.image_age_days < 30"},
    {"title": "No critical CVEs", "query": ".properties.critical_cves == 0"},
    {"title": "Test coverage > 70%", "query": ".properties.test_coverage > 70"}
  ]
}
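
The SLO-alerting indicator can be driven from the same catalog metadata: when a service declares an availability target, the platform renders alert rules for it. A hedged sketch using the Prometheus Operator's PrometheusRule resource; the metric names, labels, and the 99.9% target are assumptions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slo
  labels:
    generated-by: platform-catalog   # rendered from catalog metadata, not hand-written
spec:
  groups:
    - name: payments-api-availability
      rules:
        - alert: PaymentsApiSLOFastBurn
          # Error ratio over 5m compared against a fast-burn multiple (14.4x)
          # of the 99.9% target's error budget
          expr: |
            sum(rate(http_requests_total{service="payments-api", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{service="payments-api"}[5m]))
              > 14.4 * (1 - 0.999)
          for: 5m
          labels:
            severity: page
          annotations:
            summary: payments-api is burning its 99.9% availability error budget too fast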

Assessment criteria:

  • Can you answer “which services are not production-ready?” in under 30 seconds?
  • When a new compliance requirement emerges (encrypt all S3 buckets, rotate secrets every 90 days), how long does it take to assess current state and enforce the requirement?
  • Are governance checks blocking deployments or just reporting?

Common stall: Governance becomes punitive. The scorecard turns into a checklist that developers resent. The fix is making compliance the path of least resistance — templates that are compliant by default, automated fixes for common violations, and clear explanations of why each rule exists.

Level 4: Predictive Optimization#

The platform anticipates needs based on usage patterns, cost trends, and historical data. It recommends actions rather than waiting for requests.

Indicators:

  • Automatic right-sizing recommendations based on actual resource usage.
  • Predictive scaling — provisioning capacity before demand spikes based on historical patterns.
  • Proactive dependency updates — the platform opens PRs to update libraries routinely, so a newly published CVE rarely forces an emergency upgrade.
  • Cost anomaly detection with automatic investigation.
  • Capacity planning models that project infrastructure needs quarterly.
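
Of these, right-sizing is the most accessible starting point. A minimal sketch for Kubernetes workloads, assuming the Vertical Pod Autoscaler is installed and run in recommendation-only mode so the platform can surface suggestions without applying them; the deployment name is illustrative:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-recommender
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"        # recommend only; the platform turns recommendations into PRs
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]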

Assessment criteria:

  • Does the platform surface recommendations, or do teams discover problems themselves?
  • Is capacity planning data-driven or guesswork?
  • Can the platform predict the impact of an architecture change before it ships?

Common stall: Few organizations reach Level 4 genuinely. Many claim to by adding a cost dashboard and calling it predictive. True Level 4 requires significant data infrastructure and ML capability. Most organizations should not target Level 4 — Level 3 with good feedback loops serves all but the largest engineering organizations.

Progression Heuristics#

Signal                                              You Need to Move To
New hire cannot deploy in first week                Level 1
Platform team is a ticket bottleneck                Level 2
Audit findings surprise you                         Level 3
Cost overruns are discovered quarterly, not daily   Level 4

Assess quarterly. Be honest about where you are — claiming Level 2 while half your infrastructure is provisioned through tickets helps nobody.