## Why a Maturity Model
Platform engineering investments fail when organizations skip levels. A team that cannot maintain shared Terraform modules reliably has no business building a self-service portal. The maturity model provides an honest assessment of where you are and what must be true before advancing.
This is not a five-year roadmap. Some organizations reach Level 2 and stay there — it serves their needs. The model helps you identify what level you need, what level you are at, and what is blocking progress.
## Level 0: Ad-Hoc Scripts
Every team provisions their own infrastructure. Bash scripts, manual console clicks, copy-pasted Terraform from Stack Overflow, hand-edited Kubernetes manifests. No standardization, no shared tooling.
Indicators:
- Infrastructure provisioning requires SSH access or cloud console clicks.
- Each team has its own CI/CD setup, maintained by whoever configured it originally.
- No central inventory of running services — discovery happens through tribal knowledge.
- Deployments involve runbooks with 20+ manual steps.
- Time-to-deploy for a new service is measured in weeks.
Assessment criteria:
- Count the number of distinct CI/CD configurations across teams. If every team has a unique pipeline structure, you are at Level 0.
- Ask three developers how to provision a database. If you get three different answers, you are at Level 0.
What triggers progression: A production outage caused by inconsistent infrastructure, or a new hire taking three weeks to deploy their first change. Pain is the trigger — executive mandate alone does not work.
## Level 1: Shared Tooling
The platform team (often just one or two people) creates reusable components: Terraform modules, Helm chart libraries, CI/CD pipeline templates, container base images. Teams consume these voluntarily.
Indicators:
- Centralized Terraform module registry with versioned modules for common resources.
- Shared CI/CD pipeline templates (GitHub Actions reusable workflows, Jenkins shared libraries).
- Base container images maintained by the platform team.
- A wiki page or README listing available shared tooling.
Example shared module structure:
```
platform-modules/
  terraform/
    rds-instance/        # Versioned, opinionated RDS module
    s3-bucket/           # Standard bucket with encryption defaults
    vpc/                 # Networking with sensible defaults
  helm-charts/
    service-template/    # Standard microservice chart
  github-actions/
    ci-pipeline/         # Reusable build-test-scan-deploy workflow
```
Assessment criteria:
- What percentage of teams use the shared modules? Below 40% means adoption is effectively optional in practice, whatever the policy says.
- Are shared modules versioned and documented? If developers cannot find or understand them, they do not exist. (A consuming workflow is sketched below.)
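The versioning criterion is easiest to judge from the consumer's side: a team should be able to pin a shared pipeline the way it pins any other dependency. A minimal caller workflow sketch, assuming a hypothetical example-org/platform-modules repository that publishes the reusable ci-pipeline workflow under .github/workflows/ (repository name, inputs, and tag are illustrative):

```yaml
# .github/workflows/ci.yml in a consuming service repository
name: ci

on:
  push:
    branches: [main]
  pull_request:

jobs:
  pipeline:
    # Pin to a tagged release of the shared workflow so upgrades are deliberate
    uses: example-org/platform-modules/.github/workflows/ci-pipeline.yml@v1.4.0
    with:
      service-name: payments-api      # illustrative input defined by the shared workflow
      run-integration-tests: true
    secrets: inherit                  # pass repository secrets through to the shared jobs
```

If consuming the template requires reading the platform team's source to guess the inputs, the documentation criterion above is failing.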
Common stall: The platform team ships modules but never markets them. Adoption stays at 30% because developers do not know the modules exist or do not trust them. The fix is embedding with one team, proving the value, and letting word spread.
What triggers progression: Ticket volume for infrastructure requests overwhelms the platform team. The shared modules reduce complexity but the request model (file a ticket, wait for someone) does not scale.
## Level 2: Self-Service Portal
Developers provision infrastructure and create services without filing tickets. A portal or CLI backed by automation handles provisioning within guardrails the platform team defines.
Indicators:
- Backstage, Port, or a custom portal where developers create services from templates.
- Crossplane claims or Terraform modules triggered by a Git commit or portal action.
- Automated environment provisioning — PR environments spin up without human intervention.
- Service catalog with ownership, dependencies, and operational metadata.
- Policy enforcement via OPA, Kyverno, or Sentinel — not manual review (a minimal example follows this list).
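That last indicator is worth making concrete. A minimal Kyverno sketch that enforces one guardrail (only approved database sizes can be requested), assuming a hypothetical PostgreSQLInstance Crossplane claim kind; the resource kind and allowed values are illustrative:

```yaml
# Guardrail enforced at admission, not in a review meeting
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-database-size
spec:
  validationFailureAction: Enforce    # reject non-compliant claims outright
  rules:
    - name: allowed-sizes-only
      match:
        any:
          - resources:
              kinds:
                - PostgreSQLInstance  # hypothetical claim kind from the platform's XRD
      validate:
        message: "size must be one of: small, medium, large"
        pattern:
          spec:
            parameters:
              size: "small | medium | large"   # Kyverno pattern OR syntax
```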
Example self-service flow:
```
Developer selects "New PostgreSQL Database" in portal
  -> Fills in: name, size (small/medium/large), environment
  -> Portal creates a PR with Crossplane claim YAML
  -> CI validates against OPA policies
  -> Merge triggers ArgoCD sync
  -> Database provisioned, credentials injected into Vault
  -> Developer receives connection string in 10 minutes
```
Assessment criteria:
- What percentage of infrastructure requests go through self-service vs tickets? Target: above 80%.
- What is the median provisioning time? If self-service takes longer than a ticket, developers will not use it.
- How many guardrail violations are caught automatically vs discovered in production?
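The claim the portal writes in step three of the flow above might look like the following sketch, assuming a hypothetical PostgreSQLInstance composite resource claim published by the platform team (API group, kind, and field names are illustrative):

```yaml
# PR content generated by the portal; reviewed by CI, not by a human gatekeeper
apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    size: small                      # validated against the OPA/Kyverno guardrails in CI
    environment: staging
  compositionSelector:
    matchLabels:
      provider: aws                  # the platform team decides what this expands to
  writeConnectionSecretToRef:
    name: orders-db-conn             # connection details land here, then sync to Vault
```

The developer never touches Terraform, IAM, or subnets; the composition behind the claim encodes those decisions, which is what keeps the guardrails enforceable.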
Common stall: The portal covers only the easy cases. Databases and caches are self-service, but networking changes, IAM policies, and cross-account access still require tickets. Developers learn which requests are fast and which are slow, and their experience is inconsistent. The fix is prioritizing the long-tail requests that still require human intervention.
What triggers progression: The platform team realizes governance is reactive. Scorecards reveal half the services are non-compliant with security standards but nobody notices until audit time.
## Level 3: Automated Governance
Compliance and operational standards are enforced continuously, not checked periodically. The platform tracks service health, security posture, and architectural compliance in real time.
Indicators:
- Scorecards in Port or Backstage that rate every service on production readiness: has runbook, has alerts, has up-to-date dependencies, passes security scan.
- Automated remediation: a drift detection system that opens PRs to fix non-compliant configurations.
- Cost allocation and showback — every team sees what their services cost.
- SLO-based alerting configured automatically from service catalog metadata (an example rendering follows this list).
- Dependency graph visualization — the platform knows what talks to what.
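The SLO indicator becomes tangible when you look at what the platform renders. A sketch, assuming the platform templates a Prometheus Operator rule from hypothetical catalog fields (owner, SLO target, runbook URL); metric names and thresholds are illustrative:

```yaml
# Rendered by the platform from catalog metadata; never hand-edited
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slo
  labels:
    team: payments                   # catalog ownership field
spec:
  groups:
    - name: payments-api.slo
      rules:
        - alert: PaymentsApiErrorBudgetBurn
          expr: |
            sum(rate(http_requests_total{service="payments-api", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{service="payments-api"}[5m]))
              > (1 - 0.999)          # SLO target pulled from the catalog: 99.9%
          for: 10m
          labels:
            severity: page
          annotations:
            runbook_url: https://runbooks.example.org/payments-api   # catalog field
```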
Example scorecard definition (Port):
```json
{
  "title": "Production Readiness",
  "rules": [
    {"title": "Has on-call rotation", "query": ".properties.pagerduty_service != null"},
    {"title": "Has runbook", "query": ".properties.runbook_url != null"},
    {"title": "Container image < 30 days old", "query": ".properties.image_age_days < 30"},
    {"title": "No critical CVEs", "query": ".properties.critical_cves == 0"},
    {"title": "Test coverage > 70%", "query": ".properties.test_coverage > 70"}
  ]
}
```
Assessment criteria:
- Can you answer “which services are not production-ready?” in under 30 seconds?
- When a new compliance requirement emerges (encrypt all S3 buckets, rotate secrets every 90 days), how long does it take to assess current state and enforce the requirement?
- Are governance checks blocking deployments or just reporting?
Common stall: Governance becomes punitive. The scorecard turns into a checklist that developers resent. The fix is making compliance the path of least resistance — templates that are compliant by default, automated fixes for common violations, and clear explanations of why each rule exists.
## Level 4: Predictive Optimization
The platform anticipates needs based on usage patterns, cost trends, and historical data. It recommends actions rather than waiting for requests.
Indicators:
- Automatic right-sizing recommendations based on actual resource usage (a building block is sketched after this list).
- Predictive scaling — provisioning capacity before demand spikes based on historical patterns.
- Proactive dependency updates — the platform opens PRs to update libraries routinely, rather than waiting for a published CVE to force an emergency upgrade.
- Cost anomaly detection with automatic investigation.
- Capacity planning models that project infrastructure needs quarterly.
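Level 4 is mostly a data problem, but the right-sizing indicator has an approachable building block: the Kubernetes Vertical Pod Autoscaler running in recommendation-only mode, whose output the platform can aggregate into the recommendations described above. A sketch with illustrative workload names:

```yaml
# Recommendation-only VPA: computes suggested requests, never mutates the workload
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-recommender
  namespace: team-payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"                # read recommendations from status; apply them via PR, not in place
```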
Assessment criteria:
- Does the platform surface recommendations, or do teams discover problems themselves?
- Is capacity planning data-driven or guesswork?
- Can the platform predict the impact of an architecture change before it ships?
Common stall: Few organizations genuinely reach Level 4. Many claim to by adding a cost dashboard and calling it predictive. True Level 4 requires significant data infrastructure and ML capability. Most organizations should not target Level 4 — Level 3 with good feedback loops serves all but the largest engineering organizations.
## Progression Heuristics
| Signal | You Need to Move To |
|---|---|
| New hire cannot deploy in first week | Level 1 |
| Platform team is a ticket bottleneck | Level 2 |
| Audit findings surprise you | Level 3 |
| Cost overruns are discovered quarterly, not daily | Level 4 |
Assess quarterly. Be honest about where you are — claiming Level 2 while half your infrastructure is provisioned through tickets helps nobody.