Why a Maturity Model#

Platform engineering investments fail when organizations skip levels. A team that cannot maintain shared Terraform modules reliably has no business building a self-service portal. The maturity model provides an honest assessment of where you are and what must be true before advancing.

This is not a five-year roadmap. Some organizations reach Level 2 and stay there — it serves their needs. The model helps you identify what level you need, what level you are at, and what is blocking progress.

Level 0: Ad-Hoc Scripts#

Every team provisions its own infrastructure. Bash scripts, manual console clicks, copy-pasted Terraform from Stack Overflow, hand-edited Kubernetes manifests. No standardization, no shared tooling.

Indicators:

  • Infrastructure provisioning requires SSH access or cloud console clicks.
  • Each team has its own CI/CD setup, understood only by whoever configured it originally.
  • No central inventory of running services — discovery happens through tribal knowledge.
  • Deployments involve runbooks with 20+ manual steps.
  • Time-to-deploy for a new service is measured in weeks.

Assessment criteria:

  • Count the number of distinct CI/CD configurations across teams. If every team has a unique pipeline structure, you are at Level 0.
  • Ask three developers how to provision a database. If you get three different answers, you are at Level 0.

What triggers progression: A production outage caused by inconsistent infrastructure, or a new hire taking three weeks to deploy their first change. Pain is the trigger — executive mandate alone does not work.

Level 1: Shared Tooling#

The platform team (often just one or two people) creates reusable components: Terraform modules, Helm chart libraries, CI/CD pipeline templates, container base images. Teams consume these voluntarily.

Indicators:

  • Centralized Terraform module registry with versioned modules for common resources.
  • Shared CI/CD pipeline templates (GitHub Actions reusable workflows, Jenkins shared libraries).
  • Base container images maintained by the platform team.
  • A wiki page or README listing available shared tooling.

Example shared module structure:

platform-modules/
  terraform/
    rds-instance/        # Versioned, opinionated RDS module
    s3-bucket/           # Standard bucket with encryption defaults
    vpc/                 # Networking with sensible defaults
  helm-charts/
    service-template/    # Standard microservice chart
  github-actions/
    ci-pipeline/         # Reusable build-test-scan-deploy workflow
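
Consuming the shared pipeline should take only a few lines in each service repository. A minimal sketch of a caller workflow, assuming the platform team publishes the reusable workflow at example-org/platform-modules/.github/workflows/ci-pipeline.yml and tags releases; the org name, path, and input names here are assumptions:

# .github/workflows/ci.yml in a consuming service repository
name: service-ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  ci:
    # Pin to a tagged release of the shared pipeline so upgrades are deliberate
    uses: example-org/platform-modules/.github/workflows/ci-pipeline.yml@v2
    with:
      service-name: payments-api
    secrets: inherit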

Assessment criteria:

  • What percentage of teams use the shared modules? Below 40% means adoption is effectively optional, whatever the official policy says.
  • Are shared modules versioned and documented? If developers cannot find or understand them, they do not exist.

Common stall: The platform team ships modules but never markets them. Adoption stays at 30% because developers do not know the modules exist or do not trust them. The fix is embedding with one team, proving the value, and letting word spread.

What triggers progression: Ticket volume for infrastructure requests overwhelms the platform team. The shared modules reduce complexity but the request model (file a ticket, wait for someone) does not scale.

Level 2: Self-Service Portal#

Developers provision infrastructure and create services without filing tickets. A portal or CLI backed by automation handles provisioning within guardrails the platform team defines.

Indicators:

  • Backstage, Port, or a custom portal where developers create services from templates.
  • Crossplane claims or Terraform modules triggered by a Git commit or portal action.
  • Automated environment provisioning — PR environments spin up without human intervention.
  • Service catalog with ownership, dependencies, and operational metadata.
  • Policy enforcement via OPA, Kyverno, or Sentinel — not manual review.
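
Templates are what make the first indicator concrete. A minimal sketch of a Backstage software template for the database flow described next; the parameter names, skeleton path, and repository details are assumptions, not a reference implementation:

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: new-postgresql-database
  title: New PostgreSQL Database
spec:
  owner: group:platform
  type: resource
  parameters:
    - title: Database details
      required: [name, size, environment]
      properties:
        name:
          type: string
        size:
          type: string
          enum: [small, medium, large]
        environment:
          type: string
          enum: [staging, production]
  steps:
    # Render a Crossplane claim from a skeleton checked into the template repo
    - id: render
      name: Render claim
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          size: ${{ parameters.size }}
          environment: ${{ parameters.environment }}
    # Open a PR against the infrastructure repo rather than applying directly
    - id: pr
      name: Open pull request
      action: publish:github:pull-request
      input:
        repoUrl: github.com?owner=example-org&repo=infrastructure
        branchName: add-${{ parameters.name }}-db
        title: Add ${{ parameters.name }} database claim
        description: Generated by the New PostgreSQL Database template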

Example self-service flow:

Developer selects "New PostgreSQL Database" in portal
  -> Fills in: name, size (small/medium/large), environment
  -> Portal creates a PR with Crossplane claim YAML
  -> CI validates against OPA policies
  -> Merge triggers ArgoCD sync
  -> Database provisioned, credentials injected into Vault
  -> Developer receives connection string in 10 minutes
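
The claim the portal commits can stay small because the platform's Composition hides the provider details. A sketch of what that PR might contain, assuming a hypothetical XRD published by the platform team; the API group, kind, and parameter names are assumptions:

apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    size: small            # small/medium/large map to instance class and storage
    environment: staging
  compositionSelector:
    matchLabels:
      provider: aws
  writeConnectionSecretToRef:
    name: orders-db-conn   # connection details later synced into Vault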

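The flow above names OPA, but the same guardrail can be expressed in Kyverno (also listed in the indicators), which keeps the policy in plain YAML. A minimal sketch that rejects claims requesting an unapproved size:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-database-size
spec:
  validationFailureAction: Enforce
  rules:
    - name: allowed-sizes-only
      match:
        any:
          - resources:
              kinds:
                - PostgreSQLInstance
      validate:
        message: "size must be one of small, medium, or large"
        pattern:
          spec:
            parameters:
              size: "small | medium | large"
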
Assessment criteria:

  • What percentage of infrastructure requests go through self-service vs tickets? Target: above 80%.
  • What is the median provisioning time? If self-service takes longer than a ticket, developers will not use it.
  • How many guardrail violations are caught automatically vs discovered in production?

Common stall: The portal covers only the easy cases. Databases and caches are self-service, but networking changes, IAM policies, and cross-account access still require tickets. Developers learn which requests are fast and which are slow, and their experience is inconsistent. The fix is prioritizing the long-tail requests that still require human intervention.

What triggers progression: The platform team realizes governance is reactive. Half the services turn out to be non-compliant with security standards, and nobody notices until audit time.

Level 3: Automated Governance#

Compliance and operational standards are enforced continuously, not checked periodically. The platform tracks service health, security posture, and architectural compliance in real time.

Indicators:

  • Scorecards in Port or Backstage that rate every service on production readiness: has runbook, has alerts, has up-to-date dependencies, passes security scan.
  • Automated remediation: a drift detection system that opens PRs to fix non-compliant configurations.
  • Cost allocation and showback — every team sees what their services cost.
  • SLO-based alerting configured automatically from service catalog metadata.
  • Dependency graph visualization — the platform knows what talks to what.

Example scorecard definition (Port):

{
  "title": "Production Readiness",
  "rules": [
    {"title": "Has on-call rotation", "query": ".properties.pagerduty_service != null"},
    {"title": "Has runbook", "query": ".properties.runbook_url != null"},
    {"title": "Container image < 30 days old", "query": ".properties.image_age_days < 30"},
    {"title": "No critical CVEs", "query": ".properties.critical_cves == 0"},
    {"title": "Test coverage > 70%", "query": ".properties.test_coverage > 70"}
  ]
}
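
The SLO-alerting indicator can be driven from the same catalog metadata: when a service declares an availability target, the platform renders alert rules for it. A hedged sketch using the Prometheus Operator's PrometheusRule resource; the metric names, labels, and the 99.9% target are assumptions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slo
  labels:
    generated-by: platform-catalog   # rendered from catalog metadata, not hand-written
spec:
  groups:
    - name: payments-api-availability
      rules:
        - alert: PaymentsApiSLOFastBurn
          # Error ratio over 5m compared against a fast-burn multiple (14.4x)
          # of the 99.9% target's error budget
          expr: |
            sum(rate(http_requests_total{service="payments-api", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{service="payments-api"}[5m]))
              > 14.4 * (1 - 0.999)
          for: 5m
          labels:
            severity: page
          annotations:
            summary: payments-api is burning its 99.9% availability error budget too fast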

Assessment criteria:

  • Can you answer “which services are not production-ready?” in under 30 seconds?
  • When a new compliance requirement emerges (encrypt all S3 buckets, rotate secrets every 90 days), how long does it take to assess current state and enforce the requirement?
  • Are governance checks blocking deployments or just reporting?

Common stall: Governance becomes punitive. The scorecard turns into a checklist that developers resent. The fix is making compliance the path of least resistance — templates that are compliant by default, automated fixes for common violations, and clear explanations of why each rule exists.

Level 4: Predictive Optimization#

The platform anticipates needs based on usage patterns, cost trends, and historical data. It recommends actions rather than waiting for requests.

Indicators:

  • Automatic right-sizing recommendations based on actual resource usage.
  • Predictive scaling — provisioning capacity before demand spikes based on historical patterns.
  • Proactive dependency updates — the platform opens PRs to update libraries routinely, so a newly published CVE rarely forces an emergency upgrade.
  • Cost anomaly detection with automatic investigation.
  • Capacity planning models that project infrastructure needs quarterly.
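
Of these, right-sizing is the most accessible starting point. A minimal sketch for Kubernetes workloads, assuming the Vertical Pod Autoscaler is installed and run in recommendation-only mode so the platform can surface suggestions without applying them; the deployment name is illustrative:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-recommender
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"        # recommend only; the platform turns recommendations into PRs
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]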

Assessment criteria:

  • Does the platform surface recommendations, or do teams discover problems themselves?
  • Is capacity planning data-driven or guesswork?
  • Can the platform predict the impact of an architecture change before it ships?

Common stall: Few organizations reach Level 4 genuinely. Many claim to by adding a cost dashboard and calling it predictive. True Level 4 requires significant data infrastructure and ML capability. Most organizations should not target Level 4 — Level 3 with good feedback loops serves all but the largest engineering organizations.

Progression Heuristics#

Signal                                              You Need to Move To
New hire cannot deploy in first week                Level 1
Platform team is a ticket bottleneck                Level 2
Audit findings surprise you                         Level 3
Cost overruns are discovered quarterly, not daily   Level 4

Assess quarterly. Be honest about where you are — claiming Level 2 while half your infrastructure is provisioned through tickets helps nobody.