## Why Change Management Matters
Most production incidents trace back to a change. Code deployments, configuration updates, infrastructure modifications, database migrations – each introduces risk. Change management reduces that risk through structure, visibility, and accountability. The goal is not to prevent change but to make change safe, visible, and reversible.
## Change Request Process
Every infrastructure change flows through a structured request. The formality scales with risk, but the basic elements remain constant.
```yaml
change_request:
  id: "CR-2026-0451"
  title: "Upgrade Redis cluster from 7.0 to 7.2"
  requester: "platform-team"
  type: "normal"          # standard | normal | emergency
  risk_level: "medium"    # low | medium | high | critical
  environment: "production"
  services_affected:
    - "redis-cluster-primary"
    - "session-cache"
  implementation_plan: |
    1. Update Helm values for Redis chart version
    2. Apply to staging, run integration tests
    3. Apply to production using rolling update strategy
    4. Verify cluster health and replication status
  rollback_plan: |
    1. Revert Helm values to previous chart version
    2. Apply rollback with helm rollback redis-cluster
    3. Verify cluster re-forms with old version
  change_window: "2026-02-25 02:00-04:00 UTC"
  approvers: ["oncall-sre", "redis-team-lead"]
  estimated_duration: "45 minutes"
```

Standard changes are pre-approved, low-risk, repeatable operations (scaling a deployment, rotating a certificate). They follow a template and do not require per-instance approval.
Normal changes require review and approval. Most infrastructure changes fall here and follow the full request, review, approve, schedule, and execute workflow.
Emergency changes bypass normal approval when production is down or at immediate risk. They still require documentation after the fact, and every emergency change triggers a retrospective.
## Risk Assessment Framework
Score every change along four dimensions from 1 (low) to 4 (critical):
| Dimension | 1 - Low | 2 - Medium | 3 - High | 4 - Critical |
|---|---|---|---|---|
| Blast radius | Single pod/service | Multiple services | Entire namespace | Cross-cluster or data layer |
| Reversibility | Instant rollback | Rollback under 5 min | Rollback requires downtime | Irreversible (data migration) |
| Confidence | Done 50+ times | Done several times | First time, tested in staging | First time, no staging equivalent |
| Timing | Off-peak | Normal hours | Business hours, moderate traffic | Peak traffic |
Composite risk score (sum of four scores):
- 4-6: Low risk. Standard change process. Single approver.
- 7-10: Medium risk. Two approvers. Monitoring dashboards open during execution.
- 11-13: High risk. Extended review. Dedicated rollback operator. Incident channel open.
- 14-16: Critical risk. Director-level approval. Full war-room staffing.
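This mapping is easy to make mechanical. A minimal sketch that sums the four dimension scores and returns the approval tier from the list above (the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RiskAssessment:
    blast_radius: int    # each dimension scored 1-4 per the table above
    reversibility: int
    confidence: int
    timing: int

    def composite(self) -> int:
        return self.blast_radius + self.reversibility + self.confidence + self.timing

    def tier(self) -> str:
        score = self.composite()
        if score <= 6:
            return "low"        # standard process, single approver
        if score <= 10:
            return "medium"     # two approvers, dashboards open
        if score <= 13:
            return "high"       # extended review, rollback operator, incident channel
        return "critical"       # director approval, war room

# Example scoring for the Redis upgrade above (assumed dimension scores):
# multiple services, fast rollback, done several times before, off-peak window
print(RiskAssessment(blast_radius=2, reversibility=2, confidence=2, timing=1).tier())  # "medium"
```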
## Rollback Criteria
Define rollback triggers before executing a change. These are pre-committed conditions, not judgment calls during an incident.
```yaml
rollback_triggers:
  immediate:
    - "Error rate exceeds 5% for 2 consecutive minutes"
    - "P99 latency exceeds 3x baseline for 3 minutes"
    - "Health check failures on more than 20% of pods"
  time_bounded:
    - condition: "Error rate above 1%"
      duration: "15 minutes after deployment completes"
  manual_review:
    - "Any unexpected log pattern not seen in staging"
    - "Dependent service reports degradation"
```

Decide when to roll back before you start, not during the stress of a failure. If the conditions are met, roll back without debate.
Rollback execution: announce in the change channel, execute the pre-documented procedure, verify that the metrics that triggered the rollback have recovered, update the change request status, and create a follow-up investigation ticket.
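An agent can pre-commit the immediate triggers as code and evaluate them continuously during and after the change. A minimal sketch, assuming a hypothetical metrics client that exposes windowed `error_rate`, `p99_latency`, and `healthy_pod_fraction` readings:

```python
def should_roll_back(metrics, baseline_p99: float) -> str | None:
    """Return the triggered condition, or None if the change may proceed."""
    # Immediate triggers from rollback_triggers above; thresholds are pre-committed
    if metrics.error_rate(window="2m") > 0.05:
        return "error rate above 5% for 2 consecutive minutes"
    if metrics.p99_latency(window="3m") > 3 * baseline_p99:
        return "P99 latency above 3x baseline for 3 minutes"
    if metrics.healthy_pod_fraction() < 0.80:
        return "health check failures on more than 20% of pods"
    return None
```

Only the immediate triggers should fire an automatic rollback; the time-bounded and manual-review triggers still need a scheduled re-check or a human decision.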
## Change Windows
Change windows constrain risk to periods of lower impact and higher staffing.
Production change windows:

- Standard: Any time (automated, pre-approved)
- Normal: Tuesday-Thursday, 02:00-06:00 UTC
- Emergency: Any time (with incident commander approval)

Exclusions:

- No normal changes within 48 hours of a major release
- No normal changes during peak traffic
- No changes on company holidays
- Respect change freeze periods

An agent executing changes should programmatically verify the window:
```python
def can_execute_change(change_request, current_time):
    # Emergency changes bypass the window but require incident commander sign-off
    if change_request.type == "emergency":
        return has_incident_commander_approval(change_request)
    # An active freeze blocks everything that is not an emergency
    if is_change_freeze_active(current_time):
        return False
    # Standard changes are pre-approved and may run at any time
    if change_request.type == "standard":
        return True
    # Normal changes must fall inside the window and carry their approvals
    if not is_within_change_window(current_time):
        return False
    return has_required_approvals(change_request)
```
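The helper functions are left to your own tooling. A minimal sketch of `is_within_change_window`, assuming only the normal-change window from the schedule above (Tuesday-Thursday, 02:00-06:00 UTC):

```python
from datetime import datetime, timezone

def is_within_change_window(current_time: datetime) -> bool:
    """True if current_time falls inside the normal-change window:
    Tuesday-Thursday, 02:00-06:00 UTC. Assumes a timezone-aware datetime."""
    utc = current_time.astimezone(timezone.utc)
    in_window_days = utc.weekday() in (1, 2, 3)   # Monday is 0, so Tue/Wed/Thu
    in_window_hours = 2 <= utc.hour < 6
    return in_window_days and in_window_hours
```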
## Progressive Rollouts

Never deploy a change to 100% of production simultaneously. Progressive rollouts catch problems while the blast radius is still small.
- Step 1 – Canary (1-5%): Deploy to a single pod or a small traffic slice. Monitor for 15-30 minutes.
- Step 2 – Limited (10-25%): Expand to a quarter of the fleet. Monitor for 15-30 minutes.
- Step 3 – Broad (50%): Half the fleet runs the new version. Monitor for 15 minutes.
- Step 4 – Full (100%): Complete the deployment. Monitor for the bake period (1-2 hours).
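Without a dedicated rollout controller, an agent can drive these stages itself. A minimal sketch that reuses the `should_roll_back` check from the rollback section; `deploy.set_weight` and `deploy.rollback` are hypothetical helpers for your deployment tooling, and the bake times are assumptions:

```python
import time

# (traffic weight %, bake seconds) mirroring the stages above
STAGES = [(5, 30 * 60), (25, 30 * 60), (50, 15 * 60), (100, 2 * 60 * 60)]

def progressive_rollout(deploy, metrics, baseline_p99: float) -> bool:
    """Advance through the stages, rolling back if a pre-committed trigger fires."""
    for weight, bake_seconds in STAGES:
        deploy.set_weight(weight)
        deadline = time.monotonic() + bake_seconds
        while time.monotonic() < deadline:
            reason = should_roll_back(metrics, baseline_p99)
            if reason:
                deploy.rollback(reason)
                return False
            time.sleep(60)  # re-check metrics once a minute during the bake
    return True
```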
For fine-grained control, use Argo Rollouts:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 15m }
        - setWeight: 25
        - pause: { duration: 15m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
```
## Change Freeze Policies

Change freezes prohibit non-emergency changes during high-risk periods: end-of-quarter, major product launches, holiday periods with reduced staffing, or after a major incident until the post-mortem is complete.
```yaml
change_freeze:
  name: "Q1 2026 End-of-Quarter"
  start: "2026-03-28T00:00:00Z"
  end: "2026-04-02T00:00:00Z"
  scope: "all-production"
  exceptions:
    - "Security patches rated Critical or High"
    - "Fixes for active production incidents"
  enforcement: "automated"
  approver_for_exceptions: "vp-engineering"
```

Enforce freezes in your deployment pipeline. The deployment tool should query the freeze schedule and block non-exempt changes automatically.
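A minimal sketch of that check, assuming the freeze records are loaded into a list of dicts with the field names above (shown with the freeze list passed explicitly; the earlier can_execute_change sketch assumes it is loaded internally):

```python
from datetime import datetime, timezone

def is_change_freeze_active(current_time: datetime, freezes: list[dict]) -> bool:
    """True if any freeze record covers current_time. The exception path
    (critical security patches, incident fixes) is deliberately left to
    a human approver rather than automated here."""
    now = current_time.astimezone(timezone.utc)
    for freeze in freezes:
        start = datetime.fromisoformat(freeze["start"].replace("Z", "+00:00"))
        end = datetime.fromisoformat(freeze["end"].replace("Z", "+00:00"))
        if start <= now < end:
            return True
    return False
```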
## Step-by-Step Change Execution Workflow
1. Submit change request. Fill out all fields, including the implementation plan, rollback plan, and rollback criteria.
2. Risk assessment. Score the change across the four dimensions to determine the approval path.
3. Peer review. At least one engineer reviews the plan and rollback procedure.
4. Approval. Required approvers sign off based on risk level.
5. Schedule. Assign to an approved change window. Notify affected teams.
6. Pre-flight checks. Verify monitoring works, rollback tooling is accessible, and no freeze is active. Take backups if applicable (a sketch of this gate follows the list).
7. Announce. Post in the change channel with the CR ID, description, ETA, and a reference to the rollback criteria.
8. Execute. Follow the implementation plan step by step. Verify expected outcomes before proceeding.
9. Monitor. Watch dashboards for the bake period. Compare against baseline metrics.
10. Validate. Run smoke tests or synthetic checks against the changed system.
11. Close. Update the change request to completed. Post a summary with before/after metrics.
If rollback criteria are triggered during steps 8-10, execute the rollback plan immediately.
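Step 6 lends itself to automation. A minimal sketch of the pre-flight gate; every check function here is a hypothetical stand-in for your own tooling, and it reuses the two-argument is_change_freeze_active sketch from the freeze section:

```python
def run_preflight_checks(change_request, current_time, freezes) -> list[str]:
    """Return a list of blocking problems; an empty list means clear to proceed."""
    problems = []
    if not monitoring_dashboards_reachable(change_request.services_affected):
        problems.append("monitoring unavailable for affected services")
    if not rollback_tooling_accessible(change_request):
        problems.append("rollback tooling not accessible")
    if change_request.type != "emergency" and is_change_freeze_active(current_time, freezes):
        problems.append("change freeze active")
    if change_request.requires_backup and not backup_completed(change_request):
        problems.append("pre-change backup missing")
    return problems
```

If the list is non-empty, stop and post the problems in the change channel rather than proceeding to execution.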
## Agent Operational Notes
- Never skip the rollback plan. Every change request must have one before execution begins.
- Log every action. Humans must be able to reconstruct exactly what the agent did and when.
- Respect approval gates. An agent should not self-approve its own changes.
- Fail safe. If monitoring data is unavailable during a rollout, treat it as a rollback trigger.
- Communicate proactively. Post status updates at each stage. Silence during a change is alarming.