Terraform Safety for Agents#
Terraform is the most dangerous tool most agents have access to. A single terraform apply can create, modify, or destroy real infrastructure — databases with production data, networking that carries live traffic, security groups that protect running services. There is no undo button. terraform destroy is not an undo — it is a different destructive action.
This article defines the safety protocols agents must follow when working with Terraform: what to check before every plan, how to read plan output for danger, how to present plans to humans, when to apply vs when to stop, and how to handle state conflicts.
The Cardinal Rules#
- Never run `terraform apply` without human approval. Not even for "safe" changes. The human reviews the plan output and explicitly approves.
- Never run `terraform apply` without a saved plan file. Always `plan -out=tfplan`, then `apply tfplan`. Never `apply` with no arguments — it recomputes the plan and might produce different results.
- Never run `terraform destroy` autonomously. Destruction requires explicit human instruction.
- Never force-unlock a state file without investigating who holds the lock.
- Never use `-target` in automation. It is a debugging tool for humans, not a workflow shortcut.
- Always run `plan` before any `apply`. Read the plan. Understand the plan. Present the plan. Then wait.
The Safe Workflow#
```
Agent receives task: "Add a new subnet to the VPC"

1. Read the current Terraform files (understand what exists)
2. Read the state file listing (terraform state list — understand what is managed)
3. Write the change (add the subnet resource block)
4. Run terraform plan -out=tfplan
5. Read the plan output carefully
6. Classify the risk level
7. Present the plan summary to the human
8. WAIT for approval
9. On approval: terraform apply tfplan
10. Verify: terraform plan (should show "No changes")
11. Report result to human
```

Steps 4-8 are the safety gate. The agent does not skip from step 3 to step 9.
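The gate in steps 4-8 can be enforced mechanically. A minimal sketch in Python, where `run` and `ask_human` are hypothetical injection points (your shell runner and your approval channel, not real APIs); the Terraform commands themselves are the ones from the workflow above:

```python
def gated_apply(run, ask_human):
    """Plan, present, wait for approval, then apply the reviewed plan file.

    `run(cmd)` executes a shell command and returns its output (hypothetical).
    `ask_human(summary)` blocks until a human approves or rejects (hypothetical).
    """
    run("terraform plan -out=tfplan")                 # step 4: save the plan
    summary = run("terraform show -no-color tfplan")  # steps 5-7: read and present
    if not ask_human(summary):                        # step 8: WAIT for approval
        return "aborted: human did not approve"
    run("terraform apply tfplan")                     # step 9: apply the reviewed file
    # Step 10: a clean follow-up plan confirms state matches the configuration.
    if "No changes" not in run("terraform plan -no-color"):
        return "warning: post-apply plan is not clean"
    return "applied"
```

Note that the apply uses the saved `tfplan` file, never a bare `apply`, so what runs is exactly what the human reviewed.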
Reading Plan Output for Danger#
Terraform plan output uses symbols to indicate what will happen:
```
+    create               (new resource, lowest risk)
~    update in-place      (modify existing, moderate risk)
-/+  destroy and recreate (replaces resource, HIGH risk for stateful resources)
-    destroy              (deletes resource, HIGH risk)
<=   read                 (data source refresh, no risk)
```

Danger Signals in Plan Output#
| Signal | What It Means | Risk Level | Agent Action |
|---|---|---|---|
| Any `-` (destroy) | A resource will be deleted | High | Always flag to human. Never auto-approve. |
| Any `-/+` (replace) | A resource will be destroyed and recreated | High | Flag to human. Stateful resources (databases, volumes) lose data on replace. |
| `forces replacement` in change detail | An attribute change requires destroying and recreating | High | Identify which attribute forced the replacement and flag it. |
| Changes to `aws_security_group` rules | Network access control is changing | Moderate | Summarize what ports/CIDRs are being added or removed. |
| Changes to IAM policies or roles | Permissions are changing | Moderate-High | Summarize what permissions are being granted or revoked. |
| `~` tags only | Only tags are changing | Low | Mention but do not flag as dangerous. |
| More resources changing than expected | The change should affect 1 resource but 5 are changing | Moderate | Investigate why. Possible state drift or unexpected dependency. |
| `(known after apply)` on critical attributes | Terraform cannot predict the value until apply | Low (usually normal) | Note it but do not flag unless the attribute is security-sensitive. |
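Rather than scraping the human-readable output, an agent can classify these signals from the machine-readable plan. A sketch assuming the documented JSON plan format produced by `terraform show -json tfplan`, where each `resource_changes` entry carries an `actions` list:

```python
import json

# Map Terraform's "actions" list to the risk levels described above.
# Keys follow the JSON plan format: a replace appears as a two-element
# list, ordered by whether it is destroy-before-create or the reverse.
RISK_BY_ACTIONS = {
    ("create",): "low",
    ("read",): "none",
    ("update",): "moderate",
    ("delete",): "high",
    ("delete", "create"): "high",   # replace (destroy-before-create)
    ("create", "delete"): "high",   # replace (create-before-destroy)
    ("no-op",): "none",
}

def classify_changes(plan_json: str):
    """Return (address, risk) for every resource change in the plan."""
    plan = json.loads(plan_json)
    return [
        (rc["address"],
         RISK_BY_ACTIONS.get(tuple(rc["change"]["actions"]), "unknown"))
        for rc in plan.get("resource_changes", [])
    ]
```

Unknown action combinations fall through to `"unknown"` rather than a low risk level, so anything unexpected still gets flagged.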
Interpreting Replace Operations#
Replace (-/+) is the most dangerous common operation. It means Terraform must destroy the old resource before creating the new one. For stateless resources (security groups, IAM roles), this is usually fine. For stateful resources, it means data loss:
Resources that lose data on replace:
- aws_db_instance (database — all data lost)
- aws_rds_cluster (database — all data lost)
- aws_ebs_volume (disk — all data lost unless snapshot exists)
- aws_elasticache_cluster (cache — all data lost)
- aws_s3_bucket (bucket — name reclaimed, objects may be lost)
- aws_efs_file_system (filesystem — all data lost)
- aws_dynamodb_table (table — all data lost unless backup exists)
- kubernetes_persistent_volume_claim (volume — data lost)
Resources safe to replace:
- aws_security_group (recreated, rules re-applied)
- aws_iam_role (recreated, policies re-attached)
- aws_launch_template (version incremented)
- aws_instance (if stateless — ephemeral compute)
- kubernetes_deployment (pods restart with new config)

When you see `-/+` on a stateful resource, stop and escalate. Tell the human which resource is being replaced, why (which attribute forced the replacement), and what data would be lost.
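The stop-and-escalate rule can be checked mechanically against the JSON plan. A sketch; the set of stateful types is taken from the list above and is deliberately not exhaustive, so extend it for the providers your configuration uses:

```python
import json

# Resource types from the list above that lose data when replaced.
STATEFUL_TYPES = {
    "aws_db_instance", "aws_rds_cluster", "aws_ebs_volume",
    "aws_elasticache_cluster", "aws_s3_bucket", "aws_efs_file_system",
    "aws_dynamodb_table", "kubernetes_persistent_volume_claim",
}

def stateful_replacements(plan_json: str):
    """Return addresses of stateful resources the plan would replace."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc["change"]["actions"])
        is_replace = {"delete", "create"} <= actions  # -/+ in either order
        if is_replace and rc["type"] in STATEFUL_TYPES:
            flagged.append(rc["address"])
    return flagged
```

A non-empty result means the agent escalates before presenting any apply option.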
Presenting Plans to Humans#
The human does not want to read 200 lines of Terraform plan output. Summarize it:
The Plan Summary Template#
```
Terraform plan summary:

Create: 2 resources
  + aws_subnet.private_c (new subnet in us-east-1c)
  + aws_route_table_association.private_c

Modify: 1 resource
  ~ aws_security_group.eks_nodes (adding ingress rule for new subnet CIDR)

Destroy: 0 resources

Risk: Low
- All changes are additive (new subnet + route table association)
- Security group change adds a rule (does not remove existing rules)
- No stateful resources affected

Estimated impact: New subnet available for EKS node scheduling in us-east-1c.
Reversible: Yes — remove the subnet resource and apply.

Approve apply? [The full plan output is available if you want to review details]
```

What the Summary Must Always Include#
- Resource counts by action type (create/modify/destroy)
- Named resources (not just counts — “aws_db_instance.main” not “1 resource”)
- Risk level with reasoning
- Any destroys or replaces highlighted explicitly
- Impact statement in plain language
- Reversibility assessment
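The counts-plus-named-resources part of the summary can be generated from the JSON plan rather than hand-written. A sketch; the output layout here is illustrative, not a fixed template:

```python
import json
from collections import Counter

def plan_summary(plan_json: str) -> str:
    """Counts by action type plus named resources, with destroys
    and replaces rendered in caps so they cannot be missed."""
    plan = json.loads(plan_json)
    lines, counts = [], Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing to report
        if "delete" in actions and "create" in actions:
            label = "REPLACE"
        elif actions == ["delete"]:
            label = "DESTROY"
        elif actions == ["update"]:
            label = "modify"
        else:
            label = "create"
        counts[label] += 1
        lines.append(f"  {label}: {rc['address']}")  # named, not just counted
    header = ", ".join(f"{k}: {n}" for k, n in sorted(counts.items()))
    return header + "\n" + "\n".join(lines)
```

Risk reasoning, impact, and reversibility still need the agent's judgment; only the mechanical inventory is automated here.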
Escalation Triggers#
Present the plan and recommend not applying when:
- Any stateful resource is being replaced
- More resources are changing than the task requires
- Security group rules are being removed (not added)
- IAM permissions are being broadened significantly
- The plan shows drift-related changes mixed with intentional changes
- Resources in a different environment or region appear in the plan
```
I am flagging this plan for review before applying.

The plan shows 3 resources changing, but the task only required adding
a subnet (1-2 resources). The additional change is:

  ~ aws_rds_cluster.main
      - engine_version: "15.4" → "15.6"

This database version change was not part of the requested work. It may
be caused by a provider update or a default value change. Applying it
would trigger a database maintenance window.

Options:
1. Apply only the subnet changes (use lifecycle ignore_changes temporarily)
2. Investigate the engine version change before applying anything
3. Apply everything if the version upgrade is desired

I recommend option 2.
```

State Lock Protocols#
When You Encounter a Lock#
```
Error: Error locking state: Error acquiring the state lock

Lock Info:
  ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
  Operation: OperationTypeApply
  Who:       runner@github-actions
  Created:   2026-02-22 14:32:01 UTC
```

Agent protocol:
- Read the lock info. Who holds it? When was it created? What operation?
- Check if the operation is still running. If "Who" is a CI runner, check the CI pipeline. If it is a user, ask the human.
- Wait before escalating. Use `-lock-timeout=5m` on your next plan to wait for the lock to release naturally.
- Never force-unlock autonomously. Force-unlocking a lock held by an active operation corrupts the state.
- Escalate to the human with the lock info and your assessment of whether the holding operation is still active.
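The wait-versus-escalate decision can be sketched as a small helper that compares the lock's age against a typical pipeline runtime. The 15-minute default is an assumption; calibrate it to your own pipelines, and note that "stale" is grounds for escalation, never for an autonomous force-unlock:

```python
from datetime import datetime, timedelta, timezone

def assess_lock(created_utc: str, now: datetime,
                typical_runtime: timedelta = timedelta(minutes=15)) -> str:
    """Decide between waiting and escalating for a held state lock.

    `created_utc` is the "Created" field from the lock info,
    e.g. "2026-02-22 14:32:01 UTC".
    """
    created = datetime.strptime(created_utc.replace(" UTC", ""),
                                "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    if now - created <= typical_runtime:
        return "wait: lock is likely held by an active run; retry with -lock-timeout"
    return "escalate: lock exceeds a typical run; confirm the holder is dead before any force-unlock"
```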
```
The state file is locked by a CI pipeline run (runner@github-actions)
that started 45 minutes ago. CI jobs in this repo typically complete
in 10-15 minutes.

This lock may be stale (from a crashed or timed-out job).

Options:
1. Wait — retry with -lock-timeout=5m in case it releases
2. Check the CI dashboard to confirm the job is dead
3. Force-unlock (only if confirmed the job is no longer running)

I will try option 1 first. If it does not release, we should check CI
before force-unlocking.
```

After a Failed Apply#
If terraform apply fails partway through (some resources created, some not):
- Do not panic. Terraform state reflects what actually happened — resources that were created are in state, resources that failed are not.
- Run `terraform plan`. It will show the remaining changes needed to reach the desired state.
- Investigate the failure. Read the error message. Common causes: rate limits, permission denied, resource conflicts.
- Fix the cause and re-plan. Do not blindly re-apply. The fix might change the plan.
- Report to human what was partially applied, what failed, and the recovery plan.
Drift Investigation#
When terraform plan shows unexpected changes — resources being modified that you did not change in code — this is drift. Someone or something changed the infrastructure outside of Terraform.
Agent Drift Protocol#
```
1. Run: terraform plan -refresh-only
   → This shows what changed in reality vs what state recorded

2. Identify the drift:
   - Which resources drifted?
   - Which attributes changed?
   - Is the drift in tags only (low risk) or in configuration (higher risk)?

3. Investigate:
   - Check CloudTrail/audit logs for who made the change
   - Check if the change was intentional (maintenance, hotfix, console click)

4. Present options to human:
   a. Accept the drift (update Terraform code to match reality)
   b. Revert the drift (apply to restore Terraform's desired state)
   c. Investigate further before deciding

5. Never auto-revert drift. The real-world state may be correct
   (e.g., a security patch applied manually during an incident).
```

Common Drift Sources#
| Source | Example | Typical Resolution |
|---|---|---|
| Console clicks | Someone changed an instance type in the AWS console | Update code to match or revert via apply |
| Auto-scaling | An ASG scaled up, changing desired_capacity | Add ignore_changes = [desired_capacity] to lifecycle |
| AWS managed updates | RDS minor version auto-upgrade | Update code to match the new version |
| Another Terraform workspace | A shared resource modified by a different team’s Terraform | Coordinate with the other team |
| Security incident response | Manual security group changes during an incident | Update code to match post-incident state |
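Step 2 of the drift protocol, separating tags-only drift from configuration drift, can be automated from the `resource_drift` section that `terraform show -json` emits for a refresh-only plan. A sketch assuming that documented JSON format; nested attribute diffs are compared at the top level only:

```python
import json

def classify_drift(refresh_plan_json: str):
    """Split drifted resources into tags-only (low risk) and
    configuration (higher risk) buckets."""
    tag_only, config = [], []
    plan = json.loads(refresh_plan_json)
    for rd in plan.get("resource_drift", []):
        before = rd["change"].get("before") or {}
        after = rd["change"].get("after") or {}
        # Top-level attributes whose recorded and real values differ.
        changed = {k for k in set(before) | set(after)
                   if before.get(k) != after.get(k)}
        bucket = tag_only if changed <= {"tags", "tags_all"} else config
        bucket.append(rd["address"])
    return {"tag_only": tag_only, "config": config}
```

Anything in the `config` bucket goes to the human with the accept/revert/investigate options; nothing is auto-reverted.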
Operations That Require Special Care#
Importing Existing Resources#
```
# Safe pattern: declare the import, review the plan
import {
  to = aws_vpc.existing
  id = "vpc-0abc123def456"
}

# Then run: terraform plan
# Review the plan — it should show the resource being imported with no changes
# If changes appear, the resource block does not match reality — fix the code first
```

Never run `terraform import` from the CLI in automation. Use declarative import blocks so the intent is visible in code and reviewable in a PR.
Destroying Individual Resources#
If a human asks you to remove a specific resource:
- Remove the resource block from the code
- Run `terraform plan` — it should show exactly 1 resource being destroyed
- If the plan shows cascading destroys (dependent resources also being destroyed), stop and flag it
- Present the full destroy chain to the human before applying
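The cascading-destroy check can be made against the JSON plan: collect every address with a `delete` action and compare the count to the single destroy the task expects. A sketch:

```python
import json

def destroy_addresses(plan_json: str):
    """Every resource the plan would destroy, including the delete
    half of replacements. More than one address for a single-resource
    removal task means a cascading destroy: stop and flag it."""
    plan = json.loads(plan_json)
    return [rc["address"] for rc in plan.get("resource_changes", [])
            if "delete" in rc["change"]["actions"]]
```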
Working with Sensitive Resources#
For resources that contain secrets or sensitive data (RDS, Secrets Manager, KMS):
- Never log or display attribute values marked as `sensitive`
- When summarizing plan output, redact sensitive fields: "RDS password: (sensitive, not displayed)"
- If a plan forces replacement of a secret-containing resource, flag it — the new resource will have a new secret value that downstream consumers need updating
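Redaction can key off the `after_sensitive` map in the JSON plan, which mirrors the planned values with sensitivity flags. A simplified sketch that handles top-level attributes only (real plans can mark nested fields sensitive):

```python
def redacted_after(resource_change: dict) -> dict:
    """Render a resource's planned values with sensitive attributes
    masked, so the summary never leaks a secret."""
    after = resource_change["change"].get("after") or {}
    sensitive = resource_change["change"].get("after_sensitive") or {}
    return {k: "(sensitive, not displayed)" if sensitive.get(k) else v
            for k, v in after.items()}
```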
The Checklist#
Before every terraform apply, verify:
- `terraform plan -out=tfplan` ran successfully
- Plan output reviewed for destroys and replaces
- No stateful resources being replaced without explicit acknowledgment
- No unexpected resources changing (drift or dependency side effects)
- Plan summary presented to human
- Human explicitly approved the apply
- No state lock conflicts
- The plan file being applied is the same one that was reviewed (not re-planned)
After every terraform apply:
- Apply completed without errors
- `terraform plan` shows "No changes" (state is consistent)
- Human notified of completion with summary of what was created/modified