Terraform Safety for Agents#
Terraform is the most dangerous tool most agents have access to. A single terraform apply can create, modify, or destroy real infrastructure — databases with production data, networking that carries live traffic, security groups that protect running services. There is no undo button. terraform destroy is not an undo — it is a different destructive action.
This article defines the safety protocols agents must follow when working with Terraform: what to check before every plan, how to read plan output for danger, how to present plans to humans, when to apply vs when to stop, and how to handle state conflicts.
The Cardinal Rules#
- Never run `terraform apply` without human approval. Not even for "safe" changes. The human reviews the plan output and explicitly approves.
- Never run `terraform apply` without a saved plan file. Always `plan -out=tfplan`, then `apply tfplan`. Never `apply` with no arguments — it recomputes the plan and might produce different results.
- Never run `terraform destroy` autonomously. Destruction requires explicit human instruction.
- Never force-unlock a state file without investigating who holds the lock.
- Never use `-target` in automation. It is a debugging tool for humans, not a workflow shortcut.
- Always run `plan` before any `apply`. Read the plan. Understand the plan. Present the plan. Then wait.
The Safe Workflow#
```
Agent receives task: "Add a new subnet to the VPC"

1. Read the current Terraform files (understand what exists)
2. Read the state file listing (terraform state list — understand what is managed)
3. Write the change (add the subnet resource block)
4. Run terraform plan -out=tfplan
5. Read the plan output carefully
6. Classify the risk level
7. Present the plan summary to the human
8. WAIT for approval
9. On approval: terraform apply tfplan
10. Verify: terraform plan (should show "No changes")
11. Report result to human
```

Steps 4-8 are the safety gate. The agent does not skip from step 3 to step 9.
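The gate in steps 4-8 can be enforced mechanically. A minimal sketch in Python, where `run` and `ask_human` are hypothetical injection points (your shell runner and your approval channel, not real APIs); the Terraform commands themselves are the ones from the workflow above:

```python
def gated_apply(run, ask_human):
    """Plan, present, wait for approval, then apply the reviewed plan file.

    `run(cmd)` executes a shell command and returns its output (hypothetical).
    `ask_human(summary)` blocks until a human approves or rejects (hypothetical).
    """
    run("terraform plan -out=tfplan")                 # step 4: save the plan
    summary = run("terraform show -no-color tfplan")  # steps 5-7: read and present
    if not ask_human(summary):                        # step 8: WAIT for approval
        return "aborted: human did not approve"
    run("terraform apply tfplan")                     # step 9: apply the reviewed file
    # Step 10: a clean follow-up plan confirms state matches the configuration.
    if "No changes" not in run("terraform plan -no-color"):
        return "warning: post-apply plan is not clean"
    return "applied"
```

Note that the apply uses the saved `tfplan` file, never a bare `apply`, so what runs is exactly what the human reviewed.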
Reading Plan Output for Danger#
Terraform plan output uses symbols to indicate what will happen:
```
+    create               (new resource, lowest risk)
~    update in-place      (modify existing, moderate risk)
-/+  destroy and recreate (replaces resource, HIGH risk for stateful resources)
-    destroy              (deletes resource, HIGH risk)
<=   read                 (data source refresh, no risk)
```

Danger Signals in Plan Output#
| Signal | What It Means | Risk Level | Agent Action |
|---|---|---|---|
| Any `-` (destroy) | A resource will be deleted | High | Always flag to human. Never auto-approve. |
| Any `-/+` (replace) | A resource will be destroyed and recreated | High | Flag to human. Stateful resources (databases, volumes) lose data on replace. |
| `forces replacement` in change detail | An attribute change requires destroying and recreating | High | Identify which attribute forced the replacement and flag it. |
| Changes to `aws_security_group` rules | Network access control is changing | Moderate | Summarize what ports/CIDRs are being added or removed. |
| Changes to IAM policies or roles | Permissions are changing | Moderate-High | Summarize what permissions are being granted or revoked. |
| `~` tags only | Only tags are changing | Low | Mention but do not flag as dangerous. |
| More resources changing than expected | The change should affect 1 resource but 5 are changing | Moderate | Investigate why. Possible state drift or unexpected dependency. |
| `(known after apply)` on critical attributes | Terraform cannot predict the value until apply | Low (usually normal) | Note it but do not flag unless the attribute is security-sensitive. |
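Rather than scraping the human-readable output, an agent can classify these signals from the machine-readable plan. A sketch assuming the documented JSON plan format produced by `terraform show -json tfplan`, where each `resource_changes` entry carries an `actions` list:

```python
import json

# Map Terraform's "actions" list to the risk levels described above.
# Keys follow the JSON plan format: a replace appears as a two-element
# list, ordered by whether it is destroy-before-create or the reverse.
RISK_BY_ACTIONS = {
    ("create",): "low",
    ("read",): "none",
    ("update",): "moderate",
    ("delete",): "high",
    ("delete", "create"): "high",   # replace (destroy-before-create)
    ("create", "delete"): "high",   # replace (create-before-destroy)
    ("no-op",): "none",
}

def classify_changes(plan_json: str):
    """Return (address, risk) for every resource change in the plan."""
    plan = json.loads(plan_json)
    return [
        (rc["address"],
         RISK_BY_ACTIONS.get(tuple(rc["change"]["actions"]), "unknown"))
        for rc in plan.get("resource_changes", [])
    ]
```

Unknown action combinations fall through to `"unknown"` rather than a low risk level, so anything unexpected still gets flagged.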
Interpreting Replace Operations#
Replace (-/+) is the most dangerous common operation. It means Terraform must destroy the old resource before creating the new one. For stateless resources (security groups, IAM roles), this is usually fine. For stateful resources, it means data loss:
Resources that lose data on replace:
- aws_db_instance (database — all data lost)
- aws_rds_cluster (database — all data lost)
- aws_ebs_volume (disk — all data lost unless snapshot exists)
- aws_elasticache_cluster (cache — all data lost)
- aws_s3_bucket (bucket — name reclaimed, objects may be lost)
- aws_efs_file_system (filesystem — all data lost)
- aws_dynamodb_table (table — all data lost unless backup exists)
- kubernetes_persistent_volume_claim (volume — data lost)
Resources safe to replace:
- aws_security_group (recreated, rules re-applied)
- aws_iam_role (recreated, policies re-attached)
- aws_launch_template (version incremented)
- aws_instance (if stateless — ephemeral compute)
- kubernetes_deployment (pods restart with new config)

When you see `-/+` on a stateful resource, stop and escalate. Tell the human which resource is being replaced, why (which attribute forced the replacement), and what data would be lost.
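The stop-and-escalate rule can be checked mechanically against the JSON plan. A sketch; the set of stateful types is taken from the list above and is deliberately not exhaustive, so extend it for the providers your configuration uses:

```python
import json

# Resource types from the list above that lose data when replaced.
STATEFUL_TYPES = {
    "aws_db_instance", "aws_rds_cluster", "aws_ebs_volume",
    "aws_elasticache_cluster", "aws_s3_bucket", "aws_efs_file_system",
    "aws_dynamodb_table", "kubernetes_persistent_volume_claim",
}

def stateful_replacements(plan_json: str):
    """Return addresses of stateful resources the plan would replace."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc["change"]["actions"])
        is_replace = {"delete", "create"} <= actions  # -/+ in either order
        if is_replace and rc["type"] in STATEFUL_TYPES:
            flagged.append(rc["address"])
    return flagged
```

A non-empty result means the agent escalates before presenting any apply option.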
Presenting Plans to Humans#
The human does not want to read 200 lines of Terraform plan output. Summarize it:
The Plan Summary Template#
```
Terraform plan summary:

Create: 2 resources
  + aws_subnet.private_c (new subnet in us-east-1c)
  + aws_route_table_association.private_c

Modify: 1 resource
  ~ aws_security_group.eks_nodes (adding ingress rule for new subnet CIDR)

Destroy: 0 resources

Risk: Low
- All changes are additive (new subnet + route table association)
- Security group change adds a rule (does not remove existing rules)
- No stateful resources affected

Estimated impact: New subnet available for EKS node scheduling in us-east-1c.
Reversible: Yes — remove the subnet resource and apply.

Approve apply? [The full plan output is available if you want to review details]
```

What the Summary Must Always Include#
- Resource counts by action type (create/modify/destroy)
- Named resources (not just counts — “aws_db_instance.main” not “1 resource”)
- Risk level with reasoning
- Any destroys or replaces highlighted explicitly
- Impact statement in plain language
- Reversibility assessment
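The counts-plus-named-resources part of the summary can be generated from the JSON plan rather than hand-written. A sketch; the output layout here is illustrative, not a fixed template:

```python
import json
from collections import Counter

def plan_summary(plan_json: str) -> str:
    """Counts by action type plus named resources, with destroys
    and replaces rendered in caps so they cannot be missed."""
    plan = json.loads(plan_json)
    lines, counts = [], Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing to report
        if "delete" in actions and "create" in actions:
            label = "REPLACE"
        elif actions == ["delete"]:
            label = "DESTROY"
        elif actions == ["update"]:
            label = "modify"
        else:
            label = "create"
        counts[label] += 1
        lines.append(f"  {label}: {rc['address']}")  # named, not just counted
    header = ", ".join(f"{k}: {n}" for k, n in sorted(counts.items()))
    return header + "\n" + "\n".join(lines)
```

Risk reasoning, impact, and reversibility still need the agent's judgment; only the mechanical inventory is automated here.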
Escalation Triggers#
Present the plan and recommend not applying when:
- Any stateful resource is being replaced
- More resources are changing than the task requires
- Security group rules are being removed (not added)
- IAM permissions are being broadened significantly
- The plan shows drift-related changes mixed with intentional changes
- Resources in a different environment or region appear in the plan
```
I am flagging this plan for review before applying.

The plan shows 3 resources changing, but the task only required adding
a subnet (1-2 resources). The additional change is:

  ~ aws_rds_cluster.main
      - engine_version: "15.4" → "15.6"

This database version change was not part of the requested work. It may
be caused by a provider update or a default value change. Applying it
would trigger a database maintenance window.

Options:
1. Apply only the subnet changes (use lifecycle ignore_changes temporarily)
2. Investigate the engine version change before applying anything
3. Apply everything if the version upgrade is desired

I recommend option 2.
```

State Lock Protocols#
When You Encounter a Lock#
```
Error: Error locking state: Error acquiring the state lock

Lock Info:
  ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
  Operation: OperationTypeApply
  Who:       runner@github-actions
  Created:   2026-02-22 14:32:01 UTC
```

Agent protocol:
- Read the lock info. Who holds it? When was it created? What operation?
- Check if the operation is still running. If "Who" is a CI runner, check the CI pipeline. If it is a user, ask the human.
- Wait before escalating. Use `-lock-timeout=5m` on your next plan to wait for the lock to release naturally.
- Never force-unlock autonomously. Force-unlocking a lock held by an active operation corrupts the state.
- Escalate to the human with the lock info and your assessment of whether the holding operation is still active.
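The wait-versus-escalate decision can be sketched as a small helper that compares the lock's age against a typical pipeline runtime. The 15-minute default is an assumption; calibrate it to your own pipelines, and note that "stale" is grounds for escalation, never for an autonomous force-unlock:

```python
from datetime import datetime, timedelta, timezone

def assess_lock(created_utc: str, now: datetime,
                typical_runtime: timedelta = timedelta(minutes=15)) -> str:
    """Decide between waiting and escalating for a held state lock.

    `created_utc` is the "Created" field from the lock info,
    e.g. "2026-02-22 14:32:01 UTC".
    """
    created = datetime.strptime(created_utc.replace(" UTC", ""),
                                "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    if now - created <= typical_runtime:
        return "wait: lock is likely held by an active run; retry with -lock-timeout"
    return "escalate: lock exceeds a typical run; confirm the holder is dead before any force-unlock"
```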
```
The state file is locked by a CI pipeline run (runner@github-actions)
that started 45 minutes ago. CI jobs in this repo typically complete
in 10-15 minutes.

This lock may be stale (from a crashed or timed-out job).

Options:
1. Wait — retry with -lock-timeout=5m in case it releases
2. Check the CI dashboard to confirm the job is dead
3. Force-unlock (only if confirmed the job is no longer running)

I will try option 1 first. If it does not release, we should check CI
before force-unlocking.
```

After a Failed Apply#
If terraform apply fails partway through (some resources created, some not):
- Do not panic. Terraform state reflects what actually happened — resources that were created are in state, resources that failed are not.
- Run `terraform plan`. It will show the remaining changes needed to reach the desired state.
- Investigate the failure. Read the error message. Common causes: rate limits, permission denied, resource conflicts.
- Fix the cause and re-plan. Do not blindly re-apply. The fix might change the plan.
- Report to human what was partially applied, what failed, and the recovery plan.
Drift Investigation#
When terraform plan shows unexpected changes — resources being modified that you did not change in code — this is drift. Someone or something changed the infrastructure outside of Terraform.
Agent Drift Protocol#
```
1. Run: terraform plan -refresh-only
   → This shows what changed in reality vs what state recorded

2. Identify the drift:
   - Which resources drifted?
   - Which attributes changed?
   - Is the drift in tags only (low risk) or in configuration (higher risk)?

3. Investigate:
   - Check CloudTrail/audit logs for who made the change
   - Check if the change was intentional (maintenance, hotfix, console click)

4. Present options to human:
   a. Accept the drift (update Terraform code to match reality)
   b. Revert the drift (apply to restore Terraform's desired state)
   c. Investigate further before deciding

5. Never auto-revert drift. The real-world state may be correct
   (e.g., a security patch applied manually during an incident).
```

Common Drift Sources#
| Source | Example | Typical Resolution |
|---|---|---|
| Console clicks | Someone changed an instance type in the AWS console | Update code to match or revert via apply |
| Auto-scaling | An ASG scaled up, changing desired_capacity | Add ignore_changes = [desired_capacity] to lifecycle |
| AWS managed updates | RDS minor version auto-upgrade | Update code to match the new version |
| Another Terraform workspace | A shared resource modified by a different team’s Terraform | Coordinate with the other team |
| Security incident response | Manual security group changes during an incident | Update code to match post-incident state |
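Step 2 of the drift protocol, separating tags-only drift from configuration drift, can be automated from the `resource_drift` section that `terraform show -json` emits for a refresh-only plan. A sketch assuming that documented JSON format; nested attribute diffs are compared at the top level only:

```python
import json

def classify_drift(refresh_plan_json: str):
    """Split drifted resources into tags-only (low risk) and
    configuration (higher risk) buckets."""
    tag_only, config = [], []
    plan = json.loads(refresh_plan_json)
    for rd in plan.get("resource_drift", []):
        before = rd["change"].get("before") or {}
        after = rd["change"].get("after") or {}
        # Top-level attributes whose recorded and real values differ.
        changed = {k for k in set(before) | set(after)
                   if before.get(k) != after.get(k)}
        bucket = tag_only if changed <= {"tags", "tags_all"} else config
        bucket.append(rd["address"])
    return {"tag_only": tag_only, "config": config}
```

Anything in the `config` bucket goes to the human with the accept/revert/investigate options; nothing is auto-reverted.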
Operations That Require Special Care#
Importing Existing Resources#
```
# Safe pattern: declare the import, review the plan
import {
  to = aws_vpc.existing
  id = "vpc-0abc123def456"
}

# Then run: terraform plan
# Review the plan — it should show the resource being imported with no changes
# If changes appear, the resource block does not match reality — fix the code first
```

Never run `terraform import` from the CLI in automation. Use declarative import blocks so the intent is visible in code and reviewable in a PR.
Destroying Individual Resources#
If a human asks you to remove a specific resource:
- Remove the resource block from the code
- Run `terraform plan` — it should show exactly 1 resource being destroyed
- If the plan shows cascading destroys (dependent resources also being destroyed), stop and flag it
- Present the full destroy chain to the human before applying
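The cascading-destroy check can be made against the JSON plan: collect every address with a `delete` action and compare the count to the single destroy the task expects. A sketch:

```python
import json

def destroy_addresses(plan_json: str):
    """Every resource the plan would destroy, including the delete
    half of replacements. More than one address for a single-resource
    removal task means a cascading destroy: stop and flag it."""
    plan = json.loads(plan_json)
    return [rc["address"] for rc in plan.get("resource_changes", [])
            if "delete" in rc["change"]["actions"]]
```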
Working with Sensitive Resources#
For resources that contain secrets or sensitive data (RDS, Secrets Manager, KMS):
- Never log or display attribute values marked as `sensitive`
- When summarizing plan output, redact sensitive fields: "RDS password: (sensitive, not displayed)"
- If a plan forces replacement of a secret-containing resource, flag it — the new resource will have a new secret value that downstream consumers need updating
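Redaction can key off the `after_sensitive` map in the JSON plan, which mirrors the planned values with sensitivity flags. A simplified sketch that handles top-level attributes only (real plans can mark nested fields sensitive):

```python
def redacted_after(resource_change: dict) -> dict:
    """Render a resource's planned values with sensitive attributes
    masked, so the summary never leaks a secret."""
    after = resource_change["change"].get("after") or {}
    sensitive = resource_change["change"].get("after_sensitive") or {}
    return {k: "(sensitive, not displayed)" if sensitive.get(k) else v
            for k, v in after.items()}
```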
The Checklist#
Before every terraform apply, verify:
- `terraform plan -out=tfplan` ran successfully
- Plan output reviewed for destroys and replaces
- No stateful resources being replaced without explicit acknowledgment
- No unexpected resources changing (drift or dependency side effects)
- Plan summary presented to human
- Human explicitly approved the apply
- No state lock conflicts
- The plan file being applied is the same one that was reviewed (not re-planned)
After every terraform apply:
- Apply completed without errors
- `terraform plan` shows "No changes" (state is consistent)
- Human notified of completion with summary of what was created/modified