Refactoring Terraform#
Terraform configurations grow organically. A project starts with 10 resources in one directory. Six months later it has 80 resources, 3 levels of modules, and a state file that takes 2 minutes to plan. Changes feel risky because everything is interconnected. New team members (or agents) cannot understand the structure without reading every file.
Refactoring addresses this — but Terraform refactoring is harder than code refactoring because the state file maps resource addresses to real infrastructure. Rename a resource and Terraform thinks you want to destroy the old one and create a new one. Move a resource into a module and Terraform plans to recreate it. Every structural change requires corresponding state manipulation.
When to Refactor#
Signals That Refactoring Is Needed#
| Signal | What It Means | Severity |
|---|---|---|
| `terraform plan` takes > 60 seconds | State file is too large; refreshing all resources is slow | Moderate |
| `terraform state list` shows > 50 resources | Single state file covers too much; blast radius is everything | High |
| Module nesting is 3+ levels deep | Agent/human context cost for understanding is too high | Moderate |
| Two teams need to modify the same directory | State lock conflicts block parallel work | High |
| A change to networking requires re-planning the database | Unrelated concerns share state, creating coupling | High |
| Adding a new environment means duplicating 500 lines | No reusable structure; environments diverge over time | Moderate |
| `variables.tf` has 40+ variables | Module interface is too broad; doing too many things | Moderate |
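The resource-count signal is easy to check mechanically. A minimal sketch, where the `state_list` function is a stand-in for a real `terraform state list` call so the example runs without Terraform installed:

```shell
# Stub simulating `terraform state list` output; replace with the real
# command when running inside an initialized root module.
state_list() {
  printf 'aws_vpc.main\naws_subnet.private_a\naws_subnet.private_b\n'
}

# Count resource addresses in state (tr strips padding some wc builds emit).
count=$(state_list | wc -l | tr -d ' ')
echo "resources: $count"
if [ "$count" -gt 50 ]; then
  echo "consider splitting this state file"
fi
```

Running the real command inside a root module gives the same shape of output, one resource address per line.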
When NOT to Refactor#
- The configuration is small (< 30 resources) and stable — refactoring adds complexity without benefit
- You are about to make a time-sensitive change — refactor after, not during
- The only complaint is “it is not DRY” — DRY is not a goal in infrastructure code, maintainability is
- You are the only person working on it and the structure works for you
Strategy 1: State Decomposition (Splitting a Monolith)#
The most impactful refactoring: splitting one state file into multiple independent root modules.
Before#
infrastructure/
├── main.tf # VPC, subnets, EKS, RDS, S3, IAM — everything
├── variables.tf
├── outputs.tf
└── backend.tf # key = "infrastructure/terraform.tfstate"

State: 80 resources in one file. One lock. One blast radius.
After#
infrastructure/
├── networking/ # VPC, subnets, routes, NAT, IGW — 15 resources
├── database/ # RDS, subnet group, security group — 10 resources
├── compute/ # EKS, node groups, IRSA — 20 resources
└── application/ # Helm releases, K8s resources — 35 resources

Four state files. Four locks. Four independent blast radii.
The Decomposition Procedure#
Step 1: Plan the split. Draw dependency boundaries:
networking (no dependencies)
↓
database (needs: subnet_ids, vpc_id from networking)
compute (needs: subnet_ids, vpc_id from networking)
↓
application (needs: cluster_endpoint from compute, db_endpoint from database)

Resources that reference each other must be in the same module or connected via terraform_remote_state.
Step 2: Create the new root module structure. For each new root module, create the directory with providers.tf, backend.tf, variables.tf, and outputs.tf.
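Each new root module's scaffold is small. A sketch of what `networking/backend.tf` and `networking/providers.tf` might contain (bucket, region, and version values are placeholders for your own):

```hcl
# networking/backend.tf — each root module gets its own state key
terraform {
  backend "s3" {
    bucket = "myorg-tfstate"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# networking/providers.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}
```

The distinct `key` per directory is what makes the blast radii independent.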
Step 3: Move resources one module at a time. Start with the module that has no dependencies (networking):
# 1. Move each resource's state entry into the new state file.
#    Use the two-state form of `terraform state mv`, run from the OLD root module.
#    (The -state/-state-out flags work on local state files; with a remote
#    backend, `terraform state pull` to local files first.)
cd infrastructure/
terraform state mv -state=terraform.tfstate -state-out=../networking/terraform.tfstate \
  aws_vpc.main aws_vpc.main
# Repeat for all networking resources
# 2. Move the corresponding .tf code to the new directory
# 3. In the new directory, run terraform plan
cd ../networking/
terraform init
terraform plan
# Should show: No changes (state matches code)

Step 4: Add cross-state data sources. In the database module:
# database/data.tf
data "terraform_remote_state" "networking" {
backend = "s3"
config = {
bucket = "myorg-tfstate"
key = "networking/terraform.tfstate"
region = "us-east-1"
}
}

Replace direct resource references with remote state references:
# Before: aws_vpc.main.id
# After: data.terraform_remote_state.networking.outputs.vpc_id

Step 5: Verify each module independently. Run terraform plan in each new root module. All should show “No changes.”
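The remote state lookup only resolves values the networking module actually exports. A sketch of the outputs it would need to publish (output and resource names assumed to match the references above):

```hcl
# networking/outputs.tf — values consumed via terraform_remote_state
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = [aws_subnet.private_a.id, aws_subnet.private_b.id]
}
```

If an output is missing, the dependent module's plan fails with an "unsupported attribute" error rather than silently using a stale value.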
Step 6: Remove the old monolith. Once all resources have been moved out and verified, the old root module is empty. Delete it.
Safety Rules for State Decomposition#
- Always back up state before moving: `terraform state pull > backup-$(date +%Y%m%d).tfstate`
- Move one concern at a time. Complete networking before starting database.
- Verify after each move. `terraform plan` should show zero changes.
- Do not mix moves with code changes. The refactoring PR should have zero infrastructure changes — only structural reorganization.
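The back-up-before-moving rule can be wrapped in a few lines of shell. A sketch in which the `terraform` function is a stub so the example runs anywhere; delete it to run against a real root module:

```shell
set -eu

# Stub standing in for the real CLI; remove this line for real use.
terraform() { printf '{"version": 4, "resources": []}\n'; }

backup="backup-$(date +%Y%m%d-%H%M%S).tfstate"
terraform state pull > "$backup"

# Refuse to continue if the backup file came out empty.
if [ ! -s "$backup" ]; then
  echo "backup failed, aborting" >&2
  exit 1
fi
echo "state backed up to $backup"
```

Keeping the backup timestamped means repeated moves in one session never overwrite an earlier snapshot.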
Strategy 2: Module Extraction#
Converting inline resources into a reusable module — without destroying and recreating them.
Using moved Blocks (Terraform 1.1+)#
# Before: resources defined inline
resource "aws_vpc" "main" { ... }
resource "aws_subnet" "private_a" { ... }
resource "aws_subnet" "private_b" { ... }
# After: resources moved into a module
module "networking" {
source = "./modules/networking"
# ... variables ...
}
# Tell Terraform these are the same resources
moved {
from = aws_vpc.main
to = module.networking.aws_vpc.main
}
moved {
from = aws_subnet.private_a
to = module.networking.aws_subnet.private_a
}
moved {
from = aws_subnet.private_b
to = module.networking.aws_subnet.private_b
}

Run terraform plan — it should show moves, not creates/destroys:
# aws_vpc.main has moved to module.networking.aws_vpc.main
resource "aws_vpc" "main" {
id = "vpc-0abc123"
# (no changes)
}

After a successful apply, remove the moved blocks. Keep them for one release cycle if multiple environments apply separately.
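moved blocks also cover plain renames within the same module and index changes, such as migrating a counted resource to for_each (the addresses here are assumed examples):

```hcl
# Rename a resource in place
moved {
  from = aws_security_group.db
  to   = aws_security_group.database
}

# count index -> for_each key
moved {
  from = aws_subnet.private[0]
  to   = aws_subnet.private["a"]
}
```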
When moved Blocks Cannot Help#
moved blocks do not work across state files. If you are extracting resources into a different root module (state decomposition), use terraform state mv instead.
Strategy 3: Workspace to Directory Migration#
Moving from workspaces (same code, different state) to directories (different code per environment).
Why Migrate#
Workspaces assume all environments have the same structure. When production needs a larger database or staging needs a debugging sidecar, you end up with:
resource "aws_db_instance" "main" {
instance_class = terraform.workspace == "prod" ? "db.r5.xlarge" : "db.t3.micro"
multi_az = terraform.workspace == "prod" ? true : false
# ... more ternaries for every difference
}

Directories allow genuine structural differences between environments without conditional gymnastics.
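With directories, each environment states its own values directly instead of branching on terraform.workspace. A sketch of the same database per environment (values illustrative; other required arguments such as engine omitted for brevity):

```hcl
# envs/prod/main.tf
resource "aws_db_instance" "main" {
  instance_class = "db.r5.xlarge"
  multi_az       = true
}

# envs/staging/main.tf
resource "aws_db_instance" "main" {
  instance_class = "db.t3.micro"
  multi_az       = false
}
```

Structural differences (a debugging sidecar that only exists in staging, say) become ordinary resources in one directory rather than count ternaries in shared code.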
Migration Procedure#
# 1. Export each workspace's state
terraform workspace select staging
terraform state pull > staging.tfstate
terraform workspace select prod
terraform state pull > prod.tfstate
# 2. Create directory structure
mkdir -p envs/staging envs/prod
# 3. Copy code to each directory, adjust backend keys
# envs/staging/backend.tf: key = "staging/terraform.tfstate"
# envs/prod/backend.tf: key = "prod/terraform.tfstate"
# 4. Push state to new backends
cd envs/staging
terraform init
terraform state push ../../staging.tfstate
terraform plan # should show No changes
cd ../prod
terraform init
terraform state push ../../prod.tfstate
terraform plan # should show No changes
# 5. Delete old workspaces (after verifying both environments work)

Strategy 4: Provider Version Upgrades#
Major provider version upgrades (e.g., AWS provider 4.x → 5.x) can introduce breaking changes.
Safe Upgrade Procedure#
# 1. Read the upgrade guide (always published for major versions)
# AWS 5.0: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/version-5-upgrade
# 2. Update the version constraint
# version = "~> 4.0" → version = "~> 5.0"
# 3. Run terraform init -upgrade
# 4. Run terraform plan
# The plan will show changes caused by the upgrade (renamed arguments,
# changed defaults, deprecated resources)
# 5. Fix each issue the plan reveals
# - Rename deprecated arguments
# - Update resource types that were split or merged
# - Adjust for changed default values
# 6. Repeat plan/fix until plan shows no unexpected changes
# 7. Apply with human approval

Agent Protocol for Upgrades#
- Read the provider changelog and upgrade guide
- Make the version change and run `terraform init -upgrade`
- Run `terraform plan` and categorize every change:
  - Expected (documented in the upgrade guide) → fix the code
  - Unexpected (not in the upgrade guide) → investigate before proceeding
- Present the full list of changes to the human with classification
- Apply only after all changes are understood and approved
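The constraint bump from step 2 lives in the required_providers block. A sketch:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # was "~> 4.0"; run `terraform init -upgrade` after changing
    }
  }
}
```

The `~>` pessimistic constraint still pins the major version, so future 6.x releases will not be picked up accidentally.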
Refactoring Checklist#
Before starting any refactoring:
- State backed up (`terraform state pull > backup.tfstate`)
- Current plan is clean (`terraform plan` shows “No changes” before starting)
- No pending PRs that modify the same Terraform code
- Refactoring PR contains ONLY structural changes (no infrastructure modifications)
- Each move verified with `terraform plan` showing zero changes
- Cross-state references tested (`terraform plan` in dependent modules passes)
- Documentation updated (CLAUDE.md, README, or architecture docs)