Refactoring Terraform#

Terraform configurations grow organically. A project starts with 10 resources in one directory. Six months later it has 80 resources, 3 levels of modules, and a state file that takes 2 minutes to plan. Changes feel risky because everything is interconnected. New team members (or agents) cannot understand the structure without reading every file.

Refactoring addresses this — but Terraform refactoring is harder than code refactoring because the state file maps resource addresses to real infrastructure. Rename a resource and Terraform thinks you want to destroy the old one and create a new one. Move a resource into a module and Terraform plans to recreate it. Every structural change requires corresponding state manipulation.
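For example, a pure rename looks like a replacement to Terraform unless you record the move. A minimal sketch (the bucket resource name is illustrative):

# Renaming aws_s3_bucket.logs to aws_s3_bucket.app_logs in code alone makes
# terraform plan propose one destroy and one create. A moved block
# (Terraform 1.1+, covered in Strategy 2) tells Terraform both addresses
# refer to the same real object:
moved {
  from = aws_s3_bucket.logs
  to   = aws_s3_bucket.app_logs
}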

When to Refactor#

Signals That Refactoring Is Needed#

| Signal | What It Means | Severity |
| --- | --- | --- |
| terraform plan takes > 60 seconds | State file is too large; refreshing all resources is slow | Moderate |
| terraform state list shows > 50 resources | Single state file covers too much; blast radius is everything | High |
| Module nesting is 3+ levels deep | Agent/human context cost for understanding is too high | Moderate |
| Two teams need to modify the same directory | State lock conflicts block parallel work | High |
| A change to networking requires re-planning the database | Unrelated concerns share state, creating coupling | High |
| Adding a new environment means duplicating 500 lines | No reusable structure; environments diverge over time | Moderate |
| variables.tf has 40+ variables | Module interface is too broad; the module does too many things | Moderate |
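The first two signals are easy to measure directly (a quick sketch):

time terraform plan              # how long does a full plan take?
terraform state list | wc -l     # how many resources live in this state?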

When NOT to Refactor#

  • The configuration is small (< 30 resources) and stable — refactoring adds complexity without benefit
  • You are about to make a time-sensitive change — refactor after, not during
  • The only complaint is “it is not DRY” — DRY is not a goal in infrastructure code; maintainability is
  • You are the only person working on it and the structure works for you

Strategy 1: State Decomposition (Splitting a Monolith)#

The most impactful refactoring: splitting one state file into multiple independent root modules.

Before#

infrastructure/
├── main.tf           # VPC, subnets, EKS, RDS, S3, IAM — everything
├── variables.tf
├── outputs.tf
└── backend.tf        # key = "infrastructure/terraform.tfstate"

State: 80 resources in one file. One lock. One blast radius.

After#

infrastructure/
├── networking/       # VPC, subnets, routes, NAT, IGW — 15 resources
├── database/         # RDS, subnet group, security group — 10 resources
├── compute/          # EKS, node groups, IRSA — 20 resources
└── application/      # Helm releases, K8s resources — 35 resources

Four state files. Four locks. Four independent blast radii.

The Decomposition Procedure#

Step 1: Plan the split. Draw dependency boundaries:

networking (no dependencies)
    ↓
database (needs: subnet_ids, vpc_id from networking)
compute  (needs: subnet_ids, vpc_id from networking)
    ↓
application (needs: cluster_endpoint from compute, db_endpoint from database)

Resources that reference each other must be in the same module or connected via terraform_remote_state.

Step 2: Create the new root module structure. For each new root module, create the directory with providers.tf, backend.tf, variables.tf, and outputs.tf.
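A minimal backend.tf for the networking module might look like this (bucket, key, and region mirror the remote-state example in Step 4; substitute your own backend settings):

# networking/backend.tf
terraform {
  backend "s3" {
    bucket = "myorg-tfstate"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}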

Step 3: Move resources one module at a time. Start with the module that has no dependencies (networking):

# 1. Pull the monolith's state to a local file (required with a remote backend)
cd infrastructure/
terraform state pull > monolith.tfstate

# 2. Move each networking resource into a local state file for the new root module
terraform state mv \
  -state=monolith.tfstate \
  -state-out=../networking/networking.tfstate \
  aws_vpc.main aws_vpc.main
# Repeat for all networking resources

# 3. Move the corresponding .tf code to the new directory

# 4. Push both state files back to their backends
terraform state push monolith.tfstate
cd ../networking/
terraform init
terraform state push networking.tfstate

# 5. In the new directory, verify that state matches code
terraform plan
# Should show: No changes

Step 4: Add cross-state data sources. In the database module:

# database/data.tf
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "myorg-tfstate"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

Replace direct resource references with remote state references:

# Before: aws_vpc.main.id
# After:  data.terraform_remote_state.networking.outputs.vpc_id
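This only works for values the networking module explicitly exports; remote state exposes declared outputs and nothing else. A sketch matching the references above (the subnet output names are illustrative):

# networking/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "subnet_ids" {
  value = [aws_subnet.private_a.id, aws_subnet.private_b.id]
}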

Step 5: Verify each module independently. Run terraform plan in each new root module. All should show “No changes.”

Step 6: Remove the old monolith. Once all resources have been moved out and verified, the old root module is empty. Delete it.
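A final check before deleting (sketch):

cd infrastructure/
terraform state list
# Should print nothing: every resource now lives in one of the new state files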

Safety Rules for State Decomposition#

  • Always back up state before moving: terraform state pull > backup-$(date +%Y%m%d).tfstate (see the restore sketch after this list)
  • Move one concern at a time. Complete networking before starting database.
  • Verify after each move. terraform plan should show zero changes.
  • Do not mix moves with code changes. The refactoring PR should have zero infrastructure changes — only structural reorganization.
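If a move goes wrong, the dated backup can be pushed back over the broken state (a sketch; substitute your actual backup filename):

# Restore a known-good state; -force overrides the serial/lineage check
terraform state push -force backup-YYYYMMDD.tfstate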

Strategy 2: Module Extraction#

Converting inline resources into a reusable module — without destroying and recreating them.

Using moved Blocks (Terraform 1.1+)#

# Before: resources defined inline
resource "aws_vpc" "main" { ... }
resource "aws_subnet" "private_a" { ... }
resource "aws_subnet" "private_b" { ... }

# After: resources moved into a module
module "networking" {
  source = "./modules/networking"
  # ... variables ...
}

# Tell Terraform these are the same resources
moved {
  from = aws_vpc.main
  to   = module.networking.aws_vpc.main
}

moved {
  from = aws_subnet.private_a
  to   = module.networking.aws_subnet.private_a
}

moved {
  from = aws_subnet.private_b
  to   = module.networking.aws_subnet.private_b
}

Run terraform plan — it should show moves, not creates/destroys:

  # aws_vpc.main has moved to module.networking.aws_vpc.main
    resource "aws_vpc" "main" {
        id                               = "vpc-0abc123"
        # (no changes)
    }

After a successful apply, the moved blocks can be removed; keep them for one release cycle if multiple environments apply the same code separately, so each environment records the move before the blocks disappear.

When moved Blocks Cannot Help#

moved blocks do not work across state files. If you are extracting resources into a different root module (state decomposition), use terraform state mv instead.

Strategy 3: Workspace to Directory Migration#

Moving from workspaces (same code, different state) to directories (different code per environment).

Why Migrate#

Workspaces assume all environments have the same structure. When production needs a larger database or staging needs a debugging sidecar, you end up with:

resource "aws_db_instance" "main" {
  instance_class = terraform.workspace == "prod" ? "db.r5.xlarge" : "db.t3.micro"
  multi_az       = terraform.workspace == "prod" ? true : false
  # ... more ternaries for every difference
}

Directories allow genuine structural differences between environments without conditional gymnastics.
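The same database expressed per directory; values become literal and each environment can diverge structurally (a sketch, with the remaining required arguments elided as in the example above):

# envs/prod/main.tf
resource "aws_db_instance" "main" {
  instance_class = "db.r5.xlarge"
  multi_az       = true
  # ... engine, storage, and other arguments ...
}

# envs/staging/main.tf
resource "aws_db_instance" "main" {
  instance_class = "db.t3.micro"
  multi_az       = false
  # ... same arguments, plus any staging-only resources in this directory ...
}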

Migration Procedure#

# 1. Export each workspace's state
terraform workspace select staging
terraform state pull > staging.tfstate

terraform workspace select prod
terraform state pull > prod.tfstate

# 2. Create directory structure
mkdir -p envs/staging envs/prod

# 3. Copy code to each directory, adjust backend keys
# envs/staging/backend.tf: key = "staging/terraform.tfstate"
# envs/prod/backend.tf:    key = "prod/terraform.tfstate"

# 4. Push state to new backends
cd envs/staging
terraform init
terraform state push ../../staging.tfstate
terraform plan  # should show No changes

cd ../prod
terraform init
terraform state push ../../prod.tfstate
terraform plan  # should show No changes

# 5. Delete the old workspaces (after verifying both environments work)
cd ../..
terraform workspace select default
terraform workspace delete -force staging  # -force: the workspace still holds its old state copy
terraform workspace delete -force prod

Strategy 4: Provider Version Upgrades#

Major provider version upgrades (e.g., AWS provider 4.x → 5.x) can introduce breaking changes.
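The constraint changed in step 2 below lives in the required_providers block (a minimal sketch):

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # was "~> 4.0"
    }
  }
}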

Safe Upgrade Procedure#

# 1. Read the upgrade guide (always published for major versions)
# AWS 5.0: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/version-5-upgrade

# 2. Update the version constraint
# version = "~> 4.0" → version = "~> 5.0"

# 3. Run terraform init -upgrade

# 4. Run terraform plan
# The plan will show changes caused by the upgrade (renamed arguments,
# changed defaults, deprecated resources)

# 5. Fix each issue the plan reveals
# - Rename deprecated arguments
# - Update resource types that were split or merged
# - Adjust for changed default values

# 6. Repeat plan/fix until plan shows no unexpected changes

# 7. Apply with human approval

Agent Protocol for Upgrades#

  1. Read the provider changelog and upgrade guide
  2. Make the version change and run init -upgrade
  3. Run plan and categorize every change (see the sketch after this list):
    • Expected (documented in upgrade guide) → fix the code
    • Unexpected (not in upgrade guide) → investigate before proceeding
  4. Present the full list of changes to the human with classification
  5. Apply only after all changes are understood and approved
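One way to enumerate every change for classification is the machine-readable plan (a sketch, assuming jq is available):

terraform plan -out=tfplan
terraform show -json tfplan \
  | jq -r '.resource_changes[]
           | select(.change.actions != ["no-op"])
           | [(.change.actions | join("+")), .address]
           | @tsv'
# Prints one action/address pair per change, e.g. "update  aws_db_instance.main"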

Refactoring Checklist#

Before starting any refactoring:

  • State backed up (terraform state pull > backup.tfstate)
  • Current plan is clean (terraform plan shows “No changes” before starting)
  • No pending PRs that modify the same Terraform code
  • Refactoring PR contains ONLY structural changes (no infrastructure modifications)
  • Each move verified with terraform plan showing zero changes
  • Cross-state references tested (terraform plan in dependent modules passes)
  • Documentation updated (CLAUDE.md, README, or architecture docs)