Refactoring Terraform#

Terraform configurations grow organically. A project starts with 10 resources in one directory. Six months later it has 80 resources, 3 levels of modules, and a state file that takes 2 minutes to plan. Changes feel risky because everything is interconnected. New team members (or agents) cannot understand the structure without reading every file.

Refactoring addresses this — but Terraform refactoring is harder than code refactoring because the state file maps resource addresses to real infrastructure. Rename a resource and Terraform thinks you want to destroy the old one and create a new one. Move a resource into a module and Terraform plans to recreate it. Every structural change requires corresponding state manipulation.
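For example, a pure rename looks like a replacement to Terraform unless you record the move. A minimal sketch (the bucket resource name is illustrative):

# Renaming aws_s3_bucket.logs to aws_s3_bucket.app_logs in code alone makes
# terraform plan propose one destroy and one create. A moved block
# (Terraform 1.1+, covered in Strategy 2) tells Terraform both addresses
# refer to the same real object:
moved {
  from = aws_s3_bucket.logs
  to   = aws_s3_bucket.app_logs
}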

When to Refactor#

Signals That Refactoring Is Needed#

| Signal | What It Means | Severity |
| --- | --- | --- |
| terraform plan takes > 60 seconds | State file is too large; refreshing all resources is slow | Moderate |
| terraform state list shows > 50 resources | Single state file covers too much; blast radius is everything | High |
| Module nesting is 3+ levels deep | Agent/human context cost for understanding is too high | Moderate |
| Two teams need to modify the same directory | State lock conflicts block parallel work | High |
| A change to networking requires re-planning the database | Unrelated concerns share state, creating coupling | High |
| Adding a new environment means duplicating 500 lines | No reusable structure; environments diverge over time | Moderate |
| variables.tf has 40+ variables | Module interface is too broad; the module does too many things | Moderate |
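The first two signals are easy to measure directly (a quick sketch):

time terraform plan              # how long does a full plan take?
terraform state list | wc -l     # how many resources live in this state?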

When NOT to Refactor#

  • The configuration is small (< 30 resources) and stable — refactoring adds complexity without benefit
  • You are about to make a time-sensitive change — refactor after, not during
  • The only complaint is “it is not DRY” — DRY is not a goal in infrastructure code; maintainability is
  • You are the only person working on it and the structure works for you

Strategy 1: State Decomposition (Splitting a Monolith)#

The most impactful refactoring: splitting one state file into multiple independent root modules.

Before#

infrastructure/
├── main.tf           # VPC, subnets, EKS, RDS, S3, IAM — everything
├── variables.tf
├── outputs.tf
└── backend.tf        # key = "infrastructure/terraform.tfstate"

State: 80 resources in one file. One lock. One blast radius.

After#

infrastructure/
├── networking/       # VPC, subnets, routes, NAT, IGW — 15 resources
├── database/         # RDS, subnet group, security group — 10 resources
├── compute/          # EKS, node groups, IRSA — 20 resources
└── application/      # Helm releases, K8s resources — 35 resources

Four state files. Four locks. Four independent blast radii.

The Decomposition Procedure#

Step 1: Plan the split. Draw dependency boundaries:

networking (no dependencies)
    ↓
database (needs: subnet_ids, vpc_id from networking)
compute  (needs: subnet_ids, vpc_id from networking)
    ↓
application (needs: cluster_endpoint from compute, db_endpoint from database)

Resources that reference each other must be in the same module or connected via terraform_remote_state.

Step 2: Create the new root module structure. For each new root module, create the directory with providers.tf, backend.tf, variables.tf, and outputs.tf.
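A minimal backend.tf for the networking module might look like this (bucket, key, and region mirror the remote-state example in Step 4; substitute your own backend settings):

# networking/backend.tf
terraform {
  backend "s3" {
    bucket = "myorg-tfstate"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}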

Step 3: Move resources one module at a time. Start with the module that has no dependencies (networking):

# 1. Pull the monolith's state to a local file (required with a remote backend)
cd infrastructure/
terraform state pull > monolith.tfstate

# 2. Move each networking resource into a local state file for the new root module
terraform state mv \
  -state=monolith.tfstate \
  -state-out=../networking/networking.tfstate \
  aws_vpc.main aws_vpc.main
# Repeat for all networking resources

# 3. Move the corresponding .tf code to the new directory

# 4. Push both state files back to their backends
terraform state push monolith.tfstate
cd ../networking/
terraform init
terraform state push networking.tfstate

# 5. In the new directory, verify that state matches code
terraform plan
# Should show: No changes

Step 4: Add cross-state data sources. In the database module:

# database/data.tf
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "myorg-tfstate"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

Replace direct resource references with remote state references:

# Before: aws_vpc.main.id
# After:  data.terraform_remote_state.networking.outputs.vpc_id
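This only works for values the networking module explicitly exports; remote state exposes declared outputs and nothing else. A sketch matching the references above (the subnet output names are illustrative):

# networking/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "subnet_ids" {
  value = [aws_subnet.private_a.id, aws_subnet.private_b.id]
}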

Step 5: Verify each module independently. Run terraform plan in each new root module. All should show “No changes.”

Step 6: Remove the old monolith. Once all resources have been moved out and verified, the old root module is empty. Delete it.
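A final check before deleting (sketch):

cd infrastructure/
terraform state list
# Should print nothing: every resource now lives in one of the new state files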

Safety Rules for State Decomposition#

  • Always back up state before moving: terraform state pull > backup-$(date +%Y%m%d).tfstate (see the restore sketch after this list)
  • Move one concern at a time. Complete networking before starting database.
  • Verify after each move. terraform plan should show zero changes.
  • Do not mix moves with code changes. The refactoring PR should have zero infrastructure changes — only structural reorganization.
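If a move goes wrong, the dated backup can be pushed back over the broken state (a sketch; substitute your actual backup filename):

# Restore a known-good state; -force overrides the serial/lineage check
terraform state push -force backup-YYYYMMDD.tfstate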

Strategy 2: Module Extraction#

Converting inline resources into a reusable module — without destroying and recreating them.

Using moved Blocks (Terraform 1.1+)#

# Before: resources defined inline
resource "aws_vpc" "main" { ... }
resource "aws_subnet" "private_a" { ... }
resource "aws_subnet" "private_b" { ... }

# After: resources moved into a module
module "networking" {
  source = "./modules/networking"
  # ... variables ...
}

# Tell Terraform these are the same resources
moved {
  from = aws_vpc.main
  to   = module.networking.aws_vpc.main
}

moved {
  from = aws_subnet.private_a
  to   = module.networking.aws_subnet.private_a
}

moved {
  from = aws_subnet.private_b
  to   = module.networking.aws_subnet.private_b
}

Run terraform plan — it should show moves, not creates/destroys:

  # aws_vpc.main has moved to module.networking.aws_vpc.main
    resource "aws_vpc" "main" {
        id                               = "vpc-0abc123"
        # (no changes)
    }

After a successful apply, the moved blocks can be removed; keep them for one release cycle if multiple environments apply the same code separately, so each environment records the move before the blocks disappear.

When moved Blocks Cannot Help#

moved blocks do not work across state files. If you are extracting resources into a different root module (state decomposition), use terraform state mv instead.

Strategy 3: Workspace to Directory Migration#

Moving from workspaces (same code, different state) to directories (different code per environment).

Why Migrate#

Workspaces assume all environments have the same structure. When production needs a larger database or staging needs a debugging sidecar, you end up with:

resource "aws_db_instance" "main" {
  instance_class = terraform.workspace == "prod" ? "db.r5.xlarge" : "db.t3.micro"
  multi_az       = terraform.workspace == "prod" ? true : false
  # ... more ternaries for every difference
}

Directories allow genuine structural differences between environments without conditional gymnastics.
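The same database expressed per directory; values become literal and each environment can diverge structurally (a sketch, with the remaining required arguments elided as in the example above):

# envs/prod/main.tf
resource "aws_db_instance" "main" {
  instance_class = "db.r5.xlarge"
  multi_az       = true
  # ... engine, storage, and other arguments ...
}

# envs/staging/main.tf
resource "aws_db_instance" "main" {
  instance_class = "db.t3.micro"
  multi_az       = false
  # ... same arguments, plus any staging-only resources in this directory ...
}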

Migration Procedure#

# 1. Export each workspace's state
terraform workspace select staging
terraform state pull > staging.tfstate

terraform workspace select prod
terraform state pull > prod.tfstate

# 2. Create directory structure
mkdir -p envs/staging envs/prod

# 3. Copy code to each directory, adjust backend keys
# envs/staging/backend.tf: key = "staging/terraform.tfstate"
# envs/prod/backend.tf:    key = "prod/terraform.tfstate"

# 4. Push state to new backends
cd envs/staging
terraform init
terraform state push ../../staging.tfstate
terraform plan  # should show No changes

cd ../prod
terraform init
terraform state push ../../prod.tfstate
terraform plan  # should show No changes

# 5. Delete the old workspaces (after verifying both environments work)
cd ../..
terraform workspace select default
terraform workspace delete -force staging  # -force: the workspace still holds its old state copy
terraform workspace delete -force prod

Strategy 4: Provider Version Upgrades#

Major provider version upgrades (e.g., AWS provider 4.x → 5.x) can introduce breaking changes.
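The constraint changed in step 2 below lives in the required_providers block (a minimal sketch):

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # was "~> 4.0"
    }
  }
}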

Safe Upgrade Procedure#

# 1. Read the upgrade guide (always published for major versions)
# AWS 5.0: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/version-5-upgrade

# 2. Update the version constraint
# version = "~> 4.0" → version = "~> 5.0"

# 3. Run terraform init -upgrade

# 4. Run terraform plan
# The plan will show changes caused by the upgrade (renamed arguments,
# changed defaults, deprecated resources)

# 5. Fix each issue the plan reveals
# - Rename deprecated arguments
# - Update resource types that were split or merged
# - Adjust for changed default values

# 6. Repeat plan/fix until plan shows no unexpected changes

# 7. Apply with human approval

Agent Protocol for Upgrades#

  1. Read the provider changelog and upgrade guide
  2. Make the version change and run init -upgrade
  3. Run plan and categorize every change (see the sketch after this list):
    • Expected (documented in upgrade guide) → fix the code
    • Unexpected (not in upgrade guide) → investigate before proceeding
  4. Present the full list of changes to the human with classification
  5. Apply only after all changes are understood and approved
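One way to enumerate every change for classification is the machine-readable plan (a sketch, assuming jq is available):

terraform plan -out=tfplan
terraform show -json tfplan \
  | jq -r '.resource_changes[]
           | select(.change.actions != ["no-op"])
           | [(.change.actions | join("+")), .address]
           | @tsv'
# Prints one action/address pair per change, e.g. "update  aws_db_instance.main"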

Refactoring Checklist#

Before starting any refactoring:

  • State backed up (terraform state pull > backup.tfstate)
  • Current plan is clean (terraform plan shows “No changes” before starting)
  • No pending PRs that modify the same Terraform code
  • Refactoring PR contains ONLY structural changes (no infrastructure modifications)
  • Each move verified with terraform plan showing zero changes
  • Cross-state references tested (terraform plan in dependent modules passes)
  • Documentation updated (CLAUDE.md, README, or architecture docs)