Infrastructure Disaster Recovery with Terraform#
Application disaster recovery is well understood: replicate data, fail over traffic, restore from backups. Infrastructure disaster recovery is different — you are recovering the platform that applications run on. If your Terraform state is lost, your VPC is deleted, or an entire region goes down, how do you rebuild?
This article covers the DR patterns specific to Terraform-managed infrastructure: protecting state, recovering from state loss, designing infrastructure for regional failover, and the runbooks that agents and operators need when things go wrong.
State File Disaster Recovery#
The state file is the most critical artifact in your Terraform infrastructure. Lose it, and Terraform no longer knows what it manages. Every resource becomes invisible to Terraform: the resources still exist in the cloud but can no longer be planned, changed, or destroyed through code.
State Backup Strategy#
| Backend | Built-in Versioning | Recovery Method |
|---|---|---|
| S3 | Yes (enable bucket versioning) | Restore previous version from S3 |
| Azure Blob | Yes (enable blob versioning) | Restore previous version |
| GCS | Yes (enable object versioning) | Restore previous generation |
| Terraform Cloud | Yes (automatic) | Roll back to previous state version in UI |
| Local file | No | Hope you have a backup |
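For reference, a minimal S3 backend block pointing at a versioned state bucket looks roughly like this (bucket, key, and lock table names are assumptions):

terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # state locking table, assumed to exist
    encrypt        = true
  }
}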
# S3 backend with versioning — every state change is preserved
resource "aws_s3_bucket" "terraform_state" {
bucket = "myorg-terraform-state"
lifecycle {
prevent_destroy = true
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
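# The replication rule below references a destination bucket and an IAM replication role that
# are not shown. A minimal sketch of the destination bucket (bucket name and provider alias
# are assumptions; the destination must also have versioning enabled for replication to work):
resource "aws_s3_bucket" "terraform_state_dr" {
  provider = aws.dr # provider alias for the DR region, assumed to be configured
  bucket   = "myorg-terraform-state-dr"
}

resource "aws_s3_bucket_versioning" "terraform_state_dr" {
  provider = aws.dr
  bucket   = aws_s3_bucket.terraform_state_dr.id
  versioning_configuration {
    status = "Enabled"
  }
}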
# Cross-region replication for state bucket DR
resource "aws_s3_bucket_replication_configuration" "state_replication" {
bucket = aws_s3_bucket.terraform_state.id
role = aws_iam_role.replication.arn
rule {
id = "replicate-state"
status = "Enabled"
destination {
bucket = aws_s3_bucket.terraform_state_dr.arn
storage_class = "STANDARD"
}
}
}

Recovering from State Corruption#
If the state file is corrupted (invalid JSON, missing resources, incorrect resource addresses):
# Step 1: Download the current (corrupted) state
terraform state pull > corrupted.tfstate
# Step 2: List previous versions (S3 example)
aws s3api list-object-versions \
--bucket myorg-terraform-state \
--prefix production/terraform.tfstate \
--max-items 10
# Step 3: Download the last known good version
aws s3api get-object \
--bucket myorg-terraform-state \
--key production/terraform.tfstate \
--version-id "abc123" \
last-good.tfstate
# Step 4: Verify the old state is valid
terraform show -json last-good.tfstate | jq '.values.root_module.resources | length'
# Step 5: Push the recovered state
# Note: an older version has a lower serial, so add -force if Terraform rejects the push
terraform state push last-good.tfstate
# Step 6: Run plan to detect any drift since the recovered state
terraform plan

Gotcha: After recovering an older state version, terraform plan will show any changes made between the recovered version and now. Review carefully — some changes may be intentional (applied after the recovered snapshot).
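To sanity-check a recovered state before pushing it, compare the resource addresses in the corrupted and recovered files; a sketch, assuming state format version 4 with a top-level resources array:

diff \
  <(jq -r '.resources[] | .type + "." + .name' corrupted.tfstate | sort) \
  <(jq -r '.resources[] | .type + "." + .name' last-good.tfstate | sort)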
Recovering from Complete State Loss#
If the state file is completely lost (bucket deleted, no backups):
# Option 1: Re-import everything (preferred)
# Write import blocks for all managed resources
# See terraform-import-brownfield article for the full procedure
# Option 2: Start fresh with a new state file
# Terraform will try to create all resources, which will fail because they exist
# For each failure, import the existing resource
terraform apply 2>&1 | grep "already exists"
# Import each conflicting resource
terraform import aws_vpc.main vpc-0abc123
terraform import aws_subnet.public subnet-0abc123
# ... repeat for all resources

Prevention is better than recovery: Enable versioning on the state bucket, enable cross-region replication, and test state recovery quarterly.
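For Option 1 on Terraform 1.5+, import blocks let the re-import be planned and reviewed like any other change; a minimal sketch using the resource addresses and IDs from the example above:

import {
  to = aws_vpc.main
  id = "vpc-0abc123"
}

import {
  to = aws_subnet.public
  id = "subnet-0abc123"
}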
Blue-Green Infrastructure#
Blue-green deployment at the infrastructure level: maintain two complete environments and switch traffic between them.
The Pattern#
DNS / Load Balancer
│
┌──────┴──────┐
│ │
Blue (active) Green (standby)
┌─────────┐ ┌─────────┐
│ VPC │ │ VPC │
│ EKS │ │ EKS │
│ RDS │ │ RDS │
│ App │ │ App │
└─────────┘ └─────────┘

Terraform Implementation#
variable "active_environment" {
type = string
default = "blue" # or "green"
}
module "blue" {
source = "./modules/environment"
name = "blue"
vpc_cidr = "10.0.0.0/16"
db_instance = "db.r6g.large"
}
module "green" {
source = "./modules/environment"
name = "green"
vpc_cidr = "10.1.0.0/16"
db_instance = "db.r6g.large"
}
# DNS points to the active environment
resource "aws_route53_record" "app" {
zone_id = aws_route53_zone.main.zone_id
name = "app.example.com"
type = "A"
alias {
name = var.active_environment == "blue" ? module.blue.alb_dns_name : module.green.alb_dns_name
zone_id = var.active_environment == "blue" ? module.blue.alb_zone_id : module.green.alb_zone_id
evaluate_target_health = true
}
}

Failover: Change active_environment from “blue” to “green” and apply. The Route53 change takes effect within about a minute, and the old environment stays running as a rollback target.
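The switch itself is a single-variable apply; a sketch, assuming the variable defined above:

terraform apply -var="active_environment=green"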
Cost concern: Running two complete environments doubles infrastructure cost. For most teams, active-passive (standby environment is smaller or paused) is more practical than active-active.
Active-Passive Variant#
module "primary" {
source = "./modules/environment"
name = "primary"
vpc_cidr = "10.0.0.0/16"
db_instance = "db.r6g.large"
node_count = 3
}
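# The standby sizing below is driven by variables so the failover commands further down can
# scale it up without editing code. These declarations are assumed (defaults are illustrative):
variable "standby_db_instance" {
  type    = string
  default = "db.r6g.medium" # smaller than primary in steady state
}

variable "standby_node_count" {
  type    = number
  default = 1 # minimal footprint until failover
}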
module "standby" {
source = "./modules/environment"
name = "standby"
vpc_cidr = "10.1.0.0/16"
db_instance = "db.r6g.medium" # smaller in standby
node_count = 1 # minimal in standby
}During failover, scale up the standby before switching traffic:
# 1. Scale up standby
terraform apply -var="standby_node_count=3" -var="standby_db_instance=db.r6g.large"
# 2. Wait for scaling to complete
# 3. Switch traffic
terraform apply -var="active_environment=standby"Cross-Region DR with Terraform#
Multi-Region Provider Configuration#
provider "aws" {
alias = "primary"
region = "us-east-1"
}
provider "aws" {
alias = "dr"
region = "us-west-2"
}
# Primary region infrastructure
module "primary" {
source = "./modules/region"
providers = {
aws = aws.primary
}
region = "us-east-1"
vpc_cidr = "10.0.0.0/16"
is_primary = true
}
# DR region infrastructure
module "dr" {
source = "./modules/region"
providers = {
aws = aws.dr
}
region = "us-west-2"
vpc_cidr = "10.1.0.0/16"
is_primary = false
}

Cross-Region Data Replication#
# RDS cross-region read replica
resource "aws_db_instance" "dr_replica" {
provider = aws.dr
replicate_source_db = module.primary.db_arn
instance_class = "db.r6g.large"
skip_final_snapshot = false
# Replica becomes standalone primary during failover
# Promotion is a manual or scripted operation, not a Terraform change
}
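# Hypothetical promotion flow, scripted outside Terraform (identifier and region are assumptions):
#   aws rds promote-read-replica --db-instance-identifier dr-replica --region us-west-2
# After promotion, update this resource (drop replicate_source_db) or re-import the standalone
# instance; otherwise the next plan will show drift.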
# S3 cross-region replication
resource "aws_s3_bucket_replication_configuration" "data" {
provider = aws.primary
bucket = module.primary.data_bucket_id
role = aws_iam_role.replication.arn
rule {
id = "replicate-all"
status = "Enabled"
destination {
bucket = module.dr.data_bucket_arn
}
}
}
# Global Accelerator or Route53 health check for automatic failover
resource "aws_route53_health_check" "primary" {
fqdn = module.primary.endpoint
port = 443
type = "HTTPS"
request_interval = 10
failure_threshold = 3
}
resource "aws_route53_record" "app" {
zone_id = aws_route53_zone.main.zone_id
name = "app.example.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
set_identifier = "primary"
health_check_id = aws_route53_health_check.primary.id
alias {
name = module.primary.alb_dns_name
zone_id = module.primary.alb_zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "app_dr" {
zone_id = aws_route53_zone.main.zone_id
name = "app.example.com"
type = "A"
failover_routing_policy {
type = "SECONDARY"
}
set_identifier = "dr"
alias {
name = module.dr.alb_dns_name
zone_id = module.dr.alb_zone_id
evaluate_target_health = true
}
}

Immutable Infrastructure Rebuild#
The strongest DR guarantee: the ability to rebuild everything from code. If you can terraform apply on an empty cloud account and get a working environment, your DR is as good as it gets.
Testing the Rebuild#
# Quarterly DR test: rebuild from scratch in a test account
export AWS_PROFILE=dr-test-account
# Apply all layers in order
cd terraform/networking && terraform init && terraform apply -auto-approve
cd ../identity && terraform init && terraform apply -auto-approve
cd ../database && terraform init && terraform apply -auto-approve
cd ../compute && terraform init && terraform apply -auto-approve
# Run smoke tests against the rebuilt environment
./scripts/smoke-test.sh https://dr-test.example.com
# Tear down after test
cd ../compute && terraform destroy -auto-approve
cd ../database && terraform destroy -auto-approve
cd ../identity && terraform destroy -auto-approve
cd ../networking && terraform destroy -auto-approve

What Prevents Clean Rebuilds#
| Blocker | Example | Fix |
|---|---|---|
| Hardcoded IDs | subnet_id = "subnet-0abc123" | Use data sources or terraform_remote_state |
| Manual steps | “Then click Create in the console” | Automate everything in Terraform |
| Undocumented dependencies | “This needs the VPN to be up first” | Add depends_on and document in README |
| External data not in code | Database content, uploaded files | Separate data DR (backups) from infra DR (Terraform) |
| Secret bootstrapping | “First create the KMS key manually” | Use Terraform to create KMS keys, bootstrap Vault |
| DNS delegation | “NS records set in the registrar” | Document manual steps, automate what you can |
| Global unique names | S3 bucket name already taken | Use naming convention with account ID or random suffix |
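For the last row, one way to avoid global name collisions on rebuild is to derive names from the account ID plus a random suffix; a minimal sketch (bucket and resource names are illustrative):

data "aws_caller_identity" "current" {}

resource "random_id" "suffix" {
  byte_length = 4
}

resource "aws_s3_bucket" "data" {
  # e.g. myorg-data-123456789012-9f86d081
  bucket = "myorg-data-${data.aws_caller_identity.current.account_id}-${random_id.suffix.hex}"
}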
DR Runbook Template#
Every Terraform-managed environment should have a runbook for common disaster scenarios.
Scenario 1: Single Resource Deleted#
SITUATION: A critical resource was manually deleted (e.g., someone deleted the RDS instance)
STEPS:
1. Do NOT run terraform apply immediately (it will try to recreate, possibly with wrong settings)
2. Run: terraform plan
- If plan shows "will be created": verify the configuration matches what was deleted
- If plan shows errors: the resource may have dependencies that also need attention
3. If the resource had data (RDS, S3): restore from backup FIRST
4. Run: terraform apply
5. Verify the recreated resource is correct
6. Investigate: who deleted it and why (audit logs)

Scenario 2: State File Lost#
SITUATION: The state file is missing or corrupted, and no backup version is available
STEPS:
1. Do NOT run terraform apply (it will try to create everything, conflicting with existing resources)
2. Check for state backups:
- S3 versioning: aws s3api list-object-versions --bucket STATE_BUCKET --prefix STATE_KEY
- Azure blob versions: az storage blob list --account-name ACCOUNT --container-name tfstate --include v
- GCS generations: gsutil ls -la gs://STATE_BUCKET/STATE_KEY
3. If backup found: terraform state push recovered.tfstate
4. If no backup: re-import all resources (see terraform-import-brownfield article)
5. Run: terraform plan — must show "No changes" before proceeding

Scenario 3: Region Outage#
SITUATION: The primary region is down, need to fail over to DR region
STEPS:
1. Verify primary is actually down (not a monitoring false positive)
2. If automatic failover (Route53 health checks): verify traffic is routing to DR
3. If manual failover:
a. Promote RDS read replica: aws rds promote-read-replica --db-instance-identifier DR_INSTANCE
b. Update DNS: terraform apply -var="active_region=dr"
c. Scale up DR compute: terraform apply -var="dr_node_count=PRODUCTION_SIZE"
4. Verify DR environment is serving traffic correctly
5. After primary recovers:
a. Do NOT immediately fail back (ensure primary is stable)
b. Re-sync data from DR to primary
c. Fail back during a maintenance window

Scenario 4: Terraform State Locked#
SITUATION: State lock stuck, blocking all operations
STEPS:
1. Identify who holds the lock:
The lock error from a blocked terraform plan or apply shows the lock ID, who acquired it, and when
2. Check if a Terraform process is still running (CI/CD pipeline, another engineer)
3. If the process crashed:
terraform force-unlock LOCK_ID
4. If the process is still running: wait for it to complete
5. Run: terraform plan — verify state is consistent after the lock is cleared

DR Testing Checklist#
Test these quarterly:
- State backup restore — download an older state version, push it, run plan
- Single resource recreation — delete a non-critical resource, let Terraform recreate it
- Cross-region failover — switch traffic to DR region, verify functionality
- Full rebuild — destroy and recreate a non-production environment from code
- Runbook accuracy — walk through each runbook step to verify commands still work
- Secret recovery — verify you can access secrets needed for DR (Vault unsealing, cloud credentials)
- Communication — verify the DR contact list and notification channels are current