Infrastructure Disaster Recovery with Terraform#
Application disaster recovery is well understood: replicate data, fail over traffic, restore from backups. Infrastructure disaster recovery is different — you are recovering the platform that applications run on. If your Terraform state is lost, your VPC is deleted, or an entire region goes down, how do you rebuild?
This article covers the DR patterns specific to Terraform-managed infrastructure: protecting state, recovering from state loss, designing infrastructure for regional failover, and the runbooks that agents and operators need when things go wrong.
State File Disaster Recovery#
The state file is the most critical artifact in your Terraform infrastructure. Lose it, and Terraform no longer knows what it manages. Every resource becomes invisible to Terraform: the resources still exist in the cloud but can no longer be planned, changed, or destroyed through code.
State Backup Strategy#
| Backend | Built-in Versioning | Recovery Method |
|---|---|---|
| S3 | Yes (enable bucket versioning) | Restore previous version from S3 |
| Azure Blob | Yes (enable blob versioning) | Restore previous version |
| GCS | Yes (enable object versioning) | Restore previous generation |
| Terraform Cloud | Yes (automatic) | Roll back to previous state version in UI |
| Local file | No | Hope you have a backup |
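For reference, a minimal S3 backend block pointing at a versioned state bucket looks roughly like this (bucket, key, and lock table names are assumptions):

terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # state locking table, assumed to exist
    encrypt        = true
  }
}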
# S3 backend with versioning — every state change is preserved
resource "aws_s3_bucket" "terraform_state" {
bucket = "myorg-terraform-state"
lifecycle {
prevent_destroy = true
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
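# The replication rule below references a destination bucket and an IAM replication role that
# are not shown. A minimal sketch of the destination bucket (bucket name and provider alias
# are assumptions; the destination must also have versioning enabled for replication to work):
resource "aws_s3_bucket" "terraform_state_dr" {
  provider = aws.dr # provider alias for the DR region, assumed to be configured
  bucket   = "myorg-terraform-state-dr"
}

resource "aws_s3_bucket_versioning" "terraform_state_dr" {
  provider = aws.dr
  bucket   = aws_s3_bucket.terraform_state_dr.id
  versioning_configuration {
    status = "Enabled"
  }
}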
# Cross-region replication for state bucket DR
resource "aws_s3_bucket_replication_configuration" "state_replication" {
bucket = aws_s3_bucket.terraform_state.id
role = aws_iam_role.replication.arn
rule {
id = "replicate-state"
status = "Enabled"
destination {
bucket = aws_s3_bucket.terraform_state_dr.arn
storage_class = "STANDARD"
}
}
}

Recovering from State Corruption#
If the state file is corrupted (invalid JSON, missing resources, incorrect resource addresses):
# Step 1: Download the current (corrupted) state
terraform state pull > corrupted.tfstate
# Step 2: List previous versions (S3 example)
aws s3api list-object-versions \
--bucket myorg-terraform-state \
--prefix production/terraform.tfstate \
--max-items 10
# Step 3: Download the last known good version
aws s3api get-object \
--bucket myorg-terraform-state \
--key production/terraform.tfstate \
--version-id "abc123" \
last-good.tfstate
# Step 4: Verify the old state is valid
terraform show -json last-good.tfstate | jq '.values.root_module.resources | length'
# Step 5: Push the recovered state
# Note: an older version has a lower serial, so add -force if Terraform rejects the push
terraform state push last-good.tfstate
# Step 6: Run plan to detect any drift since the recovered state
terraform plan

Gotcha: After recovering an older state version, terraform plan will show any changes made between the recovered version and now. Review carefully — some changes may be intentional (applied after the recovered snapshot).
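To sanity-check a recovered state before pushing it, compare the resource addresses in the corrupted and recovered files; a sketch, assuming state format version 4 with a top-level resources array:

diff \
  <(jq -r '.resources[] | .type + "." + .name' corrupted.tfstate | sort) \
  <(jq -r '.resources[] | .type + "." + .name' last-good.tfstate | sort)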
Recovering from Complete State Loss#
If the state file is completely lost (bucket deleted, no backups):
# Option 1: Re-import everything (preferred)
# Write import blocks for all managed resources
# See terraform-import-brownfield article for the full procedure
# Option 2: Start fresh with a new state file
# Terraform will try to create all resources, which will fail because they exist
# For each failure, import the existing resource
terraform apply 2>&1 | grep "already exists"
# Import each conflicting resource
terraform import aws_vpc.main vpc-0abc123
terraform import aws_subnet.public subnet-0abc123
# ... repeat for all resources

Prevention is better than recovery: Enable versioning on the state bucket, enable cross-region replication, and test state recovery quarterly.
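For Option 1 on Terraform 1.5+, import blocks let the re-import be planned and reviewed like any other change; a minimal sketch using the resource addresses and IDs from the example above:

import {
  to = aws_vpc.main
  id = "vpc-0abc123"
}

import {
  to = aws_subnet.public
  id = "subnet-0abc123"
}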
Blue-Green Infrastructure#
Blue-green deployment at the infrastructure level: maintain two complete environments and switch traffic between them.
The Pattern#
DNS / Load Balancer
│
┌──────┴──────┐
│ │
Blue (active) Green (standby)
┌─────────┐ ┌─────────┐
│ VPC │ │ VPC │
│ EKS │ │ EKS │
│ RDS │ │ RDS │
│ App │ │ App │
└─────────┘ └─────────┘

Terraform Implementation#
variable "active_environment" {
type = string
default = "blue" # or "green"
}
module "blue" {
source = "./modules/environment"
name = "blue"
vpc_cidr = "10.0.0.0/16"
db_instance = "db.r6g.large"
}
module "green" {
source = "./modules/environment"
name = "green"
vpc_cidr = "10.1.0.0/16"
db_instance = "db.r6g.large"
}
# DNS points to the active environment
resource "aws_route53_record" "app" {
zone_id = aws_route53_zone.main.zone_id
name = "app.example.com"
type = "A"
alias {
name = var.active_environment == "blue" ? module.blue.alb_dns_name : module.green.alb_dns_name
zone_id = var.active_environment == "blue" ? module.blue.alb_zone_id : module.green.alb_zone_id
evaluate_target_health = true
}
}

Failover: Change active_environment from “blue” to “green” and apply. The Route53 change takes effect within about a minute, and the old environment stays running as a rollback target.
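The switch itself is a single-variable apply; a sketch, assuming the variable defined above:

terraform apply -var="active_environment=green"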
Cost concern: Running two complete environments doubles infrastructure cost. For most teams, active-passive (standby environment is smaller or paused) is more practical than active-active.
Active-Passive Variant#
module "primary" {
source = "./modules/environment"
name = "primary"
vpc_cidr = "10.0.0.0/16"
db_instance = "db.r6g.large"
node_count = 3
}
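# The standby sizing below is driven by variables so the failover commands further down can
# scale it up without editing code. These declarations are assumed (defaults are illustrative):
variable "standby_db_instance" {
  type    = string
  default = "db.r6g.medium" # smaller than primary in steady state
}

variable "standby_node_count" {
  type    = number
  default = 1 # minimal footprint until failover
}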
module "standby" {
source = "./modules/environment"
name = "standby"
vpc_cidr = "10.1.0.0/16"
db_instance = "db.r6g.medium" # smaller in standby
node_count = 1 # minimal in standby
}During failover, scale up the standby before switching traffic:
# 1. Scale up standby
terraform apply -var="standby_node_count=3" -var="standby_db_instance=db.r6g.large"
# 2. Wait for scaling to complete
# 3. Switch traffic
terraform apply -var="active_environment=standby"Cross-Region DR with Terraform#
Multi-Region Provider Configuration#
provider "aws" {
alias = "primary"
region = "us-east-1"
}
provider "aws" {
alias = "dr"
region = "us-west-2"
}
# Primary region infrastructure
module "primary" {
source = "./modules/region"
providers = {
aws = aws.primary
}
region = "us-east-1"
vpc_cidr = "10.0.0.0/16"
is_primary = true
}
# DR region infrastructure
module "dr" {
source = "./modules/region"
providers = {
aws = aws.dr
}
region = "us-west-2"
vpc_cidr = "10.1.0.0/16"
is_primary = false
}

Cross-Region Data Replication#
# RDS cross-region read replica
resource "aws_db_instance" "dr_replica" {
provider = aws.dr
replicate_source_db = module.primary.db_arn
instance_class = "db.r6g.large"
skip_final_snapshot = false
# Replica becomes standalone primary during failover
# Promotion is a manual or scripted operation, not a Terraform change
}
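# Hypothetical promotion flow, scripted outside Terraform (identifier and region are assumptions):
#   aws rds promote-read-replica --db-instance-identifier dr-replica --region us-west-2
# After promotion, update this resource (drop replicate_source_db) or re-import the standalone
# instance; otherwise the next plan will show drift.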
# S3 cross-region replication
resource "aws_s3_bucket_replication_configuration" "data" {
provider = aws.primary
bucket = module.primary.data_bucket_id
role = aws_iam_role.replication.arn
rule {
id = "replicate-all"
status = "Enabled"
destination {
bucket = module.dr.data_bucket_arn
}
}
}
# Global Accelerator or Route53 health check for automatic failover
resource "aws_route53_health_check" "primary" {
fqdn = module.primary.endpoint
port = 443
type = "HTTPS"
request_interval = 10
failure_threshold = 3
}
resource "aws_route53_record" "app" {
zone_id = aws_route53_zone.main.zone_id
name = "app.example.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
set_identifier = "primary"
health_check_id = aws_route53_health_check.primary.id
alias {
name = module.primary.alb_dns_name
zone_id = module.primary.alb_zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "app_dr" {
zone_id = aws_route53_zone.main.zone_id
name = "app.example.com"
type = "A"
failover_routing_policy {
type = "SECONDARY"
}
set_identifier = "dr"
alias {
name = module.dr.alb_dns_name
zone_id = module.dr.alb_zone_id
evaluate_target_health = true
}
}

Immutable Infrastructure Rebuild#
The strongest DR guarantee: the ability to rebuild everything from code. If you can terraform apply on an empty cloud account and get a working environment, your DR is as good as it gets.
Testing the Rebuild#
# Quarterly DR test: rebuild from scratch in a test account
export AWS_PROFILE=dr-test-account
# Apply all layers in order
cd terraform/networking && terraform init && terraform apply -auto-approve
cd ../identity && terraform init && terraform apply -auto-approve
cd ../database && terraform init && terraform apply -auto-approve
cd ../compute && terraform init && terraform apply -auto-approve
# Run smoke tests against the rebuilt environment
./scripts/smoke-test.sh https://dr-test.example.com
# Tear down after test
cd ../compute && terraform destroy -auto-approve
cd ../database && terraform destroy -auto-approve
cd ../identity && terraform destroy -auto-approve
cd ../networking && terraform destroy -auto-approve

What Prevents Clean Rebuilds#
| Blocker | Example | Fix |
|---|---|---|
| Hardcoded IDs | subnet_id = "subnet-0abc123" | Use data sources or terraform_remote_state |
| Manual steps | “Then click Create in the console” | Automate everything in Terraform |
| Undocumented dependencies | “This needs the VPN to be up first” | Add depends_on and document in README |
| External data not in code | Database content, uploaded files | Separate data DR (backups) from infra DR (Terraform) |
| Secret bootstrapping | “First create the KMS key manually” | Use Terraform to create KMS keys, bootstrap Vault |
| DNS delegation | “NS records set in the registrar” | Document manual steps, automate what you can |
| Global unique names | S3 bucket name already taken | Use naming convention with account ID or random suffix |
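For the last row, one way to avoid global name collisions on rebuild is to derive names from the account ID plus a random suffix; a minimal sketch (bucket and resource names are illustrative):

data "aws_caller_identity" "current" {}

resource "random_id" "suffix" {
  byte_length = 4
}

resource "aws_s3_bucket" "data" {
  # e.g. myorg-data-123456789012-9f86d081
  bucket = "myorg-data-${data.aws_caller_identity.current.account_id}-${random_id.suffix.hex}"
}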
DR Runbook Template#
Every Terraform-managed environment should have a runbook for common disaster scenarios.
Scenario 1: Single Resource Deleted#
SITUATION: A critical resource was manually deleted (e.g., someone deleted the RDS instance)
STEPS:
1. Do NOT run terraform apply immediately (it will try to recreate, possibly with wrong settings)
2. Run: terraform plan
- If plan shows "will be created": verify the configuration matches what was deleted
- If plan shows errors: the resource may have dependencies that also need attention
3. If the resource had data (RDS, S3): restore from backup FIRST
4. Run: terraform apply
5. Verify the recreated resource is correct
6. Investigate: who deleted it and why (audit logs)

Scenario 2: State File Lost#
SITUATION: The state file is missing or corrupted, and no backup version is available
STEPS:
1. Do NOT run terraform apply (it will try to create everything, conflicting with existing resources)
2. Check for state backups:
- S3 versioning: aws s3api list-object-versions --bucket STATE_BUCKET --prefix STATE_KEY
- Azure blob versions: az storage blob list --account-name ACCOUNT --container-name tfstate --include v
- GCS generations: gsutil ls -la gs://STATE_BUCKET/STATE_KEY
3. If backup found: terraform state push recovered.tfstate
4. If no backup: re-import all resources (see terraform-import-brownfield article)
5. Run: terraform plan — must show "No changes" before proceeding

Scenario 3: Region Outage#
SITUATION: The primary region is down, need to fail over to DR region
STEPS:
1. Verify primary is actually down (not a monitoring false positive)
2. If automatic failover (Route53 health checks): verify traffic is routing to DR
3. If manual failover:
a. Promote RDS read replica: aws rds promote-read-replica --db-instance-identifier DR_INSTANCE
b. Update DNS: terraform apply -var="active_region=dr"
c. Scale up DR compute: terraform apply -var="dr_node_count=PRODUCTION_SIZE"
4. Verify DR environment is serving traffic correctly
5. After primary recovers:
a. Do NOT immediately fail back (ensure primary is stable)
b. Re-sync data from DR to primary
c. Fail back during a maintenance window

Scenario 4: Terraform State Locked#
SITUATION: State lock stuck, blocking all operations
STEPS:
1. Identify who holds the lock:
The lock error from a blocked terraform plan or apply shows the lock ID, who acquired it, and when
2. Check if a Terraform process is still running (CI/CD pipeline, another engineer)
3. If the process crashed:
terraform force-unlock LOCK_ID
4. If the process is still running: wait for it to complete
5. Run: terraform plan — verify state is consistent after the lock is cleared

DR Testing Checklist#
Test these quarterly:
- State backup restore — download an older state version, push it, run plan
- Single resource recreation — delete a non-critical resource, let Terraform recreate it
- Cross-region failover — switch traffic to DR region, verify functionality
- Full rebuild — destroy and recreate a non-production environment from code
- Runbook accuracy — walk through each runbook step to verify commands still work
- Secret recovery — verify you can access secrets needed for DR (Vault unsealing, cloud credentials)
- Communication — verify the DR contact list and notification channels are current