Terraform is one of those tools where the barrier to entry is low but mastery takes years. You can write your first main.tf in an afternoon. But structuring Terraform for a team of engineers managing production infrastructure across multiple AWS accounts is a different problem entirely.
Here’s what I’ve learned after five years and more than a few late-night incident calls.
Module Design: Thin Wrappers vs Fat Modules
The most consequential Terraform decision is how you structure modules. I’ve seen both extremes fail in production.
Thin wrapper modules expose almost every variable of the underlying resource:
module "bucket" {
  source          = "./modules/s3-bucket"
  bucket_name     = "my-app-assets"
  versioning      = true
  lifecycle_rules = var.lifecycle_rules
  # 20 more variables...
}
The problem: they’re not really modules; they’re renamed resources with extra steps. You get no meaningful abstraction, and the module interface ends up as complex as writing the resource directly.
Fat modules try to own too much:
module "app_platform" {
  source = "./modules/app-platform"
  # Provisions ECS, RDS, ElastiCache, ALB, Route53, ACM, IAM...
}
The problem: they’re impossible to test, hard to iterate on, and when one resource fails, the entire module is in a bad state.
The Right Abstraction: Business-Domain Modules
The pattern that’s worked best for me is business-domain modules — modules that map to a meaningful unit of infrastructure in your system:
modules/
├── ecs-service/ # An ECS service with its IAM role, log group, and security group
├── rds-postgres/ # An RDS instance with parameter group, subnet group, and enhanced monitoring
├── static-site/ # S3 + CloudFront + ACM + Route53 — the whole CDN stack
└── vpc/ # VPC with subnets, route tables, and VPC endpoints
These modules have opinionated defaults. They encode your org’s standards. An ecs-service module always creates an IAM execution role. An rds-postgres module always enables deletion protection in production.
module "api_service" {
  source = "git::https://github.com/myorg/terraform-modules.git//ecs-service?ref=v2.1.0"

  name          = "payment-api"
  image         = var.docker_image
  cpu           = 512
  memory        = 1024
  desired_count = 3
  environment   = var.environment
}
Clean. The complexity is in the module, not the call site.
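Inside the module is where the org standard actually lives. A sketch of what that might look like for the rds-postgres deletion-protection rule — the variable and resource names here are illustrative, not the real module:

```hcl
# Hypothetical excerpt from modules/rds-postgres/main.tf
variable "environment" {
  type = string
}

resource "aws_db_instance" "this" {
  # ... engine, instance class, storage, parameter group ...

  # The org standard, encoded: production databases can't be destroyed
  # without first changing this line and getting it through code review.
  deletion_protection = var.environment == "production"
}
```

Callers never pass deletion_protection at all; the default is the policy.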
State Management: Remote and Isolated
Never use local state in a team environment. You already know this. But how you structure remote state matters more than most people realize.
One state file per logical deployment unit. Not one per environment, not one for everything:
terraform/
├── networking/ ← VPC, subnets, transit gateway
├── platform/ ← ECS cluster, RDS, ElastiCache
├── services/
│ ├── api/ ← The API service and its dependencies
│ └── workers/ ← Background job workers
└── dns/ ← Route53 records (depends on services)
Each directory is an independent Terraform root. This means:
# Breaking change to networking doesn't block a services deploy
cd terraform/services/api && terraform apply
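Each root declares its own backend with a distinct state key. A minimal sketch, assuming an S3 bucket plus a DynamoDB table for state locking (the bucket and table names are illustrative):

```hcl
# terraform/services/api/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "services/api/terraform.tfstate" # unique per root
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # prevents concurrent applies
    encrypt        = true
  }
}
```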
Cross-state references use terraform_remote_state data sources:
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# Use the VPC ID from the networking state
resource "aws_security_group" "api" {
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
Preventing Drift
Drift — when your actual infrastructure doesn’t match your Terraform state — is the silent killer of IaC practices. It’s insidious because it accumulates gradually and only becomes visible during incidents.
The fix is boring but necessary: scheduled terraform plan runs in CI.
# .github/workflows/drift-detection.yml
on:
  schedule:
    - cron: '0 8 * * 1-5' # Weekday mornings
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Terraform Plan (drift check)
        run: |
          cd terraform/platform
          terraform init -backend-config="..."
          terraform plan -detailed-exitcode
          # Exit code 2 = changes detected = drift
When this pipeline fails, someone investigates. Usually it’s manual console changes by someone “just doing a quick fix.” Those console changes get codified or reverted. The discipline builds over time.
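The exit-code contract is worth making explicit in whatever wrapper script your CI uses. A small sketch (the function name is my own, not part of Terraform):

```shell
#!/usr/bin/env bash
# Sketch: interpreting `terraform plan -detailed-exitcode` results.
# Documented exit codes: 0 = no changes, 1 = error, 2 = changes present.
set -u

classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# In CI you would feed it the real exit status, e.g.:
#   terraform plan -detailed-exitcode; classify_plan_exit $?
classify_plan_exit 2
```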
The moved Block Is Underrated
Refactoring Terraform modules used to mean destroying and recreating resources. The moved block (introduced in Terraform 1.1) eliminates that:
moved {
  from = aws_s3_bucket.legacy_bucket
  to   = module.static_site.aws_s3_bucket.assets
}
Terraform understands that the resource didn’t change — it just moved in your configuration. No destroy, no recreate. This makes module refactoring safe enough to actually do.
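The imperative equivalent is the terraform state mv command, which performs the same move by rewriting state directly. The moved block is usually the better choice because it lives in version control and goes through code review, but the CLI form is handy for one-off cleanups:

```shell
# Same move as the block above, applied directly to state.
# Unlike a moved block, this is not reviewable in a PR and runs by hand.
terraform state mv \
  'aws_s3_bucket.legacy_bucket' \
  'module.static_site.aws_s3_bucket.assets'
```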
Keep It Boring
The best Terraform is boring Terraform. Avoid clever meta-arguments, minimize count and for_each on resources with state implications, pin module versions explicitly, and write outputs.tf religiously. The goal is infrastructure that a new team member can read and understand at 3am when something is on fire.
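Pinning, concretely, looks like this — a sketch with illustrative version numbers:

```hcl
terraform {
  required_version = "~> 1.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # pin the provider's major version
    }
  }
}

# And in module calls, pin an exact ref, never a branch:
#   source = "git::https://github.com/myorg/terraform-modules.git//ecs-service?ref=v2.1.0"
```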