Terraform is one of those tools where the entry bar is low but mastery takes years. You can write your first main.tf in an afternoon. But structuring Terraform for a team of engineers managing production infrastructure across multiple AWS accounts is a different problem entirely.

Here’s what I’ve learned after five years and more than a few late-night incident calls.

Module Design: Thin Wrappers vs Fat Modules

The most consequential Terraform decision is how you structure modules. I’ve seen both extremes fail in production.

Thin wrapper modules expose almost every variable of the underlying resource:

module "bucket" {
  source = "./modules/s3-bucket"

  bucket_name     = "my-app-assets"
  versioning      = true
  lifecycle_rules = var.lifecycle_rules
  # 20 more variables...
}

The problem: they aren’t really modules; they’re renamed resources with extra steps. You get no meaningful abstraction, and the module interface ends up as complex as writing the resource directly.

Fat modules try to own too much:

module "app_platform" {
  source = "./modules/app-platform"
  # Provisions ECS, RDS, ElastiCache, ALB, Route53, ACM, IAM...
}

The problem: they’re impossible to test, hard to iterate on, and when one resource fails, the entire module is in a bad state.

The Right Abstraction: Business-Domain Modules

The pattern that’s worked best for me is business-domain modules — modules that map to a meaningful unit of infrastructure in your system:

modules/
├── ecs-service/       # An ECS service with its IAM role, log group, and security group
├── rds-postgres/      # An RDS instance with parameter group, subnet group, and enhanced monitoring
├── static-site/       # S3 + CloudFront + ACM + Route53 — the whole CDN stack
└── vpc/               # VPC with subnets, route tables, and VPC endpoints

These modules have opinionated defaults. They encode your org’s standards. An ecs-service module always creates an IAM execution role. An rds-postgres module always enables deletion protection in production.

module "api_service" {
  source = "git::https://github.com/myorg/terraform-modules.git//ecs-service?ref=v2.1.0"

  name           = "payment-api"
  image          = var.docker_image
  cpu            = 512
  memory         = 1024
  desired_count  = 3
  environment    = var.environment
}

Clean. The complexity is in the module, not the call site.
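Inside the module, those opinionated defaults are just variables with sane defaults and validation. A minimal sketch of what the ecs-service module’s variables.tf might look like (the specific values and file layout are illustrative, not the module from the example above):

```hcl
# modules/ecs-service/variables.tf (illustrative)
variable "cpu" {
  type        = number
  default     = 256 # org-standard baseline; callers override only when needed
  description = "Fargate task CPU units"

  validation {
    condition     = contains([256, 512, 1024, 2048, 4096], var.cpu)
    error_message = "cpu must be a valid Fargate CPU value (256, 512, 1024, 2048, or 4096)."
  }
}
```

A caller that passes nothing gets the org default; a caller that passes a nonsense value fails at plan time instead of at deploy time.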

State Management: Remote and Isolated

Never use local state in a team environment. You already know this. But how you structure remote state matters more than most people realize.

One state file per logical deployment unit. Not one per environment, not one for everything:

terraform/
├── networking/     ← VPC, subnets, transit gateway
├── platform/       ← ECS cluster, RDS, ElastiCache
├── services/
│   ├── api/        ← The API service and its dependencies
│   └── workers/    ← Background job workers
└── dns/            ← Route53 records (depends on services)

Each directory is an independent Terraform root. This means:

# Breaking change to networking doesn't block a services deploy
cd terraform/services/api && terraform apply

Cross-state references use terraform_remote_state data sources:

data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# Use the VPC ID from the networking state
resource "aws_security_group" "api" {
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
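For that cross-state reference to resolve, the networking root has to export the value. A minimal outputs.tf, assuming the VPC resource in that root is named main:

```hcl
# terraform/networking/outputs.tf (resource name "main" is an assumption)
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "Consumed by downstream roots via terraform_remote_state"
}
```

Treat these outputs as a public API: downstream roots depend on them, so renaming or removing one is a breaking change.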

Preventing Drift

Drift — when your actual infrastructure doesn’t match your Terraform state — is the silent killer of IaC practices. It’s insidious because it accumulates gradually and only becomes visible during incidents.

The fix is boring but necessary: scheduled terraform plan runs in CI.

# .github/workflows/drift-detection.yml
on:
  schedule:
    - cron: '0 8 * * 1-5'  # Weekday mornings

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Terraform Plan (drift check)
        run: |
          cd terraform/platform
          terraform init -backend-config="..."
          terraform plan -detailed-exitcode
        # Exit code 2 = changes detected = drift

When this pipeline fails, someone investigates. Usually it’s manual console changes by someone “just doing a quick fix.” Those console changes get codified or reverted. The discipline builds over time.
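With -detailed-exitcode, terraform plan returns 0 for no changes, 1 for an error, and 2 for pending changes. A CI step can branch on those explicitly rather than treating every non-zero exit the same way; a sketch (the plan.log filename is illustrative):

```shell
#!/usr/bin/env sh
# Capture the exit code without aborting the script
set +e
terraform plan -detailed-exitcode -no-color > plan.log 2>&1
code=$?
set -e

case "$code" in
  0) echo "No drift detected." ;;
  2) echo "Drift detected -- see plan.log"; exit 1 ;;        # fail the job so someone investigates
  *) echo "terraform plan errored (exit $code)"; exit "$code" ;;
esac
```

The distinction matters: exit 1 means the pipeline itself is broken, while exit 2 means the infrastructure has drifted, and you usually want different alerts for each.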

The moved Block Is Underrated

Refactoring Terraform modules used to mean destroying and recreating resources. The moved block (introduced in Terraform 1.1) eliminates that:

moved {
  from = aws_s3_bucket.legacy_bucket
  to   = module.static_site.aws_s3_bucket.assets
}

Terraform understands that the resource didn’t change — it just moved in your configuration. No destroy, no recreate. This makes module refactoring safe enough to actually do.
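The same mechanism covers refactors into count or for_each, where every instance gets a new indexed address. A sketch with a hypothetical resource name:

```hcl
# Refactoring a single resource into a for_each map (names are illustrative)
moved {
  from = aws_instance.app
  to   = aws_instance.app["primary"]
}
```

Without the moved block, that refactor would plan a destroy of aws_instance.app and a create of aws_instance.app["primary"].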

Keep It Boring

The best Terraform is boring Terraform. Avoid clever meta-arguments, minimize count and for_each where a change to the index would force resources to be destroyed and recreated, pin module and provider versions explicitly, and write outputs.tf religiously. The goal is infrastructure that a new team member can read and understand at 3am when something is on fire.
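Pinning looks like this in practice. The source example already pins the module ref (?ref=v2.1.0); the provider and CLI versions belong in a terraform block (version numbers here are illustrative):

```hcl
terraform {
  required_version = "~> 1.6" # allow patch/minor updates within 1.6.x and up

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # pessimistic constraint: any 5.x, never 6.0
    }
  }
}
```

Combined with a committed .terraform.lock.hcl, every engineer and every CI run resolves the same provider versions.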