Why I Stopped Writing Terraform Modules and What I Use Instead
I didn’t set out to build 80 Terraform modules. Nobody does. It just happens — you write a VPC module, then someone needs a slightly different subnet layout for staging, so you add a variable. Then another. Then someone asks for optional NAT gateways. Two years later you’re maintaining 2,400+ variable definitions across 80 modules and wondering where your weeks go.
I wrote about this on LinkedIn a while back and the number of “are you inside my codebase?” replies told me this isn’t just my problem. And it gets worse when you’re multi-cloud — we run both AWS and GCP, which means the module sprawl was happening in two ecosystems at once, with different provider quirks, different state backends, and different networking primitives. So I want to walk through what actually went wrong with the traditional module approach, the pattern my team uses instead, and why I think now — with Terragrunt 1.0 — is the right time to make the switch if you’ve been on the fence.
How Module Proliferation Actually Happens
Terraform modules solve a real problem: code reuse. When you have the same VPC configuration deployed across 12 environments, you don’t want to copy-paste 200 lines of HCL twelve times. Modules let you define the resource once and parameterize it. So far, so good.
The trouble starts when you try to make modules handle configuration management too. Your VPC module works great for dev. But prod needs different CIDR ranges, different subnet tiers, different NAT gateway setups, flow log settings that comply with your security team’s requirements. So you add variables. Then conditionals. Then for_each loops over maps of maps.
Here’s what a “flexible” VPC module’s variables file looks like after two years of organic growth:
variable "vpc_name" {}
variable "cidr_block" {}
variable "enable_dns_support" { default = true }
variable "enable_dns_hostnames" { default = true }
variable "enable_nat_gateway" { default = false }
variable "single_nat_gateway" { default = false }
variable "one_nat_gateway_per_az" { default = false }
variable "enable_vpn_gateway" { default = false }
variable "enable_flow_log" { default = false }
variable "flow_log_destination_type" { default = "cloud-watch-logs" }
variable "flow_log_retention_days" { default = 14 }
variable "public_subnets" { default = [] }
variable "private_subnets" { default = [] }
variable "database_subnets" { default = [] }
variable "elasticache_subnets" { default = [] }
variable "create_database_subnet_group" { default = true }
variable "create_elasticache_subnet_group" { default = true }
variable "public_subnet_tags" { default = {} }
variable "private_subnet_tags" { default = {} }
variable "tags" { default = {} }
# ... 15 more variables
That’s 30+ variables for a single module. Multiply by 80 modules and you’re looking at 2,400+ variable definitions. Each one is a potential misconfiguration. Each combination needs to be tested — and mostly isn’t.
But the real cost isn’t writing the module. It’s versioning it. Every change to a shared module triggers a version bump. Every bump requires every consumer to update their module source reference. For a module consumed by 20 stacks, a single variable addition means 20 PRs, 20 plan reviews, and 20 applies. I’ve lost entire sprints to this cycle.
The Thing I Got Wrong for Two Years
It took me embarrassingly long to see the actual problem. We weren’t bad at writing modules — we were using modules to solve something they were never designed for.
Terraform modules conflate two separate concerns:
- Resource definition — what cloud resources should exist and how they’re wired together
- Configuration management — how those resources differ across environments, regions, and compliance domains
When you use a single module to handle both, you end up with a “flexible” module that’s really just a thin wrapper around the provider resource with extra variables on top. If your module exposes a variable for every argument the underlying resource accepts, ask yourself: what value is the module actually providing? You’ve moved the configuration from one file to another and added version management overhead for the privilege.
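To make that concrete, here's the anti-pattern in miniature: a hypothetical wrapper module that does nothing but forward variables to the provider resource (variable declarations omitted):
# A hypothetical pass-through "module": every argument is just a variable
resource "google_compute_network" "vpc" {
  name                            = var.name
  auto_create_subnetworks         = var.auto_create_subnetworks
  routing_mode                    = var.routing_mode
  delete_default_routes_on_create = var.delete_default_routes_on_create
  project                         = var.project_id
}
# Nothing is decided here. The module adds a version to manage, not an abstraction.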
Here’s how I think about the split now:
- What gets created? What’s connected to what? → Terraform root module
- What CIDR range? What region? How many replicas? → Terragrunt
- What order to deploy? What depends on what? → Terragrunt dependency blocks
- Where does state live? How is it partitioned? → Terragrunt remote_state
Terraform modules are excellent at resource definition. They’re terrible at configuration management. Once I separated these concerns, everything got simpler.
The Pattern: Opinionated Root Modules + Terragrunt
Here’s what we use instead of the 80-module library.
Small, Opinionated Root Modules
Each root module handles exactly one resource type or one tightly-coupled group of resources. No Swiss Army knives. No 30-variable interfaces. The code examples below are GCP because that’s where I’ll show the Terragrunt wiring — but the same pattern drives our AWS side too (VPCs, Transit Gateway, EKS, RDS), and you can see both in the Multi-Cloud Runway repo.
# modules/vpc/main.tf
resource "google_compute_network" "vpc" {
name = var.network_name
auto_create_subnetworks = false
routing_mode = "GLOBAL"
project = var.project_id
}
resource "google_compute_subnetwork" "subnets" {
for_each = var.subnets
name = each.value.name
ip_cidr_range = each.value.cidr
region = each.value.region
network = google_compute_network.vpc.id
project = var.project_id
dynamic "secondary_ip_range" {
for_each = each.value.secondary_ranges
content {
range_name = secondary_ip_range.value.range_name
ip_cidr_range = secondary_ip_range.value.ip_cidr_range
}
}
}
# modules/vpc/variables.tf
variable "project_id" {
type = string
}
variable "network_name" {
type = string
}
variable "subnets" {
type = map(object({
name = string
cidr = string
region = string
secondary_ranges = optional(list(object({
range_name = string
ip_cidr_range = string
})), [])
}))
}
That’s the entire module. Three variables. No conditional logic. No feature flags. No optional resources gated behind boolean variables. It creates a VPC and subnets — nothing more.
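One thing I've left out above is the outputs file. For the dependency wiring shown later to work, the module needs to export something like the following (my reconstruction, so check the actual repo for exact names):
# modules/vpc/outputs.tf (reconstructed, not copied from the repo)
output "network_name" {
  value = google_compute_network.vpc.name
}
output "subnet_names" {
  # Keyed the same way as var.subnets, e.g. subnet_names["gke"]
  value = { for key, subnet in google_compute_subnetwork.subnets : key => subnet.name }
}
output "project_id" {
  value = var.project_id
}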
Terragrunt for the Configuration Layer
Environment differences live in terragrunt.hcl, not in variable permutations:
# infrastructure/prod/network/terragrunt.hcl
terraform {
source = "git::https://github.com/cloudon-one/terraform-google-modules.git//modules/vpc?ref=v1.2.0"
}
include "root" {
path = find_in_parent_folders("root.hcl")
}
dependency "folders" {
config_path = "../../folders"
}
inputs = {
project_id = dependency.folders.outputs.network_project_id
network_name = "prod-vpc"
subnets = {
gke = {
name = "prod-gke-subnet"
cidr = "10.0.0.0/20"
region = "us-central1"
secondary_ranges = [
{ range_name = "pods", ip_cidr_range = "10.4.0.0/14" },
{ range_name = "services", ip_cidr_range = "10.8.0.0/20" }
]
}
data = {
name = "prod-data-subnet"
cidr = "10.1.0.0/24"
region = "us-central1"
}
}
}
And the dev environment:
# infrastructure/dev/network/terragrunt.hcl
terraform {
source = "git::https://github.com/cloudon-one/terraform-google-modules.git//modules/vpc?ref=v1.2.0"
}
include "root" {
path = find_in_parent_folders("root.hcl")
}
dependency "folders" {
config_path = "../../folders"
}
inputs = {
project_id = dependency.folders.outputs.network_project_id
network_name = "dev-vpc"
subnets = {
gke = {
name = "dev-gke-subnet"
cidr = "10.128.0.0/20"
region = "us-central1"
secondary_ranges = [
{ range_name = "pods", ip_cidr_range = "10.132.0.0/14" },
{ range_name = "services", ip_cidr_range = "10.136.0.0/20" }
]
}
}
}
Same module, different configuration. The module doesn’t know or care whether it’s prod or dev — it just creates what it’s told to create.
Dependency Blocks Instead of Module Nesting
In the old setup, we nested module calls inside other modules to compose infrastructure. A “platform” module would call a VPC module, a GKE module, and a Cloud SQL module internally. This creates deeply nested state, makes targeted applies painful, and turns debugging into archaeology — you’re three levels deep in a module before you find the resource that’s actually broken.
Terragrunt’s dependency blocks replace this with explicit wiring between standalone components:
# infrastructure/prod/gke/terragrunt.hcl
dependency "network" {
config_path = "../network"
}
inputs = {
network = dependency.network.outputs.network_name
subnetwork = dependency.network.outputs.subnet_names["gke"]
project_id = dependency.network.outputs.project_id
}
Each component has its own state file, its own plan/apply lifecycle, and explicit inputs from its dependencies. I can plan the GKE cluster without re-planning the VPC. I can apply the VPC without touching the GKE cluster. I can destroy the GKE cluster without the network state getting involved.
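One detail worth calling out: if the network stack hasn't been applied yet, the dependency block has no outputs to read. Terragrunt's mock_outputs covers that case. A minimal sketch, with obviously fake placeholder values:
# infrastructure/prod/gke/terragrunt.hcl (dependency block with mocks for early plans)
dependency "network" {
  config_path = "../network"

  # Used only when the network stack has no real outputs yet
  mock_outputs = {
    network_name = "mock-vpc"
    subnet_names = { gke = "mock-gke-subnet" }
    project_id   = "mock-project"
  }
  mock_outputs_allowed_terraform_commands = ["plan", "validate"]
}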
From 80 Modules to 20 Root Configs
I want to be upfront about what “80 to 20” actually means. We didn’t just delete 60 modules. What happened is that most of our 80 modules were near-duplicates or thin wrappers — slight variations of the same resource with different variable sets bolted on for different environments. Once Terragrunt took over the configuration layer, those variations collapsed. The 20 that remained are genuinely distinct resource definitions.
Here’s what the shift looked like in practice:
- Module count: 80+ down to 20 root configs
- Variables per module: 30+ down to 5-8
- Version bumps per month: 40+ down to 5-10
- Time to add a new environment: 2-3 days down to 2-3 hours
- Cross-stack dependencies: Manual and error-prone, now automatic via dependency blocks
- State file granularity: Monolithic per environment, now per component
- Blast radius of a change: Entire environment, now single component
The “time to add a new environment” improvement was the most dramatic, and also the one my team noticed first. Previously, adding an environment meant creating PRs against 20+ module consumers, coordinating the review cycle, and sequencing the applies. Now it’s a new terragrunt.hcl with different values. That’s it.
Getting Started
If you’re sitting on a bloated module library and want to migrate, here’s the path I’d recommend. Don’t try to do it all at once — we didn’t.
Step 1: Find Your Wrapper Modules
Run this across your modules directory:
for dir in modules/*/; do
  vars=$(grep -c '^variable' "$dir/variables.tf" 2>/dev/null)
  resources=$(grep -c '^resource' "$dir/main.tf" 2>/dev/null)
  echo "$dir: ${vars:-0} variables, ${resources:-0} resources"
done | sort -t: -k2 -rn
This is a rough heuristic, not a definitive test. But any module where the variable count exceeds the resource count by 3x or more is probably doing configuration management, not meaningful abstraction. Use it as a starting point to identify migration candidates.
Step 2: Set Up the Terragrunt Directory Structure
infrastructure/
  root.hcl                  # Shared remote state + provider config
  prod/
    network/
      terragrunt.hcl        # VPC config for prod
    gke/
      terragrunt.hcl        # GKE config for prod
  dev/
    network/
      terragrunt.hcl        # VPC config for dev
The root.hcl handles remote state and provider configuration once:
# root.hcl
remote_state {
backend = "gcs"
config = {
bucket = "my-terraform-state"
prefix = path_relative_to_include()
project = "my-state-project"
location = "us-central1"
}
}
generate "provider" {
path = "provider.tf"
if_exists = "overwrite"
contents = <<EOF
provider "google" {
project = var.project_id
region = "us-central1"
}
EOF
}
Step 3: Migrate One Stack at a Time
Pick one non-critical module. Create the Terragrunt wrapper. Import existing state. Validate with terragrunt plan showing no changes. Then move to the next one.
cd infrastructure/dev/network
terragrunt plan
# Should show: No changes. Infrastructure is up-to-date.
We started with the dev VPC and worked our way up to prod over about three weeks. Don’t be a hero — if the first plan shows unexpected diffs, stop and figure out why before moving on.
Step 4: Scale Across Environments
Once the first stack works, replicate the pattern. Adding a new environment is now creating a new terragrunt.hcl with different values — not modifying a module, bumping a version, and coordinating downstream consumers.
For compliance-heavy organizations running multiple domains — commercial, PCI, HIPAA, FedRAMP — this pattern scales particularly well. Each domain gets its own directory tree with domain-specific configuration, all pointing to the same small set of root modules. Constraints like “FedRAMP requires US-only regions” or “PCI needs isolated DNS zones” are expressed in the Terragrunt configuration, not baked into module conditionals where they’d create branching logic that’s hard to audit.
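To make that concrete, here's a minimal sketch of how a domain-level constraint might live in the tree. The file names and values are illustrative, not taken from the template:
# infrastructure/fedramp/domain.hcl (illustrative domain-wide constraints)
locals {
  allowed_regions = ["us-central1", "us-east4"] # FedRAMP: US-only regions
  default_region  = "us-central1"
}

# infrastructure/fedramp/network/terragrunt.hcl (excerpt)
locals {
  domain = read_terragrunt_config(find_in_parent_folders("domain.hcl"))
}

inputs = {
  network_name = "fedramp-vpc"
  subnets = {
    gke = {
      name   = "fedramp-gke-subnet"
      cidr   = "10.32.0.0/20"
      region = local.domain.locals.default_region
    }
  }
}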
If you want to see what this looks like end-to-end across both AWS and GCP — with Organization/OU structures, Transit Gateway, Shared VPC, GKE, IAM, and security guardrails already wired up — I’ve open-sourced the Multi-Cloud Runway template. It’s a complete landing zone scaffold with separate aws-terragrunt-configuration/ and gcp-terragrunt-configuration/ trees, centralized vars.yaml per cloud, and compliance validation scripts baked in. You can fork it and have a working multi-account, multi-region, security-hardened skeleton in an afternoon instead of spending weeks figuring out the directory layout.
What We Lost (Being Honest)
I’d be lying if I said there were no trade-offs.
Onboarding takes a beat longer. New engineers need to understand Terragrunt’s include/dependency model on top of Terraform. It’s not complicated, but it’s one more thing. We wrote an internal “how we do IaC” doc and that mostly solved it.
IDE support is weaker. Terraform has first-class IDE plugins. Terragrunt HCL gets partial support at best. You lose some autocomplete and jump-to-definition convenience. In practice this matters less than I expected — the files are short enough that you rarely need IDE navigation.
run-all can surprise you. Running terragrunt run-all plan across 40 stacks is powerful, but when it fails halfway through due to a transient auth issue, the error output is… not great. We learned to run it with --terragrunt-parallelism 4 and break large runs into domain-scoped batches.
None of these were dealbreakers. But if someone tells you there are zero downsides to adopting Terragrunt, they’re selling you something.
Why Terragrunt 1.0 Makes This the Right Time
I’ll be direct: I would have been more hesitant to recommend this pattern two years ago. Terragrunt was a tool with no stability promises — the configuration DSL could change between minor versions, and upgrades sometimes meant rewriting chunks of your HCL.
Terragrunt 1.0 changed that. With semantic versioning guarantees, the terragrunt.hcl files I write today will work with future 1.x releases. That matters when you’re running this in production across compliance domains.
The features that actually matter for this pattern:
- Units — first-class support for the “one component per directory” model I’ve been describing. This went from a community convention to a supported concept.
- Stacks — orchestrate groups of units with explicit dependency ordering, which is exactly the run-all workflow but with better control (a sketch follows this list).
- Engine mode — persistent Terraform process pool that speeds up plan/apply cycles across many stacks. Noticeable difference when you have 40+ components.
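For orientation, here's roughly what a stack definition looks like. I'm paraphrasing the unit block syntax from the 1.0 docs rather than showing one of our own files, and the catalog paths are hypothetical:
# infrastructure/prod/terragrunt.stack.hcl (sketch; unit sources are hypothetical)
unit "network" {
  source = "git::https://github.com/your-org/terragrunt-catalog.git//units/network?ref=v1.0.0"
  path   = "network"
}

unit "gke" {
  source = "git::https://github.com/your-org/terragrunt-catalog.git//units/gke?ref=v1.0.0"
  path   = "gke"
}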
Common Objections
“Terragrunt is another tool to learn.”
Fair. But be honest with yourself — is your module library really simpler? I had engineers who’d been on the team six months and still didn’t know which variables were safe to change in our VPC module. Terragrunt’s entire config surface is smaller than the interface of our old network module alone.
“My modules have real logic — conditionals, loops, dynamic blocks.”
If the logic is about resource composition (“create a NAT gateway and route table and wire them together”), keep it in a module. That’s genuine abstraction. If the logic is about environment configuration (“use single NAT in dev, one-per-AZ in prod”), move it to Terragrunt. Most teams have more of the second kind than they think.
“Terraform workspaces solve this.”
Workspaces solve state isolation. They don’t solve configuration management, dependency orchestration, or state backend configuration. You still end up with the same conditional logic inside modules to handle per-workspace differences — var.environment == "prod" ? 3 : 1 scattered throughout your codebase.
“OpenTofu will fix this.”
OpenTofu addresses licensing concerns and is adding useful features like early variable evaluation and for_each on providers. But it doesn’t solve the configuration-management-via-modules problem. You’ll still need something to handle per-environment values, dependency ordering, and state backend configuration. Whether you run Terraform or OpenTofu underneath, Terragrunt sits on top. I’m watching OpenTofu closely, but it’s solving a different problem than the one this article is about.
Wrapping Up
Look, your 80-module repo isn’t a library. It’s technical debt that accumulated one “just add a variable” at a time. I know because I built one. The fix isn’t better modules — it’s separating resource definition from configuration management. Let Terraform do what it’s good at: defining resources. Let Terragrunt handle the rest.
If you’re spending more time managing module versions than building infrastructure, take a weekend and try the migration on one non-critical stack. For the full picture — both clouds, compliance frameworks, CI/CD pipeline, and the exact directory layout described in this article — start with the Multi-Cloud Runway template. It has production-grade AWS and GCP landing zones ready to fork. For the individual pieces, the CloudOn GCP Terraform modules show the small, opinionated root module pattern, and the CloudOn GCP Landing Zone shows the Terragrunt configuration layer on the GCP side. Fork them, adapt them, and see if your Mondays get better.