What you’ll learn and why it matters
Infrastructure as Code (IaC) lets Platform Engineers define, version, review, and automate cloud resources using code. It reduces manual errors, speeds up delivery, and makes environments reproducible across dev, stage, and prod.
- Spin up consistent environments on demand.
- Use pull requests, code reviews, and CI/CD for infra changes.
- Bake in security and compliance with policies.
- Detect and remediate drift quickly.
- Enable teams with reusable, standards-compliant modules.
Who this is for
- Platform Engineers building and maintaining shared cloud platforms.
- Backend Engineers owning service infrastructure.
- SREs seeking predictable, automated environments.
Prerequisites
- Basic cloud knowledge (e.g., compute, networking, IAM concepts).
- Git fundamentals: branching, PRs, code review.
- CLI comfort (shell, environment variables).
- Optional: CI/CD basics to run plans and applies safely.
Learning path
1) Terraform core
- Install Terraform; learn providers, resources, variables, outputs, state.
- Use workspaces or directory layout for environments.
- Run init/plan/apply/destroy and interpret outputs.
2) Reusable modules and standards
- Create modules with clear inputs/outputs.
- Adopt naming, tagging, and file structure conventions.
- Version modules; add examples and READMEs.
3) Environments: dev, stage, prod
- Separate state and config per environment.
- Use per-environment variable files (.tfvars) or separate HCP Terraform (formerly Terraform Cloud) workspaces.
- Promote changes from dev → stage → prod via PRs.
4) Networking and IAM as code
- Model VPCs, subnets, routes, SGs, and peering.
- Write least-privilege IAM roles/policies for workloads and CI.
5) Secrets and configuration
- Keep secrets out of state when possible; mark sensitive variables.
- Integrate with secret stores (e.g., SSM Parameter Store, Vault).
- Template app configs with environment-specific values.
6) Policy as Code
- Write policies to enforce tagging, regions, and encryption.
- Fail plans that violate guardrails before they reach prod.
7) Drift detection and remediation
- Detect drift using plans; alert on differences.
- Codify desired state; remove manual changes.
8) Change management
- PR-based plans with mandatory review and policy checks.
- Apply gates: approvals, maintenance windows, change freeze rules.
Worked examples
1) Terraform basics: versioned S3 bucket with outputs
# main.tf
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
  }
}
provider "aws" {
  region = var.region
}
resource "aws_s3_bucket" "logs" {
  bucket = var.bucket_name
  tags = {
    env     = var.env
    owner   = var.owner
    purpose = "access-logs"
  }
}
resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id
  versioning_configuration {
    status = "Enabled"
  }
}
output "bucket_arn" {
  value = aws_s3_bucket.logs.arn
}
# variables.tf
variable "region" { type = string }
variable "bucket_name" { type = string }
variable "env" { type = string }
variable "owner" { type = string }
# commands
# terraform init
# terraform plan -out=tfplan -var="region=us-east-1" -var="bucket_name=acme-logs-dev" -var="env=dev" -var="owner=platform"
# terraform apply tfplan   # applies exactly the saved, reviewed plan; no variables need to be repeated
Result: a versioned bucket with consistent tags and an output you can reuse in other modules.
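One way to reuse the bucket_arn output from a separate Terraform configuration is the terraform_remote_state data source. The sketch below assumes the bucket example keeps its state in an S3 backend; the state key shown is hypothetical, and my-tf-state is the state bucket name used elsewhere in this guide.

```hcl
# Read the outputs of another configuration from its remote state.
# Assumption: the bucket example stores state at this (hypothetical) key.
data "terraform_remote_state" "logging" {
  backend = "s3"
  config = {
    bucket = "my-tf-state"
    key    = "dev/logging/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Root-module outputs of the other configuration are exposed under .outputs
  log_bucket_arn = data.terraform_remote_state.logging.outputs.bucket_arn
}
```

Within a single configuration, the same value is simply referenced as a module output (module.NAME.OUTPUT) and no remote state lookup is needed.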
2) Reusable VPC module (usage example)
# modules/vpc/variables.tf
variable "name" { type = string }
variable "cidr" { type = string }
variable "az_count" { type = number }
# modules/vpc/main.tf (simplified)
# Look up the AZs in the current region so each subnet lands in a different one.
data "aws_availability_zones" "available" {
  state = "available"
}
resource "aws_vpc" "this" {
  cidr_block = var.cidr
  tags       = { Name = var.name }
}
resource "aws_subnet" "private" {
  count                   = var.az_count
  vpc_id                  = aws_vpc.this.id
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  cidr_block              = cidrsubnet(var.cidr, 4, count.index) # /16 input yields /20 subnets
  map_public_ip_on_launch = false
  tags                    = { Tier = "private", Name = "${var.name}-priv-${count.index}" }
}
# modules/vpc/outputs.tf
output "vpc_id" { value = aws_vpc.this.id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }
# envs/dev/main.tf
module "vpc" {
  source   = "../../modules/vpc"
  name     = "acme-dev"
  cidr     = "10.10.0.0/16"
  az_count = 2
}
Result: a reusable foundation you can version and promote across environments.
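Promotion works best when each environment pins an explicit module version. A minimal sketch, assuming the module lives in a Git repository with version tags; the repository URL, the v1.2.0 tag, and the stage CIDR are hypothetical:

```hcl
# envs/stage/main.tf
module "vpc" {
  # "//modules/vpc" selects a subdirectory; "?ref=" pins a Git tag.
  # Hypothetical repository URL and tag.
  source = "git::https://example.com/acme/terraform-modules.git//modules/vpc?ref=v1.2.0"

  name     = "acme-stage"
  cidr     = "10.20.0.0/16"
  az_count = 2
}
```

Bumping the ref in dev first, then stage, then prod gives each promotion its own reviewable PR.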
3) Least-privilege IAM role for CI to run Terraform
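# GitHub Actions has no AWS service principal; it assumes roles through OIDC federation.
# Assumes the GitHub OIDC provider already exists in the account. Both variables are illustrative inputs.
variable "github_oidc_provider_arn" { type = string }
variable "github_repo" { type = string } # "org/repo"
resource "aws_iam_role" "tf_ci" {
  name = "tf-ci-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect    = "Allow",
      Principal = { Federated = var.github_oidc_provider_arn },
      Action    = "sts:AssumeRoleWithWebIdentity",
      Condition = {
        StringEquals = { "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com" },
        StringLike   = { "token.actions.githubusercontent.com:sub" = "repo:${var.github_repo}:*" }
      }
    }]
  })
}
resource "aws_iam_policy" "tf_limited" {
  name = "tf-ci-limited"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      # Read-only discovery
      { Effect = "Allow", Action = ["ec2:Describe*", "s3:ListAllMyBuckets"], Resource = "*" },
      # State access: the S3 backend lists the bucket and reads/writes the state object
      { Effect = "Allow", Action = ["s3:ListBucket"], Resource = ["arn:aws:s3:::my-tf-state"] },
      { Effect = "Allow", Action = ["s3:PutObject", "s3:GetObject"], Resource = ["arn:aws:s3:::my-tf-state/*"] }
    ]
  })
}
resource "aws_iam_role_policy_attachment" "attach" {
  role       = aws_iam_role.tf_ci.name
  policy_arn = aws_iam_policy.tf_limited.arn
}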
Grant the minimal permissions needed for plans, state access, and read-only discovery.
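4) Secrets handling with sensitive variables (keep secrets out of logs and plan output)
# variables.tf
variable "db_password" {
  type      = string
  sensitive = true # redacted in plan/apply output; the value is still stored in state
}
# main.tf (store the secret in SSM Parameter Store as a SecureString)
resource "aws_ssm_parameter" "db_password" {
  name  = "/acme/${var.env}/db_password"
  type  = "SecureString"
  value = var.db_password
  # The "overwrite" argument is deprecated in AWS provider v5 and omitted here.
}
# CLI usage (avoid typing the secret on the command line)
# export TF_VAR_db_password="$(pbpaste)" # pbpaste is macOS-only; in CI, set it from the CI secret store
# terraform apply -var="env=dev"
Mark variables as sensitive and use a secret store. Note that sensitive = true only redacts the value from CLI output: the value passed to aws_ssm_parameter is still written to state, so encrypt the state backend and restrict who can read it. Avoid logging or outputting secrets.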
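When the secret is created outside Terraform (for example, by a rotation job or an operator), the configuration can pass only the parameter path and let the workload read the value at runtime, so the secret never enters state. A minimal sketch under that assumption; the policy name and the wildcard region/account in the ARN are illustrative and should be tightened in practice:

```hcl
variable "env" { type = string }

locals {
  # Only the path is known to Terraform, never the secret value.
  db_password_param = "/acme/${var.env}/db_password"
}

# Lets the workload fetch the parameter itself at runtime.
# A customer-managed KMS key would also require kms:Decrypt (not shown).
data "aws_iam_policy_document" "read_db_password" {
  statement {
    actions   = ["ssm:GetParameter"]
    resources = ["arn:aws:ssm:*:*:parameter${local.db_password_param}"]
  }
}

resource "aws_iam_policy" "read_db_password" {
  name   = "acme-${var.env}-read-db-password" # hypothetical name
  policy = data.aws_iam_policy_document.read_db_password.json
}

# Hand the path, not the value, to the application config.
output "db_password_param" {
  value = local.db_password_param
}
```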
5) Drift detection and remediation
Detect manual changes by running a plan regularly (in CI or on a schedule):
# steps
# 1) terraform init
# 2) terraform plan -detailed-exitcode
# Exit codes: 0 = no changes, 2 = changes present, 1 = error
# If exit code is 2, alert and open a PR to reconcile or revert manual changes.
Always codify the desired state. If something must be changed urgently, follow up with a PR that updates the code.
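If the urgent change created a new resource by hand, one way to bring it under code is an import block (available from Terraform 1.5, which matches the required_version used earlier). The bucket name and tags below are hypothetical:

```hcl
# Adopt a bucket that was created manually during an incident.
import {
  to = aws_s3_bucket.incident_exports
  id = "acme-incident-exports" # for aws_s3_bucket, the import ID is the bucket name
}

resource "aws_s3_bucket" "incident_exports" {
  bucket = "acme-incident-exports"
  tags = {
    env     = "prod"
    owner   = "platform"
    purpose = "incident-exports"
  }
}
```

The next plan shows the import alongside any attribute differences, so the reconciliation goes through the same PR review as any other change.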
6) Policy as Code: deny untagged resources (OPA/Rego example)
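package terraform.tags
import rego.v1 # required on OPA 0.59+; optional on OPA 1.x
# Input: plan JSON produced by "terraform show -json tfplan"
deny contains msg if {
  some rc in input.resource_changes
  rc.type == "aws_instance"
  rc.change.actions[_] != "delete" # skip resources that are only being destroyed
  not rc.change.after.tags.env
  msg := sprintf("Instance %s missing tag 'env'", [rc.address])
}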
Run policy checks during plan to block resources without required tags. Start with simple rules (tags, regions, encryption) and expand.
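Some simple guardrails can also live in Terraform itself, which catches bad inputs before any external policy engine runs. A minimal sketch using variable validation; the approved-region list is a hypothetical example, not a recommendation:

```hcl
variable "env" {
  type = string
  validation {
    condition     = contains(["dev", "stage", "prod"], var.env)
    error_message = "The env value must be one of: dev, stage, prod."
  }
}

variable "region" {
  type = string
  validation {
    # Hypothetical allow-list of approved regions.
    condition     = contains(["us-east-1", "eu-west-1"], var.region)
    error_message = "The region value must be an approved region (us-east-1 or eu-west-1)."
  }
}
```

Validation only covers inputs to a given module; cross-cutting rules such as required tags on every resource still belong in a plan-level policy check.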
Common mistakes and debugging tips
Mixing state across environments
Keep separate state backends or workspaces for dev/stage/prod. Name them clearly and restrict access.
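A minimal sketch of a per-environment backend, assuming the S3 state bucket my-tf-state used earlier in this guide; the key layout is a hypothetical convention, and state locking is left out for brevity:

```hcl
# envs/dev/backend.tf
terraform {
  backend "s3" {
    bucket  = "my-tf-state"
    key     = "dev/platform/terraform.tfstate" # stage and prod use their own prefixes
    region  = "us-east-1"
    encrypt = true
  }
}
```

Because each environment directory has its own key (or its own bucket), a plan run in envs/dev cannot read or modify prod state, and bucket policies can restrict prod prefixes to the CI role.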
Hardcoding values instead of variables
Use variables and tfvars per environment. Hardcoded values block reuse and promotion.
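For example, the variables from the bucket example could be supplied from one file per environment. The file path is a hypothetical layout:

```hcl
# envs/dev/dev.tfvars
region      = "us-east-1"
env         = "dev"
owner       = "platform"
bucket_name = "acme-logs-dev"

# Usage (run from the environment directory):
#   terraform plan -var-file=dev.tfvars -out=tfplan
```

The same module code is then promoted unchanged, and only the tfvars file differs between dev, stage, and prod.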
Leaking secrets into state or logs
Mark variables as sensitive, rely on secret stores, and avoid outputs that include secrets. Review CI logs.
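If an output genuinely has to carry a value derived from a secret, Terraform requires it to be marked sensitive, and it is then redacted in CLI output. A small sketch; the connection string format and host name are hypothetical:

```hcl
variable "db_password" {
  type      = string
  sensitive = true
}

output "db_connection_string" {
  # Hypothetical host and database name.
  value     = "postgres://app:${var.db_password}@db.internal:5432/app"
  sensitive = true # omitting this is an error, because the value depends on a sensitive variable
}
```

The value is still stored in state and is readable with terraform output -raw, so prefer exposing a parameter name or ARN instead of the secret wherever possible.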
Overly permissive IAM policies
Start with read-only and add specific actions as needed. Validate with IAM Access Advisor (last-accessed data) and CI policy checks.
Ignoring plan warnings
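Warnings usually flag deprecated arguments or upcoming provider changes; destructive changes appear in the plan itself as destroy or replace actions. Resolve warnings and review every destroy or replace before apply.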
Manual hotfixes without code updates
Any manual change creates drift. Follow up with a PR that updates code or revert to the desired state.
Mini project: Three-environment microservice platform
- Create a modules folder with: vpc, app_role, service (compute + load balancer), and logging.
- Define envs/dev, envs/stage, envs/prod with separate state backends and tfvars.
- Provision:
- VPC with private subnets and required routing.
- Service module (container or VM) with health checks.
- Least-privilege IAM role for the service to read from a secret store.
- Centralized logs (e.g., to S3/CloudWatch) with retention policies.
- Add a policy that denies resources missing env and owner tags.
- Implement CI: on PR, run terraform fmt, validate, and plan. Require approval before apply.
- Demonstrate promotion: same module versions, different tfvars per environment.
- Simulate drift in dev, detect with plan, and remediate by updating code.
Stretch goals
- Introduce module version pinning and a changelog.
- Add cost tags and a budget alarm resource.
- Create a rollback playbook for failed applies.
Next steps
- Work through the subskills in order (basics → policies → change management).
- Finish the mini project and keep it as a portfolio asset.
- Take the skill exam below to validate your readiness.