Who this is for
- Machine Learning Engineers who need reproducible cloud environments for experiments, training, and deployment.
- Data Scientists transitioning to production ML with minimal DevOps background.
- Anyone aiming to standardize and track infrastructure changes for ML pipelines.
Prerequisites
- Basic CLI comfort (terminal, environment variables).
- High-level understanding of cloud resources: compute, storage, networking.
- Optional but helpful: familiarity with YAML/JSON or a general programming mindset.
Why this matters
In real ML work, you will:
- Spin up GPU training machines, storage buckets for datasets, and networks securely and repeatedly.
- Create identical dev/staging/prod environments for models and features.
- Track and review infrastructure changes like code, enabling code review and rollback.
- Control cost by deleting infra reliably after experiments.
Example day-in-the-life tasks
- Provision a GPU instance, attach an artifact bucket, and tag resources for cost tracking.
- Roll out a feature store or inference endpoint with identical configs across regions.
- Audit an infrastructure change via a pull request diff instead of manual clicks.
Concept explained simply
Infrastructure as Code (IaC) means describing your cloud setup in files instead of clicking in the console. You write declarative files that say what you want (e.g., a bucket with versioning, a VM with a GPU), and a tool makes reality match the files.
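For example, a single declarative resource in Terraform HCL (the bucket name below is a placeholder) is enough to express "I want a bucket":
resource "aws_s3_bucket" "example" {
  bucket = "my-ml-artifacts-example" # placeholder; bucket names must be globally unique
}
Run the tool and it creates the bucket if missing, updates it if it differs, and leaves it alone if it already matches.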
Mental model
- Shopping list model: Your IaC files are a precise shopping list for cloud resources. The tool compares your list with what you already have and adds/changes/removes items to match.
- Version control: Because your list is code, you commit it, review it, and roll it back like any software change.
- Idempotency: Applying the same configuration again should not create duplicates; it should converge to the same state.
Core concepts and terms
- Declarative vs. imperative: Most IaC tools are declarative (you state the desired end state).
- Providers: Plugins that know how to talk to a cloud (e.g., AWS, GCP, Azure).
- Resources: The things you create (buckets, instances, networks).
- Modules: Reusable building blocks (like a standard GPU instance module for your team).
- Variables/Outputs: Inputs to customize modules and outputs to expose useful values (like IP addresses).
- State: A file that records what is currently deployed. Use remote, locked state for teams.
- Plan/Apply: Plan shows what would change. Apply executes changes.
- Drift: When reality changes behind your back (e.g., someone edits in the console). IaC can detect and reconcile.
- Tagging/Labels: Attach metadata for ownership, environment, and cost tracking.
- Security: Least-privilege roles, secret handling, and no hard-coded credentials in code.
Common tools you'll hear about
- Terraform (HCL, multi-cloud)
- CloudFormation/SAM (AWS), Bicep (Azure), Deployment Manager (GCP), Pulumi (multi-language)
- Kubernetes manifests/Helm (for cluster-level resources)
Worked examples
Example 1: Storage bucket for ML artifacts with versioning and lifecycle
Goal: Version your model artifacts and move older versions to cheaper storage after 30 days.
terraform { required_version = ">= 1.5.0" }
provider "aws" { region = var.region }
variable "region" { type = string }
variable "bucket_name" { type = string }
resource "aws_s3_bucket" "artifacts" {
bucket = var.bucket_name
tags = {
project = "ml-platform"
env = "dev"
owner = "ml-team"
}
}
resource "aws_s3_bucket_versioning" "this" {
bucket = aws_s3_bucket.artifacts.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_lifecycle_configuration" "this" {
bucket = aws_s3_bucket.artifacts.id
  rule {
    id     = "archive-old"
    status = "Enabled"
    filter {} # empty filter applies the rule to all objects
    noncurrent_version_transition {
      noncurrent_days = 30
      storage_class   = "STANDARD_IA"
    }
  }
}
output "artifact_bucket" { value = aws_s3_bucket.artifacts.bucket }
What to look for: terraform plan shows the new bucket, versioning configuration, and lifecycle rule. After apply, the bucket name appears in the outputs.
Example 2: GPU training instance with restricted SSH and cost tags
Goal: Create a GPU-enabled VM for training, allow SSH only from your IP, and tag it for cost tracking.
variable "my_ip" { type = string } # e.g. "203.0.113.5/32"
resource "aws_security_group" "trainer_sg" {
name = "trainer-sg"
description = "SSH from my IP only"
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = [var.my_ip]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = { env = "dev", owner = "ml-team" }
}
resource "aws_instance" "gpu_trainer" {
ami = "ami-gpu-placeholder" # use a real GPU AMI
instance_type = "g4dn.xlarge"
vpc_security_group_ids = [aws_security_group.trainer_sg.id]
tags = {
Name = "gpu-trainer-dev"
env = "dev"
owner = "ml-team"
cost = "ml-experiments"
}
}
output "trainer_id" { value = aws_instance.gpu_trainer.id }
What to look for: plan shows 2 resources to add. After apply, you get the instance ID and can verify restricted SSH.
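Rather than hard-coding a placeholder AMI, you can resolve one with a data source. A minimal sketch, assuming an Amazon-owned Deep Learning AMI (the owner and name pattern below are assumptions; verify them for your region):
data "aws_ami" "gpu" {
  most_recent = true
  owners      = ["amazon"] # assumption: Amazon-published images
  filter {
    name   = "name"
    values = ["Deep Learning AMI GPU*"] # assumption: check the exact name pattern
  }
}
Then reference it with ami = data.aws_ami.gpu.id in the instance resource.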
Example 3: Reusable module for a standard ML bucket
Goal: Create a small module that your team can reuse for datasets, features, and models consistently.
# modules/ml_bucket/variables.tf
variable "name" { type = string }
variable "tags" { type = map(string) default = {} }
# modules/ml_bucket/main.tf
resource "aws_s3_bucket" "this" {
bucket = var.name
tags = var.tags
}
resource "aws_s3_bucket_versioning" "this" {
bucket = aws_s3_bucket.this.id
versioning_configuration { status = "Enabled" }
}
output "bucket_name" { value = aws_s3_bucket.this.bucket }
# root/main.tf
module "features_bucket" {
source = "./modules/ml_bucket"
name = "ml-features-dev-123"
tags = { env = "dev", owner = "ml-team" }
}
What to look for: consistent conventions across buckets, easy reuse, fewer mistakes.
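The reuse payoff becomes visible when the module is instantiated several times; a sketch with hypothetical bucket names:
module "datasets_bucket" {
  source = "./modules/ml_bucket"
  name   = "ml-datasets-dev-123" # hypothetical name
  tags   = { env = "dev", owner = "ml-team" }
}
module "models_bucket" {
  source = "./modules/ml_bucket"
  name   = "ml-models-dev-123" # hypothetical name
  tags   = { env = "dev", owner = "ml-team" }
}
Each call gets the same versioning and tagging behavior without copy-pasting resource blocks.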
Step-by-step: from zero to first Terraform stack
- Initialize a folder: create main.tf, variables.tf, outputs.tf files.
- Pick a provider: configure region and credentials via environment variables. Do not hard-code secrets.
- Start with one resource: a storage bucket or a tiny VM.
- Run terraform init, then terraform plan to preview changes.
- Run terraform apply and verify outputs and tags.
- Refactor to a module when you repeat patterns.
- Move state to a remote, locked backend before collaborating.
- Add a naming convention, tagging policy, and cost labels (see the locals sketch below).
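One way to encode the naming convention and shared tags is a locals block; a minimal sketch, assuming a {proj}-{env}-{component} pattern with hypothetical values:
locals {
  project = "mlplat" # hypothetical project code
  env     = "dev"
  common_tags = {
    env   = local.env
    owner = "ml-team"
    cost  = "ml-experiments"
  }
}
resource "aws_s3_bucket" "features" {
  bucket = "${local.project}-${local.env}-features" # {proj}-{env}-{component}
  tags   = local.common_tags
}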
Tip: Remote state and locking (team-ready)
Use a remote backend with locking so two people cannot apply at the same time. Store state in a secure, encrypted bucket and enable the locking mechanism your backend offers (e.g., a DynamoDB lock table for the S3 backend). Do not commit state files to git.
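A minimal sketch of an S3 backend with locking, assuming the state bucket and DynamoDB lock table already exist (all names below are placeholders):
terraform {
  backend "s3" {
    bucket         = "my-tf-state-bucket" # placeholder; pre-created, encrypted bucket
    key            = "ml-platform/dev/terraform.tfstate"
    region         = "us-east-1" # placeholder region
    encrypt        = true
    dynamodb_table = "tf-state-locks" # placeholder; pre-created lock table
  }
}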
Exercises
Do these locally with your preferred cloud account. If you do not have GPU access, still write the code; plan should succeed even if you do not apply.
Exercise 1 — Versioned artifact bucket with lifecycle
Goal: Create an artifact bucket with versioning and a lifecycle rule to transition non-current versions to a cheaper class after 30 days. Tag it with env=dev, owner=ml-team.
- Inputs: region, bucket_name
- Deliverables: main.tf, variables.tf, outputs.tf
Success criteria checklist:
- Plan shows 1 bucket, 1 versioning config, 1 lifecycle rule.
- Tags applied.
- Output prints bucket name.
Exercise 2 — GPU trainer with restricted SSH
Goal: Create a GPU instance with a security group that allows SSH only from your IP. Add tags: env=dev, owner=ml-team, cost=ml-experiments.
- Inputs: my_ip (CIDR), instance_type (default g4dn.xlarge), AMI ID
- Deliverables: main.tf, variables.tf, outputs.tf
Success criteria checklist:
- Plan shows a security group and one instance.
- Tags present on both resources.
- Output prints instance ID.
Solutions
Exercise 1 (sketch)
variable "region" { type = string }
variable "bucket_name" { type = string }
# provider "aws" { region = var.region }
resource "aws_s3_bucket" "artifacts" { bucket = var.bucket_name tags = { env = "dev", owner = "ml-team" } }
resource "aws_s3_bucket_versioning" "v" { bucket = aws_s3_bucket.artifacts.id versioning_configuration { status = "Enabled" } }
resource "aws_s3_bucket_lifecycle_configuration" "lc" {
bucket = aws_s3_bucket.artifacts.id
rule { id = "archive" status = "Enabled" noncurrent_version_transition { noncurrent_days = 30 storage_class = "STANDARD_IA" } }
}
output "bucket" { value = aws_s3_bucket.artifacts.bucket }
Exercise 2 (sketch)
variable "my_ip" { type = string }
variable "instance_type" { type = string default = "g4dn.xlarge" }
variable "ami" { type = string }
resource "aws_security_group" "ssh" {
name = "ssh-only-my-ip"
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.my_ip]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
tags = { env = "dev", owner = "ml-team", cost = "ml-experiments" }
}
resource "aws_instance" "gpu" {
ami = var.ami
instance_type = var.instance_type
vpc_security_group_ids = [aws_security_group.ssh.id]
tags = { Name = "gpu-trainer-dev", env = "dev", owner = "ml-team", cost = "ml-experiments" }
}
output "gpu_id" { value = aws_instance.gpu.id }
Common mistakes and self-checks
- Hard-coding secrets: Self-check: No credentials or tokens appear in .tf files or git history.
- No remote state: Self-check: State backend is remote and locked before team use.
- Skipping plan: Self-check: Always run and review terraform plan in CI or locally.
- Drift from console clicks: Self-check: Enforce IaC-only changes; detect drift via plan.
- Missing tags: Self-check: All resources include env, owner, and cost tags (see the default_tags sketch after this list).
- Unclear naming: Self-check: Follow a naming convention like {proj}-{env}-{component}.
- Overly permissive networks: Self-check: Security groups use least privilege (e.g., SSH from your IP only).
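On AWS, one guardrail against missing tags is the provider's default_tags block, which applies tags to every taggable resource it manages; a minimal sketch:
provider "aws" {
  region = var.region
  default_tags {
    tags = {
      env   = "dev"
      owner = "ml-team"
      cost  = "ml-experiments"
    }
  }
}
Resource-level tags still work and are merged on top of the defaults.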
Quick audit checklist
- Plan is clean (no surprise replacements).
- State is remote and encrypted.
- Modules encapsulate repeatable patterns.
- Outputs do not leak secrets.
- Destroy command succeeds in dev environments.
Practical projects
- Project 1: ML Artifact Platform
- Create buckets for datasets, features, models with a reusable module.
- Add lifecycle and versioning; expose outputs for downstream jobs.
- Project 2: Training-on-demand GPU
- Provision a GPU instance with restricted SSH and cost tags.
- Add start/stop scripts and schedule off-hours shutdown using cloud-native schedulers.
- Project 3: Staging-to-Prod Promotion
- Use the same module to create dev/staging/prod with different variables.
- Practice promotion via pull requests and plan/apply per environment (see the sketch below).
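A minimal sketch of the promotion pattern, assuming per-environment variable files (dev.tfvars and prod.tfvars are hypothetical names):
variable "env" { type = string } # set per environment, e.g. terraform plan -var-file=dev.tfvars
module "artifacts_bucket" {
  source = "./modules/ml_bucket"
  name   = "ml-artifacts-${var.env}-123" # hypothetical naming scheme
  tags   = { env = var.env, owner = "ml-team" }
}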
Learning path
- Step 1: Learn declarative basics (resources, variables, outputs, plan/apply).
- Step 2: Master modules and inputs/outputs for reuse.
- Step 3: Configure remote, locked state for collaboration.
- Step 4: Add policies: naming, tagging, least privilege IAM.
- Step 5: Integrate IaC with CI for plan and apply-on-approval.
- Step 6: Manage drift and implement review gates.
- Step 7: Expand to multi-environment and module versioning.
Next steps
- Turn your examples into modules and publish internally.
- Add tests that validate JSON plans or use policy-as-code to enforce tags/roles (a lightweight in-language option is sketched below).
- Automate cost guardrails: budgets and alerts tied to tags.
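Before reaching for a full policy-as-code tool, Terraform's built-in variable validation can enforce simple tag rules; a minimal sketch requiring env and owner keys:
variable "tags" {
  type = map(string)
  validation {
    condition     = contains(keys(var.tags), "env") && contains(keys(var.tags), "owner")
    error_message = "Tags must include the keys env and owner."
  }
}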
Mini challenge
Design a minimal, reproducible ML training environment:
- One versioned artifact bucket.
- One GPU instance with restricted SSH and tags.
- Outputs: bucket name, instance ID, public IP (if used).
- Include a destroy plan and show zero drift on re-apply.
Hint
- Start from Examples 1 and 2. Add an output for the public IP (one-liner below).
- Commit, plan, apply, re-apply (expect no changes), then destroy.
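The public IP output is a one-liner, assuming the instance from Example 2:
output "trainer_public_ip" { value = aws_instance.gpu_trainer.public_ip }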
Quick Test
Take the quick test below to check your understanding.