Who this is for
- Machine Learning Engineers who need reproducible cloud environments for experiments, training, and deployment.
- Data Scientists transitioning to production ML with minimal DevOps background.
- Anyone aiming to standardize and track infrastructure changes for ML pipelines.
Prerequisites
- Basic CLI comfort (terminal, environment variables).
- High-level understanding of cloud resources: compute, storage, networking.
- Optional but helpful: familiarity with YAML/JSON or a general programming mindset.
Why this matters
In real ML work, you will:
- Spin up GPU training machines, storage buckets for datasets, and networks securely and repeatedly.
- Create identical dev/staging/prod environments for models and features.
- Track and review infrastructure changes like code, enabling code review and rollback.
- Control cost by deleting infra reliably after experiments.
Example day-in-the-life tasks
- Provision a GPU instance, attach an artifact bucket, and tag resources for cost tracking.
- Roll out a feature store or inference endpoint with identical configs across regions.
- Audit an infrastructure change via a pull request diff instead of manual clicks.
Concept explained simply
Infrastructure as Code (IaC) means describing your cloud setup in files instead of clicking in the console. You write declarative files that say what you want (e.g., a bucket with versioning, a VM with a GPU), and a tool makes reality match the files.
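For example, a single declarative resource in Terraform HCL (the bucket name below is a placeholder) is enough to express "I want a bucket":
resource "aws_s3_bucket" "example" {
  bucket = "my-ml-artifacts-example" # placeholder; bucket names must be globally unique
}
Run the tool and it creates the bucket if missing, updates it if it differs, and leaves it alone if it already matches.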
Mental model
- Shopping list model: Your IaC files are a precise shopping list for cloud resources. The tool compares your list with what you already have and adds/changes/removes items to match.
- Version control: Because your list is code, you commit it, review it, and roll it back like any software change.
- Idempotency: Applying the same configuration again should not create duplicates; it should converge to the same state.
Core concepts and terms
- Declarative vs. imperative: Most IaC tools are declarative (you state the desired end state).
- Providers: Plugins that know how to talk to a cloud (e.g., AWS, GCP, Azure).
- Resources: The things you create (buckets, instances, networks).
- Modules: Reusable building blocks (like a standard GPU instance module for your team).
- Variables/Outputs: Inputs to customize modules and outputs to expose useful values (like IP addresses).
- State: A file that records what is currently deployed. Use remote, locked state for teams.
- Plan/Apply: Plan shows what would change. Apply executes changes.
- Drift: When reality changes behind your back (e.g., someone edits in the console). IaC can detect and reconcile.
- Tagging/Labels: Attach metadata for ownership, environment, and cost tracking.
- Security: Least-privilege roles, secret handling, and no hard-coded credentials in code.
Common tools you'll hear about
- Terraform (HCL, multi-cloud)
- CloudFormation/SAM (AWS), Bicep (Azure), Deployment Manager (GCP), Pulumi (multi-language)
- Kubernetes manifests/Helm (for cluster-level resources)
Worked examples
Example 1: Storage bucket for ML artifacts with versioning and lifecycle
Goal: Version your model artifacts and move older versions to cheaper storage after 30 days.
terraform { required_version = ">= 1.5.0" }
provider "aws" { region = var.region }
variable "region" { type = string }
variable "bucket_name" { type = string }
resource "aws_s3_bucket" "artifacts" {
bucket = var.bucket_name
tags = {
project = "ml-platform"
env = "dev"
owner = "ml-team"
}
}
resource "aws_s3_bucket_versioning" "this" {
bucket = aws_s3_bucket.artifacts.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_lifecycle_configuration" "this" {
bucket = aws_s3_bucket.artifacts.id
  rule {
    id     = "archive-old"
    status = "Enabled"
    filter {} # empty filter applies the rule to all objects
    noncurrent_version_transition {
      noncurrent_days = 30
      storage_class   = "STANDARD_IA"
    }
  }
}
output "artifact_bucket" { value = aws_s3_bucket.artifacts.bucket }
What to look for: terraform plan shows the new bucket, versioning configuration, and lifecycle rule. After apply, the bucket name appears in the outputs.
Example 2: GPU training instance with restricted SSH and cost tags
Goal: Create a GPU-enabled VM for training, allow SSH only from your IP, and tag it for cost tracking.
variable "my_ip" { type = string } # e.g. "203.0.113.5/32"
resource "aws_security_group" "trainer_sg" {
name = "trainer-sg"
description = "SSH from my IP only"
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = [var.my_ip]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = { env = "dev", owner = "ml-team" }
}
resource "aws_instance" "gpu_trainer" {
ami = "ami-gpu-placeholder" # use a real GPU AMI
instance_type = "g4dn.xlarge"
vpc_security_group_ids = [aws_security_group.trainer_sg.id]
tags = {
Name = "gpu-trainer-dev"
env = "dev"
owner = "ml-team"
cost = "ml-experiments"
}
}
output "trainer_id" { value = aws_instance.gpu_trainer.id }
What to look for: plan shows 2 resources to add. After apply, you get the instance ID and can verify restricted SSH.
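Rather than hard-coding a placeholder AMI, you can resolve one with a data source. A minimal sketch, assuming an Amazon-owned Deep Learning AMI (the owner and name pattern below are assumptions; verify them for your region):
data "aws_ami" "gpu" {
  most_recent = true
  owners      = ["amazon"] # assumption: Amazon-published images
  filter {
    name   = "name"
    values = ["Deep Learning AMI GPU*"] # assumption: check the exact name pattern
  }
}
Then reference it with ami = data.aws_ami.gpu.id in the instance resource.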
Example 3: Reusable module for a standard ML bucket
Goal: Create a small module that your team can reuse for datasets, features, and models consistently.
# modules/ml_bucket/variables.tf
variable "name" { type = string }
variable "tags" { type = map(string) default = {} }
# modules/ml_bucket/main.tf
resource "aws_s3_bucket" "this" {
bucket = var.name
tags = var.tags
}
resource "aws_s3_bucket_versioning" "this" {
bucket = aws_s3_bucket.this.id
versioning_configuration { status = "Enabled" }
}
output "bucket_name" { value = aws_s3_bucket.this.bucket }
# root/main.tf
module "features_bucket" {
source = "./modules/ml_bucket"
name = "ml-features-dev-123"
tags = { env = "dev", owner = "ml-team" }
}
What to look for: consistent conventions across buckets, easy reuse, fewer mistakes.
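The reuse payoff becomes visible when the module is instantiated several times; a sketch with hypothetical bucket names:
module "datasets_bucket" {
  source = "./modules/ml_bucket"
  name   = "ml-datasets-dev-123" # hypothetical name
  tags   = { env = "dev", owner = "ml-team" }
}
module "models_bucket" {
  source = "./modules/ml_bucket"
  name   = "ml-models-dev-123" # hypothetical name
  tags   = { env = "dev", owner = "ml-team" }
}
Each call gets the same versioning and tagging behavior without copy-pasting resource blocks.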
Step-by-step: from zero to first Terraform stack
- Initialize a folder: create main.tf, variables.tf, outputs.tf files.
- Pick a provider: configure region and credentials via environment variables. Do not hard-code secrets.
- Start with one resource: a storage bucket or a tiny VM.
- Run terraform init, then terraform plan to preview changes.
- Run terraform apply and verify outputs and tags.
- Refactor to a module when you repeat patterns.
- Move state to a remote, locked backend before collaborating.
- Add a naming convention, tagging policy, and cost labels (see the locals sketch below).
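One way to encode the naming convention and shared tags is a locals block; a minimal sketch, assuming a {proj}-{env}-{component} pattern with hypothetical values:
locals {
  project = "mlplat" # hypothetical project code
  env     = "dev"
  common_tags = {
    env   = local.env
    owner = "ml-team"
    cost  = "ml-experiments"
  }
}
resource "aws_s3_bucket" "features" {
  bucket = "${local.project}-${local.env}-features" # {proj}-{env}-{component}
  tags   = local.common_tags
}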
Tip: Remote state and locking (team-ready)
Use a remote backend with locking so two people cannot apply at the same time. Store state in a secure, encrypted bucket and enable the locking mechanism your backend offers (e.g., a DynamoDB lock table for the S3 backend). Do not commit state files to git.
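A minimal sketch of an S3 backend with locking, assuming the state bucket and DynamoDB lock table already exist (all names below are placeholders):
terraform {
  backend "s3" {
    bucket         = "my-tf-state-bucket" # placeholder; pre-created, encrypted bucket
    key            = "ml-platform/dev/terraform.tfstate"
    region         = "us-east-1" # placeholder region
    encrypt        = true
    dynamodb_table = "tf-state-locks" # placeholder; pre-created lock table
  }
}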
Exercises
Do these locally with your preferred cloud account. If you do not have GPU access, still write the code; plan should succeed even if you do not apply.
Exercise 1 — Versioned artifact bucket with lifecycle
Goal: Create an artifact bucket with versioning and a lifecycle rule to transition non-current versions to a cheaper class after 30 days. Tag it with env=dev, owner=ml-team.
- Inputs: region, bucket_name
- Deliverables: main.tf, variables.tf, outputs.tf
Success criteria checklist:
- Plan shows 1 bucket, 1 versioning config, 1 lifecycle rule.
- Tags applied.
- Output prints bucket name.
Exercise 2 — GPU trainer with restricted SSH
Goal: Create a GPU instance with a security group that allows SSH only from your IP. Add tags: env=dev, owner=ml-team, cost=ml-experiments.
- Inputs: my_ip (CIDR), instance_type (default g4dn.xlarge), AMI ID
- Deliverables: main.tf, variables.tf, outputs.tf
Success criteria checklist:
- Plan shows a security group and one instance.
- Tags present on both resources.
- Output prints instance ID.
Solutions
Exercise 1 (sketch)
variable "region" { type = string }
variable "bucket_name" { type = string }
# provider "aws" { region = var.region }
resource "aws_s3_bucket" "artifacts" { bucket = var.bucket_name tags = { env = "dev", owner = "ml-team" } }
resource "aws_s3_bucket_versioning" "v" { bucket = aws_s3_bucket.artifacts.id versioning_configuration { status = "Enabled" } }
resource "aws_s3_bucket_lifecycle_configuration" "lc" {
bucket = aws_s3_bucket.artifacts.id
rule { id = "archive" status = "Enabled" noncurrent_version_transition { noncurrent_days = 30 storage_class = "STANDARD_IA" } }
}
output "bucket" { value = aws_s3_bucket.artifacts.bucket }
Exercise 2 (sketch)
variable "my_ip" { type = string }
variable "instance_type" { type = string default = "g4dn.xlarge" }
variable "ami" { type = string }
resource "aws_security_group" "ssh" {
name = "ssh-only-my-ip"
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.my_ip]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
tags = { env = "dev", owner = "ml-team", cost = "ml-experiments" }
}
resource "aws_instance" "gpu" {
ami = var.ami
instance_type = var.instance_type
vpc_security_group_ids = [aws_security_group.ssh.id]
tags = { Name = "gpu-trainer-dev", env = "dev", owner = "ml-team", cost = "ml-experiments" }
}
output "gpu_id" { value = aws_instance.gpu.id }
Common mistakes and self-checks
- Hard-coding secrets: Self-check: No credentials or tokens appear in .tf files or git history.
- No remote state: Self-check: State backend is remote and locked before team use.
- Skipping plan: Self-check: Always run and review terraform plan in CI or locally.
- Drift from console clicks: Self-check: Enforce IaC-only changes; detect drift via plan.
- Missing tags: Self-check: All resources include env, owner, and cost tags (see the default_tags sketch after this list).
- Unclear naming: Self-check: Follow a naming convention like {proj}-{env}-{component}.
- Overly permissive networks: Self-check: Security groups use least privilege (e.g., SSH from your IP only).
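On AWS, one guardrail against missing tags is the provider's default_tags block, which applies tags to every taggable resource it manages; a minimal sketch:
provider "aws" {
  region = var.region
  default_tags {
    tags = {
      env   = "dev"
      owner = "ml-team"
      cost  = "ml-experiments"
    }
  }
}
Resource-level tags still work and are merged on top of the defaults.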
Quick audit checklist
- Plan is clean (no surprise replacements).
- State is remote and encrypted.
- Modules encapsulate repeatable patterns.
- Outputs do not leak secrets.
- Destroy command succeeds in dev environments.
Practical projects
- Project 1: ML Artifact Platform
- Create buckets for datasets, features, models with a reusable module.
- Add lifecycle and versioning; expose outputs for downstream jobs.
- Project 2: Training-on-demand GPU
- Provision a GPU instance with restricted SSH and cost tags.
- Add start/stop scripts and schedule off-hours shutdown using cloud-native schedulers.
- Project 3: Staging-to-Prod Promotion
- Use the same module to create dev/staging/prod with different variables.
- Practice promotion via pull requests and plan/apply per environment (see the sketch below).
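A minimal sketch of the promotion pattern, assuming per-environment variable files (dev.tfvars and prod.tfvars are hypothetical names):
variable "env" { type = string } # set per environment, e.g. terraform plan -var-file=dev.tfvars
module "artifacts_bucket" {
  source = "./modules/ml_bucket"
  name   = "ml-artifacts-${var.env}-123" # hypothetical naming scheme
  tags   = { env = var.env, owner = "ml-team" }
}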
Learning path
- Step 1: Learn declarative basics (resources, variables, outputs, plan/apply).
- Step 2: Master modules and inputs/outputs for reuse.
- Step 3: Configure remote, locked state for collaboration.
- Step 4: Add policies: naming, tagging, least privilege IAM.
- Step 5: Integrate IaC with CI for plan and apply-on-approval.
- Step 6: Manage drift and implement review gates.
- Step 7: Expand to multi-environment and module versioning.
Next steps
- Turn your examples into modules and publish internally.
- Add tests that validate JSON plans or use policy-as-code to enforce tags/roles (a lightweight in-language option is sketched below).
- Automate cost guardrails: budgets and alerts tied to tags.
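Before reaching for a full policy-as-code tool, Terraform's built-in variable validation can enforce simple tag rules; a minimal sketch requiring env and owner keys:
variable "tags" {
  type = map(string)
  validation {
    condition     = contains(keys(var.tags), "env") && contains(keys(var.tags), "owner")
    error_message = "Tags must include the keys env and owner."
  }
}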
Mini challenge
Design a minimal, reproducible ML training environment:
- One versioned artifact bucket.
- One GPU instance with restricted SSH and tags.
- Outputs: bucket name, instance ID, public IP (if used).
- Include a destroy plan and show zero drift on re-apply.
Hint
- Start from Examples 1 and 2. Add an output for the public IP (one-liner below).
- Commit, plan, apply, re-apply (expect no changes), then destroy.
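The public IP output is a one-liner, assuming the instance from Example 2:
output "trainer_public_ip" { value = aws_instance.gpu_trainer.public_ip }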
Quick Test
Take the quick test below to check your understanding.