Why Cloud Basics matter for ML Engineers
Cloud skills let you move from experiments on a laptop to reliable, scalable ML systems. You will store large datasets and models, spin up CPU/GPU compute on demand, control access with IAM, keep costs under control, and monitor performance. These fundamentals unlock faster iteration, reproducibility, and production readiness.
What this helps you do
- Host datasets, models, and artifacts in object storage.
- Choose the right compute (CPU vs GPU) for each job.
- Use managed ML services to speed up training and deployment.
- Secure workloads with VPCs, security groups, and least-privilege IAM.
- Estimate and cap costs; avoid surprise bills.
- Observe logs/metrics to debug and scale.
- Automate environments with Infrastructure as Code (IaC).
Who this is for
- Aspiring and junior ML Engineers moving workloads to the cloud.
- Data Scientists who need scalable experiments and reproducible pipelines.
- MLOps practitioners standardizing environments and costs.
Prerequisites
- Comfortable with Python and the command line.
- Basic ML workflow knowledge (train, eval, save models).
- Git basics (clone, commit, push).
Nice-to-have
- Docker fundamentals (images, containers).
- Familiarity with Linux permissions and environment variables.
Learning path
Work through the subskills listed at the end of this page in order; the worked examples, drills, and mini project below give you hands-on practice with each one.
Tips for success
- Start with one project and keep a journal of commands, settings, regions.
- Use tags/labels on all resources (project, owner, stage). This improves cost tracking and cleanup.
- Automate cleanup early (lifecycle rules, job TTLs).
Worked examples (vendor-agnostic patterns)
1) Store datasets and artifacts in object storage
Pattern: one bucket per project, with prefixes for datasets/, models/, and logs/. The example uses an S3-compatible client; adapt it to your cloud.
import os
import boto3
s3 = boto3.client("s3")
BUCKET = "ml-bucket-demo"
# Upload dataset
s3.upload_file("data/train.csv", BUCKET, "datasets/train.csv")
# Download later (create the local cache directory first)
os.makedirs("cache", exist_ok=True)
s3.download_file(BUCKET, "datasets/train.csv", "./cache/train.csv")
# Set a simple lifecycle to expire logs after 30 days
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Expiration": {"Days": 30},
            }
        ]
    },
)
Troubleshooting
- AccessDenied: verify the role or user has GetObject/PutObject permissions.
- NotFound: check bucket name, region, and key prefix.
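To tell these two errors apart programmatically, here is a minimal boto3 diagnostic sketch; it reuses the bucket and key from above, and the helper name is illustrative.
from botocore.exceptions import ClientError
def diagnose(bucket, key):
    try:
        s3.head_bucket(Bucket=bucket)           # 403 -> permissions, 404 -> wrong bucket/region
        s3.head_object(Bucket=bucket, Key=key)  # 404 -> wrong key or prefix
        print("Bucket and key are reachable.")
    except ClientError as err:
        code = err.response["Error"]["Code"]
        if code in ("403", "AccessDenied"):
            print("AccessDenied: check GetObject/PutObject permissions on the role or user.")
        elif code in ("404", "NoSuchKey", "NotFound"):
            print("NotFound: check bucket name, region, and key prefix.")
        else:
            raise
diagnose(BUCKET, "datasets/train.csv")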
2) Pick the right compute: CPU vs GPU
Pattern: use CPU for ETL/feature engineering; GPU for deep learning training. Use containers for reproducibility.
# Dockerfile for GPU training
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# The CUDA runtime base image does not ship Python, so install it first
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cu121
COPY train.py /app/train.py
WORKDIR /app
CMD ["python3", "train.py"]
# Generic CLI (placeholder) to launch a GPU VM/instance
cloud compute create \
  --name gpu-trainer \
  --machine-type standard-8 \
  --gpu t4:1 \
  --disk 100 \
  --image "your-gpu-image" \
  --tags project=demo,stage=dev
# In train.py, verify GPU availability and fall back to CPU
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
Cost tip
Use spot/preemptible instances for non-critical training to reduce costs, and checkpoint regularly so interrupted jobs can resume.
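A minimal PyTorch checkpointing sketch follows; the path and helper names are illustrative, and you would copy the checkpoint to object storage after each save.
import os
import torch
CKPT_PATH = "checkpoints/latest.pt"  # illustrative local path; sync to the bucket between epochs
def save_checkpoint(model, optimizer, epoch):
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch}, CKPT_PATH)
def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # fresh run: start at epoch 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1  # resume from the next epoch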
3) Train with a managed ML service and store artifacts
Pattern: submit a job with environment variables for data bucket and experiment tracking.
# Minimal MLflow usage with remote tracking and S3-like artifact store
import os, mlflow
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
mlflow.set_experiment("demo-exp")
with mlflow.start_run():
    mlflow.log_param("model", "logreg")
    mlflow.log_metric("val_auc", 0.912)
    mlflow.log_artifact("models/model.pkl", artifact_path="models")
Practical note
Pass credentials to the job via a dedicated service account with least-privilege access to the artifact bucket.
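A fail-fast check at job start helps catch missing configuration early. This is a minimal sketch assuming an S3-compatible artifact store and the MLflow setup above; on platforms that inject the service account automatically, no static keys are needed and the list shrinks.
import os
REQUIRED = ["MLFLOW_TRACKING_URI", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]  # adjust to your platform
missing = [name for name in REQUIRED if name not in os.environ]
if missing:
    raise RuntimeError(f"Job is missing credentials/config: {missing}")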
4) IAM and network basics
Pattern: grant read-only to datasets bucket for training role; restrict inbound traffic with security groups.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::ml-bucket-demo",
        "arn:aws:s3:::ml-bucket-demo/datasets/*"
      ]
    }
  ]
}
# Security group rules (conceptual)
INBOUND:
- TCP 22 from your-ip/32
- TCP 443 from 0.0.0.0/0 (if serving HTTPS)
- Deny all else
OUTBOUND:
- Allow HTTPS egress for package/model downloads
5) Quick cost estimate and quota check
# Rough cost calculator (prices vary by provider and region; treat these as rough estimates)
gpu_hour_cost = 0.35 # USD/hour example
hours = 24
storage_gb = 200
storage_cost_per_gb_month = 0.02
compute_cost = gpu_hour_cost * hours
storage_cost = storage_gb * storage_cost_per_gb_month * (hours / 720)  # 720 ≈ hours in a month
print(f"Compute: ${compute_cost:.2f}, Storage: ${storage_cost:.2f}, Total: ${compute_cost+storage_cost:.2f}")
# Quota pseudo-check (replace with your provider's CLI)
# cloud quotas list --service compute --region us
6) Observability: logs and simple metrics
# Structured logging in Python
import json, time
def log(event, **kwargs):
    print(json.dumps({"ts": time.time(), "event": event, **kwargs}))
log("job_start", job_id="exp-123")
# ... training ...
log("metric", name="val_auc", value=0.912)
log("job_end", status="success")
# Simple health endpoint for a model server
from flask import Flask, jsonify
app = Flask(__name__)
@app.route("/health")
def health():
    return jsonify(status="ok")
Alerting idea
Alert when job_end status != success or when metrics vanish for N minutes.
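A minimal sketch of that check, assuming the structured log format above (the threshold and function name are illustrative):
import json, time
def check_alerts(log_lines, max_silence_s=600):
    last_metric_ts = None
    for line in log_lines:
        event = json.loads(line)
        if event["event"] == "metric":
            last_metric_ts = event["ts"]
        if event["event"] == "job_end" and event.get("status") != "success":
            print("ALERT: job ended with status", event.get("status"))
    if last_metric_ts and time.time() - last_metric_ts > max_silence_s:
        print(f"ALERT: no metrics for over {max_silence_s} seconds")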
Skill drills
- Create a bucket with lifecycle rules that expire logs after 30 days.
- Upload a 1 GB dataset and verify its checksum after download (see the checksum sketch after this list).
- Launch a CPU instance to run a preprocessing script; record runtime and cost.
- Launch a GPU instance; verify CUDA availability; stop it immediately after use.
- Create a least-privilege role with read-only access to datasets and write access only to models/ path.
- Set a budget alert at a low threshold for your project and trigger a test alert.
- Deploy a tiny Flask health endpoint and curl it from a private VM via a bastion or secure rule.
- Write an IaC template that defines: one bucket, one VM, and a security group with only 22 and 443 open.
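For the checksum drill, a minimal sketch; the paths reuse the object-storage example above.
import hashlib
def sha256_of(path, chunk_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
assert sha256_of("data/train.csv") == sha256_of("./cache/train.csv")  # must match after download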
Common mistakes and how to fix them
Using overly broad or public permissions for speed
Use least-privilege IAM. Start read-only; add writes only where needed. Restrict by bucket prefix (e.g., models/ only).
Training in the wrong region
Keep storage and compute in the same region to reduce latency and egress costs.
Forgetting to stop instances
Automate shutdown with job schedulers or TTL scripts. Always tag resources and list running instances before you log off.
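One lightweight option is a TTL guard inside the job itself. This is a minimal sketch assuming a Linux VM where the job user may invoke shutdown; swap in your provider's stop/terminate API if you prefer.
import subprocess, time
MAX_RUNTIME_S = 4 * 3600  # hard TTL for this job's VM
start = time.time()
def shutdown_if_expired():
    if time.time() - start > MAX_RUNTIME_S:
        subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
# Call shutdown_if_expired() at the end of each epoch or pipeline stage.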
GPU driver/CUDA mismatch
Use official CUDA base images and pin versions. Validate torch.cuda.is_available() at startup.
Blocked outbound traffic
If package downloads fail, allow HTTPS egress in your security group or via a NAT gateway.
Mini project: Cloud-native training run
Goal: preprocess data on CPU, train on GPU, store artifacts, and capture metrics/logs.
- Provision via IaC: one bucket, a VPC with two subnets, a security group (22/443 only).
- Upload dataset to bucket under datasets/ and set lifecycle for logs/.
- Run preprocessing on a small CPU instance; write outputs to bucket under features/.
- Launch a GPU job using a container image; train for 1–2 epochs; log metrics and save model to models/.
- Expose a minimal health endpoint for the model; test from a VM in the same VPC.
- Tear down compute; verify costs and clean up temporary files.
Deliverables checklist
- IaC template file.
- Container image reference and train.py.
- Logs/metrics output and final model path.
- Post-mortem with costs, issues, and next improvements.
Subskills
- Object Storage for Datasets and Artifacts — Buckets, prefixes, lifecycle rules, and integrity checks.
- Compute Options (CPU vs GPU) — When to use CPU vs GPU, containers, and right-sizing.
- Managed ML Services Basics — Job submission, experiment tracking, artifact stores.
- Networking: VPC and Security Groups Basics — Private networks, inbound/outbound rules, safe defaults.
- IAM and Permissions Basics — Users, roles, service accounts, least privilege.
- Cost Awareness And Quotas — Budgets, quotas, spot/preemptible usage, lifecycle cleanup.
- Observability Stack Basics — Centralized logs, metrics, health checks, simple alerts.
- Infrastructure As Code Basics — Reproducible environments via templates and modules.
Next steps
- Finish the drills and the mini project to consolidate skills.
- Pick one provider and recreate the mini project end-to-end in that environment.
- Extend IaC to include a managed database or a feature store.
- Add CI to build/push containers and apply IaC automatically on branches.