Why Cloud Basics matter for ML Engineers
Cloud skills let you move from experiments on a laptop to reliable, scalable ML systems. You will store large datasets and models, spin up CPU/GPU compute on demand, control access with IAM, keep costs under control, and monitor performance. These fundamentals unlock faster iteration, reproducibility, and production readiness.
What this helps you do
- Host datasets, models, and artifacts in object storage.
- Choose the right compute (CPU vs GPU) for each job.
- Use managed ML services to speed up training and deployment.
- Secure workloads with VPCs, security groups, and least-privilege IAM.
- Estimate and cap costs; avoid surprise bills.
- Observe logs/metrics to debug and scale.
- Automate environments with Infrastructure as Code (IaC).
Who this is for
- Aspiring and junior ML Engineers moving workloads to the cloud.
- Data Scientists who need scalable experiments and reproducible pipelines.
- MLOps practitioners standardizing environments and costs.
Prerequisites
- Comfortable with Python and the command line.
- Basic ML workflow knowledge (train, eval, save models).
- Git basics (clone, commit, push).
Nice-to-have
- Docker fundamentals (images, containers).
- Familiarity with Linux permissions and environment variables.
Learning path
Work through the subskills listed at the end of this page in order; the worked examples, drills, and mini project below give you hands-on practice with each one.
Tips for success
- Start with one project and keep a journal of commands, settings, regions.
- Use tags/labels on all resources (project, owner, stage). This improves cost tracking and cleanup.
- Automate cleanup early (lifecycle rules, job TTLs).
Worked examples (vendor-agnostic patterns)
1) Store datasets and artifacts in object storage
Pattern: one bucket per project, with prefixes for datasets/, models/, and logs/. The example uses an S3-compatible client; adapt it to your cloud.
import os
import boto3
s3 = boto3.client("s3")
BUCKET = "ml-bucket-demo"
# Upload dataset
s3.upload_file("data/train.csv", BUCKET, "datasets/train.csv")
# Download later (create the local cache directory first)
os.makedirs("cache", exist_ok=True)
s3.download_file(BUCKET, "datasets/train.csv", "./cache/train.csv")
# Set a simple lifecycle to expire logs after 30 days
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Expiration": {"Days": 30},
            }
        ]
    },
)
Troubleshooting
- AccessDenied: verify the role or user has GetObject/PutObject permissions.
- NotFound: check bucket name, region, and key prefix.
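To tell these two errors apart programmatically, here is a minimal boto3 diagnostic sketch; it reuses the bucket and key from above, and the helper name is illustrative.
from botocore.exceptions import ClientError
def diagnose(bucket, key):
    try:
        s3.head_bucket(Bucket=bucket)           # 403 -> permissions, 404 -> wrong bucket/region
        s3.head_object(Bucket=bucket, Key=key)  # 404 -> wrong key or prefix
        print("Bucket and key are reachable.")
    except ClientError as err:
        code = err.response["Error"]["Code"]
        if code in ("403", "AccessDenied"):
            print("AccessDenied: check GetObject/PutObject permissions on the role or user.")
        elif code in ("404", "NoSuchKey", "NotFound"):
            print("NotFound: check bucket name, region, and key prefix.")
        else:
            raise
diagnose(BUCKET, "datasets/train.csv")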
2) Pick the right compute: CPU vs GPU
Pattern: use CPU for ETL/feature engineering; GPU for deep learning training. Use containers for reproducibility.
# Dockerfile for GPU training
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# The CUDA runtime base image does not ship Python, so install it first
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cu121
COPY train.py /app/train.py
WORKDIR /app
CMD ["python3", "train.py"]
# Generic CLI (placeholder) to launch a GPU VM/instance
cloud compute create \
  --name gpu-trainer \
  --machine-type standard-8 \
  --gpu t4:1 \
  --disk 100 \
  --image "your-gpu-image" \
  --tags project=demo,stage=dev
# In train.py, verify GPU availability and fall back to CPU
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
Cost tip
Use spot/preemptible instances for non-critical training to reduce costs, and checkpoint regularly so interrupted jobs can resume.
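A minimal PyTorch checkpointing sketch follows; the path and helper names are illustrative, and you would copy the checkpoint to object storage after each save.
import os
import torch
CKPT_PATH = "checkpoints/latest.pt"  # illustrative local path; sync to the bucket between epochs
def save_checkpoint(model, optimizer, epoch):
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch}, CKPT_PATH)
def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # fresh run: start at epoch 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1  # resume from the next epoch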
3) Train with a managed ML service and store artifacts
Pattern: submit a job with environment variables for data bucket and experiment tracking.
# Minimal MLflow usage with remote tracking and S3-like artifact store
import os, mlflow
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
mlflow.set_experiment("demo-exp")
with mlflow.start_run():
    mlflow.log_param("model", "logreg")
    mlflow.log_metric("val_auc", 0.912)
    mlflow.log_artifact("models/model.pkl", artifact_path="models")
Practical note
Pass credentials to the job via a dedicated service account with least-privilege access to the artifact bucket.
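A fail-fast check at job start helps catch missing configuration early. This is a minimal sketch assuming an S3-compatible artifact store and the MLflow setup above; on platforms that inject the service account automatically, no static keys are needed and the list shrinks.
import os
REQUIRED = ["MLFLOW_TRACKING_URI", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]  # adjust to your platform
missing = [name for name in REQUIRED if name not in os.environ]
if missing:
    raise RuntimeError(f"Job is missing credentials/config: {missing}")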
4) IAM and network basics
Pattern: grant read-only to datasets bucket for training role; restrict inbound traffic with security groups.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::ml-bucket-demo",
        "arn:aws:s3:::ml-bucket-demo/datasets/*"
      ]
    }
  ]
}
# Security group rules (conceptual)
INBOUND:
- TCP 22 from your-ip/32
- TCP 443 from 0.0.0.0/0 (if serving HTTPS)
- Deny all else
OUTBOUND:
- Allow HTTPS egress for package/model downloads
5) Quick cost estimate and quota check
# Rough cost calculator (prices vary by provider and region; treat these as rough estimates)
gpu_hour_cost = 0.35 # USD/hour example
hours = 24
storage_gb = 200
storage_cost_per_gb_month = 0.02
compute_cost = gpu_hour_cost * hours
storage_cost = storage_gb * storage_cost_per_gb_month * (hours / 720)  # 720 ≈ hours in a month
print(f"Compute: ${compute_cost:.2f}, Storage: ${storage_cost:.2f}, Total: ${compute_cost+storage_cost:.2f}")
# Quota pseudo-check (replace with your provider's CLI)
# cloud quotas list --service compute --region us
6) Observability: logs and simple metrics
# Structured logging in Python
import json, time
def log(event, **kwargs):
    print(json.dumps({"ts": time.time(), "event": event, **kwargs}))
log("job_start", job_id="exp-123")
# ... training ...
log("metric", name="val_auc", value=0.912)
log("job_end", status="success")
# Simple health endpoint for a model server
from flask import Flask, jsonify
app = Flask(__name__)
@app.route("/health")
def health():
    return jsonify(status="ok")
Alerting idea
Alert when job_end status != success or when metrics vanish for N minutes.
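A minimal sketch of that check, assuming the structured log format above (the threshold and function name are illustrative):
import json, time
def check_alerts(log_lines, max_silence_s=600):
    last_metric_ts = None
    for line in log_lines:
        event = json.loads(line)
        if event["event"] == "metric":
            last_metric_ts = event["ts"]
        if event["event"] == "job_end" and event.get("status") != "success":
            print("ALERT: job ended with status", event.get("status"))
    if last_metric_ts and time.time() - last_metric_ts > max_silence_s:
        print(f"ALERT: no metrics for over {max_silence_s} seconds")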
Skill drills
- Create a bucket with lifecycle rules that expire logs after 30 days.
- Upload a 1 GB dataset and verify its checksum after download (see the checksum sketch after this list).
- Launch a CPU instance to run a preprocessing script; record runtime and cost.
- Launch a GPU instance; verify CUDA availability; stop it immediately after use.
- Create a least-privilege role with read-only access to datasets and write access only to models/ path.
- Set a budget alert at a low threshold for your project and trigger a test alert.
- Deploy a tiny Flask health endpoint and curl it from a private VM via a bastion or secure rule.
- Write an IaC template that defines: one bucket, one VM, and a security group with only 22 and 443 open.
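For the checksum drill, a minimal sketch; the paths reuse the object-storage example above.
import hashlib
def sha256_of(path, chunk_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
assert sha256_of("data/train.csv") == sha256_of("./cache/train.csv")  # must match after download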
Common mistakes and how to fix them
Using overly broad or public permissions for speed
Use least-privilege IAM. Start read-only; add writes only where needed. Restrict by bucket prefix (e.g., models/ only).
Training in the wrong region
Keep storage and compute in the same region to reduce latency and egress costs.
Forgetting to stop instances
Automate shutdown with job schedulers or TTL scripts. Always tag resources and list running instances before you log off.
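One lightweight option is a TTL guard inside the job itself. This is a minimal sketch assuming a Linux VM where the job user may invoke shutdown; swap in your provider's stop/terminate API if you prefer.
import subprocess, time
MAX_RUNTIME_S = 4 * 3600  # hard TTL for this job's VM
start = time.time()
def shutdown_if_expired():
    if time.time() - start > MAX_RUNTIME_S:
        subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
# Call shutdown_if_expired() at the end of each epoch or pipeline stage.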
GPU driver/CUDA mismatch
Use official CUDA base images and pin versions. Validate torch.cuda.is_available() at startup.
Blocked outbound traffic
If package downloads fail, allow HTTPS egress in your security group or via a NAT gateway.
Mini project: Cloud-native training run
Goal: preprocess data on CPU, train on GPU, store artifacts, and capture metrics/logs.
- Provision via IaC: one bucket, a VPC with two subnets, a security group (22/443 only).
- Upload dataset to bucket under datasets/ and set lifecycle for logs/.
- Run preprocessing on a small CPU instance; write outputs to bucket under features/.
- Launch a GPU job using a container image; train for 1–2 epochs; log metrics and save model to models/.
- Expose a minimal health endpoint for the model; test from a VM in the same VPC.
- Tear down compute; verify costs and clean up temporary files.
Deliverables checklist
- IaC template file.
- Container image reference and train.py.
- Logs/metrics output and final model path.
- Post-mortem with costs, issues, and next improvements.
Subskills
- Object Storage for Datasets and Artifacts — Buckets, prefixes, lifecycle rules, and integrity checks.
- Compute Options (CPU vs GPU) — When to use CPU vs GPU, containers, and right-sizing.
- Managed ML Services Basics — Job submission, experiment tracking, artifact stores.
- Networking: VPC and Security Groups Basics — Private networks, inbound/outbound rules, safe defaults.
- IAM and Permissions Basics — Users, roles, service accounts, least privilege.
- Cost Awareness And Quotas — Budgets, quotas, spot/preemptible usage, lifecycle cleanup.
- Observability Stack Basics — Centralized logs, metrics, health checks, simple alerts.
- Infrastructure As Code Basics — Reproducible environments via templates and modules.
Next steps
- Finish the drills and the mini project to consolidate skills.
- Pick one provider and recreate the mini project end-to-end in that environment.
- Extend IaC to include a managed database or a feature store.
- Add CI to build/push containers and apply IaC automatically on branches.