What you will learn
MLOps Foundations gives you the core habits, tools, and team practices to take ML from notebooks to reliable production systems. For an MLOps Engineer, that unlocks faster model iteration, safer releases, auditable results, and less firefighting.
Why this matters in the MLOps Engineer role
- Own the ML lifecycle end-to-end: data, training, packaging, deployment, monitoring, and rollback.
- Guarantee reproducibility so teams can trust experiments and debug issues quickly.
- Promote models through Staging → Production with traceable approvals.
- Collaborate smoothly with Data Science and Platform teams using shared standards.
- Reduce risk with automated tests, controlled rollouts, and clear SLOs.
Who this is for
- Engineers moving from data science or software roles into MLOps.
- MLOps/Platform engineers formalizing ML delivery practices.
- Small teams standardizing model release processes.
Prerequisites
- Comfort with Python and the command line.
- Basic Git knowledge (branches, commits, tags).
- Familiarity with containers (Docker) is helpful but not required.
What you can do after this skill
- Reproduce model training runs exactly with fixed seeds, versioned data/code, and locked dependencies.
- Package training and inference environments using Docker or equivalent.
- Set up a basic model registry and promote models through stages with approvals.
- Add smoke tests and risk checks to CI for ML code and artifacts.
- Coordinate handoffs with Data Science and Platform teams using shared artifacts and definitions.
Practical roadmap
Milestone 1 — Reproducibility first
- Pin random seeds and enable framework determinism flags.
- Version-control code, and record data references such as a dataset version or hash (see the sketch after this list).
- Track experiments with params, metrics, and artifacts.
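The last two bullets can be combined in one small script: hash the input data and log the digest with the tracked run. A minimal sketch, assuming a placeholder data/iris.csv path and the same MLflow tracking used in the worked examples below.
# log_data_version.py — hash the training data and record it with the run.
# The data/iris.csv path is a placeholder; point it at your real input file.
import hashlib
import mlflow

def file_sha256(path: str) -> str:
    # Stream the file in chunks so large datasets do not load into memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    data_path = "data/iris.csv"
    mlflow.set_experiment("iris_rf")
    with mlflow.start_run():
        mlflow.log_param("data_path", data_path)
        mlflow.log_param("data_sha256", file_sha256(data_path))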
Milestone 2 — Environments and dependencies
- Lock Python packages to exact versions (a lockfile check sketch follows this list).
- Create a Docker image for training and inference.
- Automate environment setup in CI.
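One lightweight way to enforce the pins, sketched below: compare installed package versions against requirements.txt before training or tests run. It assumes plain name==version lines (no extras or markers) and uses only the standard library.
# check_lock.py — fail fast if the environment no longer matches requirements.txt.
# Assumes simple "name==version" pins with no markers or extras.
from importlib.metadata import version

def check(lockfile: str = "requirements.txt") -> None:
    with open(lockfile) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, pinned = line.partition("==")
            installed = version(name)
            assert installed == pinned, f"{name}: installed {installed}, pinned {pinned}"

if __name__ == "__main__":
    check()
    print("environment matches lockfile")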
Milestone 3 — Model promotion
- Introduce a model registry (or tagged storage).
- Define Staging and Production gates with checks.
- Automate promotions with approval steps.
Milestone 4 — Risk and reliability
- Add unit and smoke tests for data schemas and inference.
- Enable canary/blue‑green rollout and quick rollback.
- Define SLOs (latency, error rate) and simple alerts.
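The SLO bullet can start as small as the check sketched below: compute p95 latency and error rate from request records and return alert messages when a threshold is breached. The record format and the 300 ms / 1% thresholds are illustrative assumptions, not recommendations.
# slo_check.py — a minimal SLO check over (latency_ms, ok) request records.
import statistics

def check_slo(records: list[tuple[float, bool]], p95_ms: float = 300.0, max_error_rate: float = 0.01) -> list[str]:
    latencies = [latency for latency, _ in records]
    errors = sum(1 for _, ok in records if not ok)
    alerts = []
    p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile cut point
    if p95 > p95_ms:
        alerts.append(f"p95 latency {p95:.0f} ms exceeds {p95_ms:.0f} ms")
    if errors / len(records) > max_error_rate:
        alerts.append(f"error rate {errors / len(records):.1%} exceeds {max_error_rate:.0%}")
    return alerts

if __name__ == "__main__":
    print(check_slo([(120.0, True), (450.0, True), (90.0, False)] * 10))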
Milestone 5 — Collaboration
- Standardize handoff templates and acceptance criteria.
- Publish runbooks for on-call and incidents.
- Document the lifecycle in your repo (README + diagrams).
Mini task: 30‑minute setup
- Create a new repo; add a requirements.txt with exact versions.
- Add a train.py that sets random seeds and prints a metric.
- Commit, tag v0.1, and run the script twice to verify identical results.
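If you want to script the last step, a minimal sketch like the one below runs train.py twice and compares the printed output. It assumes train.py prints its metric deterministically, as in the worked example that follows.
# verify_repro.py — run the training script twice and compare its printed output.
import subprocess
import sys

def run_once() -> str:
    result = subprocess.run([sys.executable, "train.py"], capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    first, second = run_once(), run_once()
    assert first == second, f"runs differ:\n{first}\n{second}"
    print("identical output across two runs")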
Worked examples
1) Reproducible training run with MLflow tracking
This example fixes seeds, logs parameters/metrics, and saves the model artifact. Replace the placeholder model with your framework of choice.
# train.py
import os, random, numpy as np
import mlflow
from mlflow import sklearn as mlflow_sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)  # note: affects subprocesses; set it before the interpreter starts to influence the current process
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=SEED)
n_estimators = 100
clf = RandomForestClassifier(n_estimators=n_estimators, random_state=SEED)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
f1 = f1_score(y_te, pred, average="macro")
mlflow.set_experiment("iris_rf")
with mlflow.start_run():
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("f1_macro", f1)
    mlflow_sklearn.log_model(clf, "model")
print({"f1_macro": f1})
Why it works
- Seeded randomness ensures determinism.
- Experiment tracking captures params/metrics for auditability.
- Artifacts (model) are saved for later promotion.
2) Locked dependencies and Dockerized training
# requirements.txt (pin exact versions)
mlflow==2.14.1
scikit-learn==1.4.2
numpy==1.26.4
# Dockerfile
FROM python:3.11-slim AS base
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "train.py"]
# Build and run
# docker build -t mlops-foundations:0.1 .
# docker run --rm mlops-foundations:0.1
Tips
- Use exact versions to prevent dependency drift.
- Prefer slim base images to reduce size and attack surface.
3) Basic model promotion with registry stages
Concept: register the best run as a model and move through Staging → Production after checks. If you do not have a registry, simulate with storage folders and tags.
# Pseudocode for promotion logic
# 1) Pick best run by metric
# 2) Register or tag artifact as `my_model:staging`
# 3) Run smoke/contract tests
# 4) If pass, retag to `my_model:production`
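A minimal sketch of that promotion logic using the folder/tag simulation: the registry/my_model layout and the metrics.json file are illustrative assumptions, not a fixed convention.
# promote.py — folder/tag-based promotion (illustrative layout, not a standard).
import json
import shutil
from pathlib import Path

REGISTRY = Path("registry/my_model")

def best_candidate(candidates_dir: Path) -> Path:
    # Pick the candidate run with the highest f1_macro recorded in its metrics.json.
    runs = list(candidates_dir.iterdir())
    return max(runs, key=lambda run: json.loads((run / "metrics.json").read_text())["f1_macro"])

def promote(run_dir: Path, stage: str) -> None:
    # Copy the artifact into the stage folder; the folder name acts as the tag.
    target = REGISTRY / stage
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy(run_dir / "model.joblib", target / "model.joblib")

if __name__ == "__main__":
    run = best_candidate(REGISTRY / "candidates")
    promote(run, "staging")
    # After smoke/contract tests pass against registry/my_model/staging:
    # promote(run, "production")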
# Example smoke test (pytest style)
import joblib
import numpy as np
def test_inference_contract():
    # Load the saved artifact and check the prediction shape for a 1-row input.
    model = joblib.load("artifacts/model.joblib")
    X = np.zeros((1, 4))
    y = model.predict(X)
    assert y.shape == (1,)
4) CI smoke test to catch dependency drift
# ci.yaml (generic CI outline)
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install locked dependencies
        run: |
          python -m venv .venv
          . .venv/bin/activate
          pip install -r requirements.txt pytest
      - name: Smoke tests
        run: |
          . .venv/bin/activate
          pytest -q -k "contract or smoke"
What it prevents
- Silent package updates breaking inference.
- Schema mismatches after data changes.
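A schema check can be equally small. The sketch below asserts column names and dtypes against the contract the model was trained on; the iris column names, the data/iris.csv path, and the use of pandas are assumptions for illustration.
# test_schema_contract.py — pytest-style data schema check (columns and dtypes).
import pandas as pd

# Illustrative contract; replace with your real feature spec.
EXPECTED_SCHEMA = {
    "sepal_length": "float64",
    "sepal_width": "float64",
    "petal_length": "float64",
    "petal_width": "float64",
}

def test_input_schema():
    df = pd.read_csv("data/iris.csv")
    assert list(df.columns) == list(EXPECTED_SCHEMA), f"unexpected columns: {list(df.columns)}"
    for column, dtype in EXPECTED_SCHEMA.items():
        assert str(df[column].dtype) == dtype, f"{column}: expected {dtype}, got {df[column].dtype}"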
5) Safe rollout and rollback switch
# Env toggle for canary: the service reads MODEL_STAGE at startup.
import os
import joblib

MODEL_STAGE = os.getenv("MODEL_STAGE", "staging")
if MODEL_STAGE == "production":
    model_path = "registry/my_model/production/model.joblib"
else:
    model_path = "registry/my_model/staging/model.joblib"
model = joblib.load(model_path)
# Rollback in minutes
# 1) Set MODEL_STAGE=production to promote.
# 2) If issues appear, set MODEL_STAGE=staging or previous tag.
Drills and exercises
- Create a tiny dataset and verify a training script produces identical metrics across two runs.
- Lock dependencies and rebuild your Docker image; record image size and build time.
- Write a smoke test that loads a saved model and predicts on a 1‑row input.
- Tag a model as staging, run tests, then retag as production.
- Add an environment variable toggle to your inference service and practice a rollback.
Common mistakes and debugging tips
- Unpinned dependencies: Always pin exact versions; re-run old runs with the same lockfile.
- Non-deterministic training: Set seeds and framework determinism flags; document them.
- Data drift surprises: Version input data (or its hash) and log dataset identifiers with each run.
- Registry bypass: Never deploy artifacts not registered or tagged; enforce gates in CI.
- Missing smoke tests: At minimum, include a load-and-predict test in CI.
- Opaque failures: Keep run logs and metrics; include model version and git commit in logs.
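To make the last point concrete, the sketch below tags a tracked run with the current git commit and a model version. It is a standalone sketch, assuming git is on the path; in practice you would fold the set_tag calls into your existing training run.
# tag_run.py — record the git commit and model version with a tracked run.
import subprocess
import mlflow

def current_commit() -> str:
    # Short hash of the checked-out commit; assumes the script runs inside a git repo.
    return subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

if __name__ == "__main__":
    with mlflow.start_run():
        mlflow.set_tag("git_commit", current_commit())
        mlflow.set_tag("model_version", "0.1")  # or read this from your registry/stage tag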
Quick debug checklist
- What commit, data version/hash, and dependency lockfile were used?
- Are seeds set and stable?
- Did CI smoke tests pass on the exact artifact you deployed?
- Can you reproduce locally in the Docker image?
Mini project: From notebook to safe release
Goal: Turn a simple classifier into a production-ready, testable release with promotion and rollback.
- Reproduce: Convert a notebook into train.py with fixed seeds; track params/metrics; save artifact.
- Package: Lock dependencies and build a Docker image for training and inference.
- Test: Add smoke tests (load model, predict on 1 row) and a data schema check.
- Promote: Tag the best artifact as staging; run tests; then tag as production.
- Operate: Add an env toggle to switch stages; simulate a rollback by switching back.
Acceptance criteria
- Two runs with the same code, data, and seeds produce the same metric (within tolerance).
- CI passes smoke tests on the exact artifact to be promoted.
- Production toggle and rollback verified in under 5 minutes.
Practical projects
- Reproducible training template: A repo with seeds, tracking, tests, and Docker image that teammates can clone and run identically.
- Lightweight model registry: Folder/tag-based registry with Staging/Production and a promotion script.
- ML smoke-testing suite: Contract tests and data schema checks runnable in CI for any model artifact.
Learning path
- Start with Reproducibility Principles and Environments And Dependencies; build a reproducible training template.
- Add Model Promotion Practices over a simple registry and define approval gates.
- Adopt a Risk And Reliability Mindset with smoke tests, toggles, and rollback drills.
- Practice ML Lifecycle Ownership end-to-end with a mini project.
- Improve Collaboration With DS And Platform Teams by adopting shared artifacts and runbooks.
Next steps
- Extend CI to run small training jobs on pull requests to catch breaking changes early.
- Add simple monitoring (latency, error rate, basic data drift checks) to your inference service (see the drift-check sketch below).
- Document runbooks for incidents and on-call handoff.
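As a starting point for the drift check mentioned above, the sketch below compares per-feature means of recent inputs against the training baseline and flags features that shift by more than a few standard deviations. The in-memory arrays and the 3-sigma threshold are assumptions, not a standard.
# drift_check.py — flag features whose recent mean drifts far from the training baseline.
import numpy as np

def drifted_features(train: np.ndarray, recent: np.ndarray, n_sigma: float = 3.0) -> list[int]:
    train_mean = train.mean(axis=0)
    train_std = train.std(axis=0) + 1e-12  # avoid division by zero on constant features
    shift = np.abs(recent.mean(axis=0) - train_mean) / train_std
    return [i for i, s in enumerate(shift) if s > n_sigma]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.normal(size=(1000, 4))
    recent = baseline + np.array([0.0, 0.0, 5.0, 0.0])  # feature 2 has drifted
    print(drifted_features(baseline, recent))  # -> [2]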