Who this is for
This lesson is for MLOps Engineers and ML practitioners who want a reliable, repeatable way to take ML code from commit to production. If you touch data pipelines, model training, or model-serving systems, this will help you ship faster with fewer surprises.
Prerequisites
- Basic Git and branching (e.g., trunk-based or feature branches).
- Comfort with containers (Docker) and Python packaging.
- Familiarity with unit testing and CI basics.
- Understanding of model evaluation metrics for your problem (e.g., AUC, RMSE, latency).
Why this matters
Real MLOps tasks that rely on a solid build–test–release workflow:
- Shipping a new model version behind a safe gate so bad models never reach users.
- Rolling back quickly when a canary shows degraded metrics or rising errors.
- Reproducing a model from an audit request using pinned data, code commit, and environment.
- Safely updating a feature pipeline schema without breaking real-time services.
Concept explained simply
Build–Test–Release is your conveyor belt from code to production. You package everything needed to run (build), prove it works and meets quality bars (test), then roll it out safely (release). In ML, you also verify data health, evaluation metrics, and fairness, not only code.
Mental model
Imagine three gates:
- Gate 1: Build creates immutable artifacts: code wheel, Docker image, and optionally a trained candidate model on a small sample.
- Gate 2: Test checks code correctness, data contracts, training on a sample, and performance against a baseline. Only artifacts that pass move forward.
- Gate 3: Release promotes a version to staging and then production with progressive rollout and automatic rollback triggers.
Build phase (produce immutable, reproducible artifacts)
- Lock environments: use a Dockerfile with pinned versions and a requirements lockfile (e.g., generated by pip freeze).
- Package code: build a Python wheel or similar artifact; include a short metadata file with the commit SHA and timestamp (see the sketch after this list).
- Create runtime image: build a Docker image for training/serving with labels for commit SHA and version.
- Optional lightweight training: train on a small, fixed dataset snapshot for quick sanity metrics and to catch obvious regressions early.
- Store artifacts: push to artifact registry and tag with commit SHA and semantic version.
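A minimal sketch of the metadata step is below, assuming the build runs inside a Git checkout; the file name build_metadata.json and its fields are illustrative, not a required format.

# write_build_metadata.py -- illustrative sketch; file name and fields are assumptions
import json
import subprocess
from datetime import datetime, timezone

def write_build_metadata(path: str = "build_metadata.json") -> dict:
    # Resolve the commit SHA of the current checkout (assumes git is on PATH).
    commit_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    metadata = {
        "commit_sha": commit_sha,
        "built_at": datetime.now(timezone.utc).isoformat(),
        "artifact_version": "0.0.0",  # replace with your semantic version source
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

if __name__ == "__main__":
    print(write_build_metadata())

The same dictionary can be reused as Docker image labels so the wheel and the image point at the same commit.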
Build checklist
- Deterministic Docker build (no unpinned “latest”).
- Wheel built from a clean environment.
- Commit SHA embedded in image labels and model metadata.
- Small sample dataset snapshot stored and versioned.
Test phase (prove quality and safety)
Run fast tests first, slow tests later. Typical layers:
- Static and unit tests: linting, import tests, feature transformations, loss functions.
- Data quality tests: schema, ranges, nulls, categorical domain; validate both training data snapshot and live sample.
- Integration tests: training on a small sample, evaluation against a baseline, threshold gates (e.g., AUC must not drop by more than 1%).
- Security and compliance: dependency scan, secret scan.
- Performance smoke: micro-benchmarks and serving latency on a small input set.
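To make the data quality layer concrete, here is a minimal pandas sketch; the column names, ranges, and allowed categories are hypothetical stand-ins for whatever your data contract defines.

# data_quality_check.py -- minimal sketch; the schema values below are hypothetical
import sys
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "tenure_days", "plan"}   # assumed contract
ALLOWED_PLANS = {"free", "pro", "enterprise"}           # assumed categorical domain

def check_dataframe(df: pd.DataFrame) -> list:
    errors = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    if "user_id" in df.columns and df["user_id"].isnull().any():
        errors.append("user_id contains nulls")
    if "tenure_days" in df.columns and (df["tenure_days"] < 0).any():
        errors.append("tenure_days contains negative values")
    if "plan" in df.columns and not set(df["plan"].dropna().unique()) <= ALLOWED_PLANS:
        errors.append("plan contains values outside the allowed domain")
    return errors

if __name__ == "__main__":
    snapshot = pd.read_parquet(sys.argv[1])   # path to the versioned data snapshot
    problems = check_dataframe(snapshot)
    if problems:
        print("\n".join(problems))
        sys.exit(1)                           # non-zero exit fails the CI step

Run the same function against both the training snapshot and a small live sample, as described above.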
Typical gates
- Metrics: candidate >= champion - tolerance (e.g., AUC drop ≤ 0.01).
- Latency: P95 inference latency under defined limit on staging.
- Fairness: disparity across key segments below threshold.
- Data: no breaking schema changes without migration flag.
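The metric gate in the first bullet can be a few lines of Python run in CI; this sketch assumes candidate and champion metrics are stored as JSON files, whose paths and keys are illustrative.

# metric_gate.py -- sketch of a champion-vs-candidate gate; file layout is assumed
import json
import sys

TOLERANCE = 0.01  # maximum allowed AUC drop versus the champion

def gate(candidate_path: str, champion_path: str) -> None:
    with open(candidate_path) as f:
        candidate_auc = json.load(f)["auc"]
    with open(champion_path) as f:
        champion_auc = json.load(f)["auc"]
    if candidate_auc < champion_auc - TOLERANCE:
        print(f"FAIL: candidate AUC {candidate_auc:.4f} is more than "
              f"{TOLERANCE} below champion {champion_auc:.4f}")
        sys.exit(1)  # non-zero exit blocks promotion
    print(f"PASS: candidate AUC {candidate_auc:.4f} vs champion {champion_auc:.4f}")

if __name__ == "__main__":
    gate(sys.argv[1], sys.argv[2])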
Release phase (promote, roll out, and watch)
- Promote: push the approved artifact to staging. Run end-to-end smoke and replay traffic if possible.
- Progressive delivery: canary or shadow for online services; batch models run on a small slice first.
- Observability: watch technical (errors, latency) and model metrics (drift, performance).
- Auto rollback: define clear health checks and thresholds that trigger rollback.
- Human-in-the-loop: require approval for promotions that affect high-risk areas.
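One way to express the auto-rollback trigger is a small health-check function that the deployment tool polls during the bake period; the metric names and thresholds here are placeholders for whatever your monitoring stack exposes.

# rollback_check.py -- sketch; metric names and thresholds are assumptions
ERROR_RATE_LIMIT = 0.02       # roll back above 2% errors
P95_LATENCY_LIMIT_MS = 120    # roll back above the latency budget

def should_rollback(metrics: dict) -> bool:
    # Returns True if any health check fails; the caller decides how to roll back.
    return (
        metrics.get("error_rate", 0.0) > ERROR_RATE_LIMIT
        or metrics.get("p95_latency_ms", 0.0) > P95_LATENCY_LIMIT_MS
    )

# Example poll: live values would come from your monitoring system.
print(should_rollback({"error_rate": 0.035, "p95_latency_ms": 90}))  # True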
Release checklist
- Artifact version and commit SHA recorded in deployment config.
- Automatic rollback policy defined and tested.
- Dashboards and alerts configured for both system and model metrics.
- Approval step set for production promotion.
Worked examples
Example 1: Batch churn model (daily job)
- Build: create wheel and training Docker image with pinned libs; embed commit SHA.
- Test: run unit tests; validate data snapshot (schema, ranges); train on 5% sample; compare AUC to champion (tolerance 0.01).
- Release: if pass, register candidate model v1.12 with metadata (dataset hash, metrics). Promote to staging, score a 10% user sample, verify lift. Approve and run full batch overnight. Auto-rollback if next-day monitoring shows lift below baseline or data drift above threshold.
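For the registration step in this example, one lightweight approach is to hash the dataset snapshot and record it alongside the metrics and commit; the paths and fields below are illustrative, not a specific model registry's API.

# register_candidate.py -- illustrative sketch; paths and fields are assumptions
import hashlib
import json

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

entry = {
    "model_version": "1.12",
    "dataset_hash": file_sha256("snapshots/churn_2025_01.parquet"),  # assumed path
    "metrics": json.load(open("candidate_metrics.json")),            # e.g., {"auc": 0.83}
    "commit_sha": "<SHA>",   # injected by the build step
}
with open("registry_entry.json", "w") as f:
    json.dump(entry, f, indent=2)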
Example 2: Real-time recommendations service
- Build: service Docker image; include model v3.4 and feature transform code; lock OS and Python deps.
- Test: contract tests for request/response schema; latency micro-benchmark; sample inference parity vs offline pipeline.
- Release: canary 10% traffic for 30 minutes; health checks on error rate, P95 latency, click-through rate proxy. Roll forward to 100% if stable; otherwise rollback automatically.
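The latency micro-benchmark and P95 health check for this service could look like the sketch below; the endpoint URL, payload, and request count are hypothetical.

# latency_smoke.py -- sketch of a P95 latency check; URL and payload are assumptions
import statistics
import sys
import time

import requests

URL = "http://staging.internal/predict"          # hypothetical staging endpoint
PAYLOAD = {"user_id": 123, "context": "homepage"}
N_REQUESTS = 200
P95_LIMIT_MS = 120

latencies_ms = []
for _ in range(N_REQUESTS):
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=2)
    resp.raise_for_status()                      # contract: must return 2xx
    latencies_ms.append((time.perf_counter() - start) * 1000)

p95 = statistics.quantiles(latencies_ms, n=100)[94]   # 95th percentile
print(f"P95 latency: {p95:.1f} ms over {N_REQUESTS} requests")
sys.exit(0 if p95 <= P95_LIMIT_MS else 1)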
Example 3: Feature pipeline schema change
- Build: new feature code packaged; migration script included.
- Test: data contract checks (no dropped columns without deprecation flag); backfill on a small historical window; training-on-sample confirms metrics parity.
- Release: dual-write old and new features in staging; shadow serve; after stability, switch consumers to new schema and remove old after deprecation window.
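A data-contract check for this schema change might compare old and new column sets and allow drops only when they are explicitly flagged; the column names and deprecation list below are illustrative.

# schema_contract.py -- sketch; column names and deprecation list are assumptions
OLD_SCHEMA = {"user_id", "clicks_7d", "purchases_30d"}
NEW_SCHEMA = {"user_id", "clicks_7d", "purchases_28d"}
DEPRECATED = {"purchases_30d"}   # columns explicitly flagged for removal

def breaking_changes(old: set, new: set, deprecated: set) -> set:
    # A dropped column is breaking unless it carries a deprecation flag.
    return (old - new) - deprecated

violations = breaking_changes(OLD_SCHEMA, NEW_SCHEMA, DEPRECATED)
if violations:
    raise SystemExit(f"Breaking schema change, undeclared drops: {sorted(violations)}")
print("Schema change OK: all dropped columns are flagged as deprecated.")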
Exercises
Do these now. Then take the Quick Test below.
Exercise 1: Draft a pipeline
Create a minimal Build–Test–Release pipeline outline for an ML service. Fill in key steps and gates.
Starter template
pipeline:
  build:
    - step: build python wheel (pin versions)
    - step: build docker image (label with commit SHA)
    - step: store artifacts (artifact registry)
  test:
    - step: run unit tests and lint
    - step: data contract checks (schema, ranges)
    - step: train on sample (5%)
    - gate: metric >= champion - 0.01
    - step: dependency and secret scan
  release:
    - step: deploy to staging
    - step: smoke tests + latency check
    - gate: approval required
    - step: canary 10% with auto rollback
    - step: full rollout
When finished, compare with the solution inside the Exercises section below.
Exercise 2: Define gates
You have a binary classifier. The current champion AUC is 0.82 and P95 latency target is 120 ms. Propose production gates for both quality and performance, including rollback signals.
Common mistakes and self-check
- Unpinned dependencies produce non-reproducible models. Self-check: can you rebuild the same image and get the same package versions today?
- Skipping data tests; code passes but model degrades due to bad data. Self-check: does every run validate schema and distributions before training?
- Training on full datasets in PR CI makes pipelines slow and flaky. Self-check: do PRs use a small, fixed sample?
- No baseline comparison; “green” builds still ship worse models. Self-check: does each candidate compare to a champion with a defined tolerance?
- Manual, undocumented releases. Self-check: can a teammate promote the same version with one approval step?
- Missing rollback automation. Self-check: what exact signal triggers rollback and has it been tested in staging?
- Secrets in logs. Self-check: are secrets masked and stored in a secure secret manager?
Practical projects
- Project A: Containerize a training job with a small sample dataset. Add unit tests, data checks, and metric gate vs a stored baseline.
- Project B: Build a model-serving image with a simple REST interface. Add contract tests, latency checks, and canary deployment with rollback.
- Project C: Implement a data contract for one feature pipeline and enforce it in CI before training runs.
Learning path
- Before this: Version control for data and models, reproducible environments.
- This lesson: Build–Test–Release workflow design with gates and progressive delivery.
- Next: Model registry practices, production monitoring and drift, automated rollback playbooks.
Next steps
- Implement the minimal pipeline in a sandbox repo this week.
- Add just two gates: metric vs champion and P95 latency.
- Measure: lead time to change and change failure rate. Improve by making slow steps optional or moving them to nightly jobs.
Mini challenge
Within one day, wire a single metric gate into your CI: train on a fixed 5% sample and fail the build if performance drops beyond your tolerance. Bonus: add an automatic rollback condition for canary deployments.
Exercises — solutions
Exercise 1 solution
pipeline:
  build:
    - step: install with lockfile (requirements.txt + hashes)
    - step: build wheel (include version and commit SHA)
    - step: docker build --label commit=<SHA> --label model_version=<ver>
    - step: push wheel and image to registry
  test:
    - step: lint + unit tests (transform functions, utils)
    - step: data checks (schema, ranges, nulls)
    - step: train_on_sample.py --data snapshot_2025_01 --size 0.05
    - step: evaluate vs champion (fail if AUC_drop > 0.01)
    - step: dependency vulnerability scan + secret scan
  release:
    - step: deploy to staging with artifact version
    - step: smoke endpoint + P95 latency <= 120ms on 1k requests
    - gate: manual approval by MLOps on green
    - step: canary 10% traffic, 30 min observation
      condition: rollback if error_rate > 2% or CTR_drop > 3%
    - step: promote to 100% + tag release
Exercise 2 solution
- Quality gate: require candidate AUC >= 0.81 on the same evaluation split (champion 0.82 minus the 0.01 tolerance); fail otherwise.
- Performance gate: on the staging canary, P95 latency must be ≤ 120 ms over at least 1k requests.
- Rollback signals: roll back automatically if the error rate exceeds 2% over 5 minutes, or if an online proxy metric (e.g., click-through rate) in the canary drops by more than 3% versus the control group during the bake period.