Why this matters
Stages and promotion gates
1) Dev → Staging
- Reproducibility: run trains deterministically with pinned dependencies.
- Contract tests: features, schema, and API signature unchanged or versioned.
- Performance gate: meets or exceeds baseline by agreed deltas (e.g., +1.5% F1, −5% MAPE).
- Bias/fairness: disparity metrics within limits (e.g., demographic parity ratio ≥ 0.8).
- Security/compliance: no PII leaks; license scan passes.
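To make the Dev → Staging gates above enforceable, they can be expressed as an automated CI check that compares the candidate's metrics report against the baseline's and fails the build when any gate is missed. A minimal sketch follows, assuming illustrative metric names, report paths, and thresholds:

```python
# Illustrative Dev -> Staging gate check for CI; metric names, file paths,
# and thresholds are assumptions for this sketch.
import json
import sys

REQUIRED_F1_DELTA = 0.015        # candidate must beat baseline F1 by >= +1.5% absolute
MIN_PARITY_RATIO = 0.8           # demographic parity ratio gate

def failed_gates(baseline: dict, candidate: dict) -> list[str]:
    """Return the names of failed gates (an empty list means promote to staging)."""
    failures = []
    if candidate["f1"] - baseline["f1"] < REQUIRED_F1_DELTA:
        failures.append("performance: F1 improvement below +0.015")
    if candidate["demographic_parity_ratio"] < MIN_PARITY_RATIO:
        failures.append("fairness: demographic parity ratio below 0.8")
    if candidate["schema_version"] != baseline["schema_version"]:
        failures.append("contract: feature schema changed without a version bump")
    return failures

if __name__ == "__main__":
    with open("reports/baseline_metrics.json") as f:
        baseline = json.load(f)
    with open("reports/candidate_metrics.json") as f:
        candidate = json.load(f)
    failures = failed_gates(baseline, candidate)
    for failure in failures:
        print("GATE FAILED:", failure)
    sys.exit(1 if failures else 0)   # a non-zero exit blocks the promotion step in CI
```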
2) Staging → Production
- Load and latency: p95 latency below SLO; memory/CPU within budget.
- Shadow or canary results acceptable (no material regression vs. prod).
- Monitoring in place: drift, performance, and alerts configured.
- Rollback plan documented and tested.
- Approvals: model owner and business/data steward sign-off.
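The load and latency gate can be checked the same way from load-test output before sign-off. The sketch below assumes a plain list of request latencies and a peak-memory reading; the SLO values are placeholders:

```python
# Assumed check of staging load-test results against the latency and memory
# budget before a Staging -> Production promotion; SLO values are placeholders.
P95_LATENCY_SLO_MS = 200
MEMORY_BUDGET_MB = 2048

def p95(latencies_ms: list[float]) -> float:
    """p95 latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[rank]

def staging_gate(latencies_ms: list[float], peak_memory_mb: float) -> bool:
    return p95(latencies_ms) <= P95_LATENCY_SLO_MS and peak_memory_mb <= MEMORY_BUDGET_MB

# 1,000 simulated request latencies from a staging load test.
sample = [80 + (i % 50) * 2.5 for i in range(1000)]
print("p95 =", p95(sample), "ms; gate passes:", staging_gate(sample, peak_memory_mb=1500))
```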
3) Production strategies
- Blue/green: stand up a parallel production deployment and switch traffic to it once checks pass.
- Canary: route 1–10% of traffic to the new model, then ramp up as metrics hold.
- Champion/challenger: keep the current champion serving; run the challenger alongside it until it proves better.
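For canary and champion/challenger serving, one common pattern is to bucket users deterministically so that each user always sees the same variant while only a small fraction reaches the candidate. A rough sketch, with the 5% split as an assumption:

```python
# Sketch of deterministic canary routing: each user is hashed into a stable
# bucket, and only the canary fraction is served by the candidate model.
import hashlib

CANARY_FRACTION = 0.05   # start at 5%, ramp up as confidence grows

def route(user_id: str) -> str:
    """Return the variant that serves this user ('candidate' or 'champion')."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # stable value in [0, 1]
    return "candidate" if bucket < CANARY_FRACTION else "champion"

routed = [route(f"user-{i}") for i in range(10_000)]
print("share on candidate:", routed.count("candidate") / len(routed))
```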
4) Post-promotion
- Verify: compare live metrics to expectations for a set burn-in window (e.g., 24–72 hours).
- Finalize: tag version as production and archive reports.
- Observe: trigger rollback if alerts breach thresholds.
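Post-promotion verification can be reduced to a small rule: compare live metrics over the burn-in window against expectations plus agreed tolerances, and roll back on any breach. A sketch with assumed metric names and tolerances:

```python
# Illustrative burn-in check: compare live metrics from the burn-in window
# against expectations plus agreed tolerances and decide whether to roll back.
# Metric names, expectations, and tolerances are assumptions.
EXPECTATIONS = {"f1": 0.81, "p95_latency_ms": 150.0}
TOLERANCES = {"f1": -0.02, "p95_latency_ms": 30.0}   # allowed drop / allowed increase

def should_roll_back(live: dict) -> bool:
    if live["f1"] < EXPECTATIONS["f1"] + TOLERANCES["f1"]:
        return True
    if live["p95_latency_ms"] > EXPECTATIONS["p95_latency_ms"] + TOLERANCES["p95_latency_ms"]:
        return True
    return False

live_24h = {"f1": 0.805, "p95_latency_ms": 162.0}
print("roll back?", should_roll_back(live_24h))   # False: both metrics within tolerance
```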
Artifacts to track for every promotion
- Model version and registry ID.
- Training code commit hash and dependency lockfile.
- Training data fingerprint (hash, snapshot ID, or time bounds).
- Metrics report (offline + shadow/canary), including confidence intervals when relevant.
- Fairness/safety evaluation summary.
- Latency/cost profile and serving resource requirements.
- Promotion decision record with approvers and timestamps.
- Rollback plan reference and monitoring configuration.
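One lightweight way to keep these artifacts together is a single promotion record written at decision time and attached to the promotion ticket. The structure below is an assumed example rather than a standard registry schema:

```python
# A minimal, assumed structure for a promotion decision record; the field
# names and URIs are illustrative, not a standard registry schema.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class PromotionRecord:
    model_version: str
    registry_id: str
    code_commit: str
    lockfile_hash: str
    data_fingerprint: str
    metrics_report_uri: str
    fairness_report_uri: str
    latency_profile_uri: str
    rollback_plan_uri: str
    approvers: list[str]
    decided_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = PromotionRecord(
    model_version="fraud-clf-1.4.0",
    registry_id="models:/fraud-clf/14",
    code_commit="9f2c1ab",
    lockfile_hash="sha256:<lockfile-digest>",
    data_fingerprint="snapshot-2024-06-01",
    metrics_report_uri="s3://reports/fraud-clf-1.4.0/metrics.json",
    fairness_report_uri="s3://reports/fraud-clf-1.4.0/fairness.json",
    latency_profile_uri="s3://reports/fraud-clf-1.4.0/latency.json",
    rollback_plan_uri="wiki/fraud-clf-rollback",
    approvers=["model-owner@example.com", "data-steward@example.com"],
)
print(json.dumps(asdict(record), indent=2))   # attach this JSON to the promotion ticket
```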
Worked examples
Example 1 — Regression (MAPE) with drift gate
Scenario: A revenue forecast model must have MAPE ≤ 12%, and data drift (PSI) ≤ 0.2 compared to the last 30 days.
Baseline: MAPE 12.8%, PSI 0.08
Candidate: MAPE 10.9%, PSI 0.16
Latency: p95 = 120ms (SLO ≤ 200ms)
Decision: Promote to staging (passes gates), then canary in prod at 5% traffic with rollback if live MAPE > 12%.
Why it works: The candidate improves accuracy and stays within the drift and latency limits.
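The drift gate here is the Population Stability Index, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the reference and recent proportions in shared bins. A sketch of both gates, using synthetic data and quantile binning as assumptions:

```python
# Sketch of the Example 1 gates: MAPE <= 12% and PSI <= 0.2 between a
# reference window and the last 30 days. Quantile binning and the synthetic
# data are assumptions for illustration.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over quantile bins of the reference sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                     # catch out-of-range values
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)     # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

rng = np.random.default_rng(0)
reference = rng.normal(100, 15, 5_000)    # training-period feature values
recent = rng.normal(103, 15, 5_000)       # last 30 days, with a slight shift
print("PSI:", round(psi(reference, recent), 3), "drift gate:", psi(reference, recent) <= 0.2)

y_true = np.array([120.0, 95.0, 130.0, 110.0])
y_pred = np.array([115.0, 101.0, 124.0, 112.0])
print("MAPE:", round(mape(y_true, y_pred), 3), "accuracy gate:", mape(y_true, y_pred) <= 0.12)
```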
Example 2 — Classifier with canary and rollback
Scenario: Fraud classifier SLOs: F1 ≥ 0.80 (offline), p95 latency ≤ 150ms, false positive rate (FPR) must not increase by more than +1% absolute vs. champion.
Champion: F1 0.79, FPR 2.1%
Candidate: F1 0.82, FPR 2.9% (ΔFPR +0.8%)
Plan: Canary at 5%, alert if FPR > 3.1% or F1 < 0.79 on live labeled subset.
Promote after 24 hours if stable; roll back otherwise.
Why it works: The plan captures the overall F1 improvement while controlling the risk of extra false positives.
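The live alert rule can be evaluated directly from the confusion-matrix counts accumulated on the labeled subset of canary traffic. A sketch with made-up counts; the thresholds come from the scenario above:

```python
# Sketch of the Example 2 canary alert rule, evaluated from confusion-matrix
# counts on the labeled canary traffic. The counts below are made up.
FPR_ALERT = 0.031   # champion FPR 2.1% plus the +1% absolute budget
F1_FLOOR = 0.79     # do not fall below the champion's F1

def canary_alerts(tp: int, fp: int, tn: int, fn: int) -> list[str]:
    alerts = []
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    if fpr > FPR_ALERT:
        alerts.append(f"FPR {fpr:.3f} above {FPR_ALERT}")
    if f1 < F1_FLOOR:
        alerts.append(f"F1 {f1:.3f} below {F1_FLOOR}")
    return alerts

# 24 hours of labeled canary traffic (illustrative counts): no alerts fire.
print(canary_alerts(tp=450, fp=140, tn=4900, fn=80))
```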
Example 3 — Champion/Challenger A/B
Scenario: Recommendation model with business KPI (click-through rate) as primary, latency as guardrail.
Offline gains are modest, so deploy the challenger to 10% of users.
Promote if CTR uplift ≥ +0.7% (absolute) over 7 days with p-value ≤ 0.05 and p95 latency ≤ 180ms.
If uplift is < 0.7%, keep the challenger running for learning but do not promote.
Why it works: It uses a business-grounded threshold and a time-bounded decision rule.
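The promotion rule here pairs an absolute uplift threshold with a significance test; a one-sided two-proportion z-test is one reasonable choice. A standard-library sketch with made-up click and view counts:

```python
# Sketch of the Example 3 decision rule: absolute CTR uplift plus a one-sided
# two-proportion z-test, standard library only. Click/view counts are made up.
from math import sqrt, erf

MIN_UPLIFT = 0.007   # +0.7% absolute
ALPHA = 0.05

def ctr_decision(clicks_a: int, views_a: int, clicks_b: int, views_b: int) -> str:
    p_a, p_b = clicks_a / views_a, clicks_b / views_b        # champion, challenger CTR
    uplift = p_b - p_a
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = uplift / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))               # one-sided p-value
    if uplift >= MIN_UPLIFT and p_value <= ALPHA:
        return f"promote (uplift {uplift:+.3%}, p={p_value:.4f})"
    return f"hold (uplift {uplift:+.3%}, p={p_value:.4f})"

# Champion on 90% of traffic, challenger on 10%, after the 7-day window.
print(ctr_decision(clicks_a=31_500, views_a=900_000, clicks_b=4_300, views_b=100_000))
```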
Who this is for
- MLOps engineers establishing safe release workflows.
- ML engineers seeking reliable paths from notebooks to production.
- Tech leads who need auditability and lower incident risk.
Prerequisites
- Basic ML lifecycle understanding (train, validate, deploy, monitor).
- Comfort with version control, containers, and CI/CD concepts.
- Familiarity with model registries and artifact tracking.
- Knowledge of core metrics for your problem type (regression, classification, recommendation).
Learning path
- Define environments and responsibilities (Dev, Staging, Prod; who approves what).
- Draft a promotion policy with measurable gates and rollback.
- Automate gates in CI: reproducibility, tests, metrics checks.
- Add a progressive delivery strategy (canary or blue/green).
- Wire monitoring and drift alerts before production exposure.
Mini task — Turn a metric into a gate
Pick your model metric (e.g., F1). Choose a minimum improvement over the current baseline (e.g., +1.5% absolute). Write it as: “Promote only if F1 ≥ 0.81 and ≥ +0.015 vs. baseline.”
Promotion readiness checklist
- Reproducible training (code + data + dependencies pinned)
- Contract tests pass (schema, API, feature expectations)
- Metrics meet thresholds with confidence bounds
- Fairness and safety checks pass
- Latency and memory fit SLO/budget
- Monitoring + alerts configured pre-promotion
- Canary/blue-green plan and rollback steps documented
- Approvals recorded with timestamps
Exercises
These mirror the graded exercises below. Draft your answer here first, then compare.
Exercise 1 — Write a promotion policy
Write a one-page promotion policy for a binary classifier used in customer risk scoring. Include gates for: performance vs. baseline, latency, fairness, drift, approvals, and rollback triggers.
Exercise 2 — Decide: Promote, Canary, or Hold?
Baseline: F1 0.78, AUC 0.85, p95 latency 130ms. Candidate: F1 0.81, AUC 0.86, disparity ratio 0.77 (threshold ≥ 0.8), drift PSI 0.12, p95 latency 170ms (SLO ≤ 180ms). What do you do and why?
Common mistakes and self-check
- Promoting on a single metric: Use primary + guardrails (latency, cost, fairness).
- No rollback plan: Write explicit triggers and steps; test rollback in staging.
- Ignoring data lineage: Fingerprint training data; keep snapshots/time bounds.
- Environment drift: Pin dependencies; test container image in staging.
- Manual-only or auto-only: Combine automation for speed with approvals for safety.
- Under-monitoring: Set alerts before exposing traffic; verify after promotion.
Self-check: Am I ready to promote?
- Do I have side-by-side baseline vs. candidate metrics with deltas?
- Can I rebuild the model byte-for-byte from stored artifacts?
- Are drift, fairness, and latency within agreed thresholds?
- Is the rollback procedure tested and documented?
- Are approvers identified and captured?
Practical projects
- Build a CI pipeline that registers models, runs metric gates, and creates a promotion ticket with all artifacts.
- Implement a canary deployment and automatic rollback on metric regression using production telemetry.
- Create a champion/challenger service that routes a small percentage of traffic and reports uplift with confidence.
Mini challenge
Design gates for a recommendation model where the primary KPI is CTR uplift and a guardrail is add-to-cart rate. Specify exact thresholds, sample sizes, and how long you will run a canary before deciding.
Next steps
- Add policy-as-code for gates (e.g., declarative checks in CI); see the sketch after this list.
- Integrate feature store lineage into promotion records.
- Use blue/green or multi-region promotion for zero-downtime rollouts.
- Standardize promotion templates across teams for consistency.
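As a starting point for policy-as-code, the gates can be declared as data and evaluated by one generic checker in CI, so thresholds live in a reviewed policy file rather than scattered across scripts. A minimal sketch with assumed gate names and operators:

```python
# Minimal policy-as-code sketch: gates declared as data and evaluated by one
# generic checker in CI. The gate names and operators are assumptions.
import operator

POLICY = {
    "f1":                       (">=", 0.81),
    "f1_delta_vs_baseline":     (">=", 0.015),
    "demographic_parity_ratio": (">=", 0.80),
    "psi":                      ("<=", 0.20),
    "p95_latency_ms":           ("<=", 180),
}
OPS = {">=": operator.ge, "<=": operator.le}

def evaluate(policy: dict, measured: dict) -> dict:
    """Return {gate_name: passed} for every declared gate."""
    return {name: OPS[op](measured[name], threshold)
            for name, (op, threshold) in policy.items()}

measured = {"f1": 0.82, "f1_delta_vs_baseline": 0.03,
            "demographic_parity_ratio": 0.84, "psi": 0.12, "p95_latency_ms": 170}
results = evaluate(POLICY, measured)
print(results, "-> promote" if all(results.values()) else "-> hold")
```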