Why this matters
Stages and promotion gates
1) Dev → Staging
- Reproducibility: run trains deterministically with pinned dependencies.
- Contract tests: features, schema, and API signature unchanged or versioned.
- Performance gate: meets or exceeds baseline by agreed deltas (e.g., +1.5% F1, −5% MAPE).
- Bias/fairness: disparity metrics within limits (e.g., demographic parity ratio ≥ 0.8).
- Security/compliance: no PII leaks; license scan passes.
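To make the Dev → Staging gates above enforceable, they can be expressed as an automated CI check that compares the candidate's metrics report against the baseline's and fails the build when any gate is missed. A minimal sketch follows, assuming illustrative metric names, report paths, and thresholds:

```python
# Illustrative Dev -> Staging gate check for CI; metric names, file paths,
# and thresholds are assumptions for this sketch.
import json
import sys

REQUIRED_F1_DELTA = 0.015        # candidate must beat baseline F1 by >= +1.5% absolute
MIN_PARITY_RATIO = 0.8           # demographic parity ratio gate

def failed_gates(baseline: dict, candidate: dict) -> list[str]:
    """Return the names of failed gates (an empty list means promote to staging)."""
    failures = []
    if candidate["f1"] - baseline["f1"] < REQUIRED_F1_DELTA:
        failures.append("performance: F1 improvement below +0.015")
    if candidate["demographic_parity_ratio"] < MIN_PARITY_RATIO:
        failures.append("fairness: demographic parity ratio below 0.8")
    if candidate["schema_version"] != baseline["schema_version"]:
        failures.append("contract: feature schema changed without a version bump")
    return failures

if __name__ == "__main__":
    with open("reports/baseline_metrics.json") as f:
        baseline = json.load(f)
    with open("reports/candidate_metrics.json") as f:
        candidate = json.load(f)
    failures = failed_gates(baseline, candidate)
    for failure in failures:
        print("GATE FAILED:", failure)
    sys.exit(1 if failures else 0)   # a non-zero exit blocks the promotion step in CI
```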
2) Staging → Production
- Load and latency: p95 latency below SLO; memory/CPU within budget.
- Shadow or canary results acceptable (no material regression vs. prod).
- Monitoring in place: drift, performance, and alerts configured.
- Rollback plan documented and tested.
- Approvals: model owner and business/data steward sign-off.
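The load and latency gate can be checked the same way from load-test output before sign-off. The sketch below assumes a plain list of request latencies and a peak-memory reading; the SLO values are placeholders:

```python
# Assumed check of staging load-test results against the latency and memory
# budget before a Staging -> Production promotion; SLO values are placeholders.
P95_LATENCY_SLO_MS = 200
MEMORY_BUDGET_MB = 2048

def p95(latencies_ms: list[float]) -> float:
    """p95 latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[rank]

def staging_gate(latencies_ms: list[float], peak_memory_mb: float) -> bool:
    return p95(latencies_ms) <= P95_LATENCY_SLO_MS and peak_memory_mb <= MEMORY_BUDGET_MB

# 1,000 simulated request latencies from a staging load test.
sample = [80 + (i % 50) * 2.5 for i in range(1000)]
print("p95 =", p95(sample), "ms; gate passes:", staging_gate(sample, peak_memory_mb=1500))
```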
3) Production strategies
- Blue/green: stand up a parallel production deployment and switch traffic to it once checks pass.
- Canary: route 1–10% of traffic to the new model, then ramp up as metrics hold.
- Champion/challenger: keep the current champion serving; run the challenger alongside it until it proves better.
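For canary and champion/challenger serving, one common pattern is to bucket users deterministically so that each user always sees the same variant while only a small fraction reaches the candidate. A rough sketch, with the 5% split as an assumption:

```python
# Sketch of deterministic canary routing: each user is hashed into a stable
# bucket, and only the canary fraction is served by the candidate model.
import hashlib

CANARY_FRACTION = 0.05   # start at 5%, ramp up as confidence grows

def route(user_id: str) -> str:
    """Return the variant that serves this user ('candidate' or 'champion')."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # stable value in [0, 1]
    return "candidate" if bucket < CANARY_FRACTION else "champion"

routed = [route(f"user-{i}") for i in range(10_000)]
print("share on candidate:", routed.count("candidate") / len(routed))
```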
4) Post-promotion
- Verify: compare live metrics to expectations for a set burn-in window (e.g., 24–72 hours).
- Finalize: tag version as production and archive reports.
- Observe: trigger rollback if alerts breach thresholds.
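Post-promotion verification can be reduced to a small rule: compare live metrics over the burn-in window against expectations plus agreed tolerances, and roll back on any breach. A sketch with assumed metric names and tolerances:

```python
# Illustrative burn-in check: compare live metrics from the burn-in window
# against expectations plus agreed tolerances and decide whether to roll back.
# Metric names, expectations, and tolerances are assumptions.
EXPECTATIONS = {"f1": 0.81, "p95_latency_ms": 150.0}
TOLERANCES = {"f1": -0.02, "p95_latency_ms": 30.0}   # allowed drop / allowed increase

def should_roll_back(live: dict) -> bool:
    if live["f1"] < EXPECTATIONS["f1"] + TOLERANCES["f1"]:
        return True
    if live["p95_latency_ms"] > EXPECTATIONS["p95_latency_ms"] + TOLERANCES["p95_latency_ms"]:
        return True
    return False

live_24h = {"f1": 0.805, "p95_latency_ms": 162.0}
print("roll back?", should_roll_back(live_24h))   # False: both metrics within tolerance
```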
Artifacts to track for every promotion
- Model version and registry ID.
- Training code commit hash and dependency lockfile.
- Training data fingerprint (hash, snapshot ID, or time bounds).
- Metrics report (offline + shadow/canary), including confidence intervals when relevant.
- Fairness/safety evaluation summary.
- Latency/cost profile and serving resource requirements.
- Promotion decision record with approvers and timestamps.
- Rollback plan reference and monitoring configuration.
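One lightweight way to keep these artifacts together is a single promotion record written at decision time and attached to the promotion ticket. The structure below is an assumed example rather than a standard registry schema:

```python
# A minimal, assumed structure for a promotion decision record; the field
# names and URIs are illustrative, not a standard registry schema.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class PromotionRecord:
    model_version: str
    registry_id: str
    code_commit: str
    lockfile_hash: str
    data_fingerprint: str
    metrics_report_uri: str
    fairness_report_uri: str
    latency_profile_uri: str
    rollback_plan_uri: str
    approvers: list[str]
    decided_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = PromotionRecord(
    model_version="fraud-clf-1.4.0",
    registry_id="models:/fraud-clf/14",
    code_commit="9f2c1ab",
    lockfile_hash="sha256:<lockfile-digest>",
    data_fingerprint="snapshot-2024-06-01",
    metrics_report_uri="s3://reports/fraud-clf-1.4.0/metrics.json",
    fairness_report_uri="s3://reports/fraud-clf-1.4.0/fairness.json",
    latency_profile_uri="s3://reports/fraud-clf-1.4.0/latency.json",
    rollback_plan_uri="wiki/fraud-clf-rollback",
    approvers=["model-owner@example.com", "data-steward@example.com"],
)
print(json.dumps(asdict(record), indent=2))   # attach this JSON to the promotion ticket
```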
Worked examples
Example 1 — Regression (MAPE) with drift gate
Scenario: A revenue forecast model must have MAPE ≤ 12%, and data drift (PSI) ≤ 0.2 compared to the last 30 days.
Baseline: MAPE 12.8%, PSI 0.08
Candidate: MAPE 10.9%, PSI 0.16
Latency: p95 = 120ms (SLO ≤ 200ms)
Decision: Promote to staging (passes gates), then canary in prod at 5% traffic with rollback if live MAPE > 12%.
Why it works: The candidate improves accuracy and stays within the drift and latency limits.
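The drift gate here is the Population Stability Index, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the reference and recent proportions in shared bins. A sketch of both gates, using synthetic data and quantile binning as assumptions:

```python
# Sketch of the Example 1 gates: MAPE <= 12% and PSI <= 0.2 between a
# reference window and the last 30 days. Quantile binning and the synthetic
# data are assumptions for illustration.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over quantile bins of the reference sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                     # catch out-of-range values
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)     # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

rng = np.random.default_rng(0)
reference = rng.normal(100, 15, 5_000)    # training-period feature values
recent = rng.normal(103, 15, 5_000)       # last 30 days, with a slight shift
print("PSI:", round(psi(reference, recent), 3), "drift gate:", psi(reference, recent) <= 0.2)

y_true = np.array([120.0, 95.0, 130.0, 110.0])
y_pred = np.array([115.0, 101.0, 124.0, 112.0])
print("MAPE:", round(mape(y_true, y_pred), 3), "accuracy gate:", mape(y_true, y_pred) <= 0.12)
```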
Example 2 — Classifier with canary and rollback
Scenario: Fraud classifier SLOs: F1 ≥ 0.80 (offline), p95 latency ≤ 150ms, false positive rate (FPR) must not increase by more than +1% absolute vs. champion.
Champion: F1 0.79, FPR 2.1%
Candidate: F1 0.82, FPR 2.9% (ΔFPR +0.8%)
Plan: Canary at 5%, alert if FPR > 3.1% or F1 < 0.79 on live labeled subset.
Promote after 24 hours if stable; roll back otherwise.
Why it works: The plan captures the overall F1 improvement while controlling the risk of extra false positives.
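The live alert rule can be evaluated directly from the confusion-matrix counts accumulated on the labeled subset of canary traffic. A sketch with made-up counts; the thresholds come from the scenario above:

```python
# Sketch of the Example 2 canary alert rule, evaluated from confusion-matrix
# counts on the labeled canary traffic. The counts below are made up.
FPR_ALERT = 0.031   # champion FPR 2.1% plus the +1% absolute budget
F1_FLOOR = 0.79     # do not fall below the champion's F1

def canary_alerts(tp: int, fp: int, tn: int, fn: int) -> list[str]:
    alerts = []
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    if fpr > FPR_ALERT:
        alerts.append(f"FPR {fpr:.3f} above {FPR_ALERT}")
    if f1 < F1_FLOOR:
        alerts.append(f"F1 {f1:.3f} below {F1_FLOOR}")
    return alerts

# 24 hours of labeled canary traffic (illustrative counts): no alerts fire.
print(canary_alerts(tp=450, fp=140, tn=4900, fn=80))
```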
Example 3 — Champion/Challenger A/B
Scenario: Recommendation model with business KPI (click-through rate) as primary, latency as guardrail.
Offline gains are modest, so deploy the challenger to 10% of users.
Promote if CTR uplift ≥ +0.7% (absolute) over 7 days with p-value ≤ 0.05 and p95 latency ≤ 180ms.
If uplift is < 0.7%, keep the challenger running for learning but do not promote.
Why it works: It uses a business-grounded threshold and a time-bounded decision rule.
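The promotion rule here pairs an absolute uplift threshold with a significance test; a one-sided two-proportion z-test is one reasonable choice. A standard-library sketch with made-up click and view counts:

```python
# Sketch of the Example 3 decision rule: absolute CTR uplift plus a one-sided
# two-proportion z-test, standard library only. Click/view counts are made up.
from math import sqrt, erf

MIN_UPLIFT = 0.007   # +0.7% absolute
ALPHA = 0.05

def ctr_decision(clicks_a: int, views_a: int, clicks_b: int, views_b: int) -> str:
    p_a, p_b = clicks_a / views_a, clicks_b / views_b        # champion, challenger CTR
    uplift = p_b - p_a
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = uplift / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))               # one-sided p-value
    if uplift >= MIN_UPLIFT and p_value <= ALPHA:
        return f"promote (uplift {uplift:+.3%}, p={p_value:.4f})"
    return f"hold (uplift {uplift:+.3%}, p={p_value:.4f})"

# Champion on 90% of traffic, challenger on 10%, after the 7-day window.
print(ctr_decision(clicks_a=31_500, views_a=900_000, clicks_b=4_300, views_b=100_000))
```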
Who this is for
- MLOps engineers establishing safe release workflows.
- ML engineers seeking reliable paths from notebooks to production.
- Tech leads who need auditability and lower incident risk.
Prerequisites
- Basic ML lifecycle understanding (train, validate, deploy, monitor).
- Comfort with version control, containers, and CI/CD concepts.
- Familiarity with model registries and artifact tracking.
- Knowledge of core metrics for your problem type (regression, classification, recommendation).
Learning path
- Define environments and responsibilities (Dev, Staging, Prod; who approves what).
- Draft a promotion policy with measurable gates and rollback.
- Automate gates in CI: reproducibility, tests, metrics checks.
- Add a progressive delivery strategy (canary or blue/green).
- Wire monitoring and drift alerts before production exposure.
Mini task — Turn a metric into a gate
Pick your model metric (e.g., F1). Choose a minimum improvement over the current baseline (e.g., +1.5% absolute). Write it as: “Promote only if F1 ≥ 0.81 and ≥ +0.015 vs. baseline.”
Promotion readiness checklist
- Reproducible training (code + data + dependencies pinned)
- Contract tests pass (schema, API, feature expectations)
- Metrics meet thresholds with confidence bounds
- Fairness and safety checks pass
- Latency and memory fit SLO/budget
- Monitoring + alerts configured pre-promotion
- Canary/blue-green plan and rollback steps documented
- Approvals recorded with timestamps
Exercises
These mirror the graded exercises below. Draft your answer here first, then compare.
Exercise 1 — Write a promotion policy
Write a one-page promotion policy for a binary classifier used in customer risk scoring. Include gates for: performance vs. baseline, latency, fairness, drift, approvals, and rollback triggers.
Exercise 2 — Decide: Promote, Canary, or Hold?
Baseline: F1 0.78, AUC 0.85, p95 latency 130ms. Candidate: F1 0.81, AUC 0.86, disparity ratio 0.77 (threshold ≥ 0.8), drift PSI 0.12, p95 latency 170ms (SLO ≤ 180ms). What do you do and why?
Common mistakes and self-check
- Promoting on a single metric: Use primary + guardrails (latency, cost, fairness).
- No rollback plan: Write explicit triggers and steps; test rollback in staging.
- Ignoring data lineage: Fingerprint training data; keep snapshots/time bounds.
- Environment drift: Pin dependencies; test container image in staging.
- Manual-only or auto-only: Combine automation for speed with approvals for safety.
- Under-monitoring: Set alerts before exposing traffic; verify after promotion.
Self-check: Am I ready to promote?
- Do I have side-by-side baseline vs. candidate metrics with deltas?
- Can I rebuild the model byte-for-byte from stored artifacts?
- Are drift, fairness, and latency within agreed thresholds?
- Is the rollback procedure tested and documented?
- Are approvers identified and captured?
Practical projects
- Build a CI pipeline that registers models, runs metric gates, and creates a promotion ticket with all artifacts.
- Implement a canary deployment and automatic rollback on metric regression using production telemetry.
- Create a champion/challenger service that routes a small percentage of traffic and reports uplift with confidence.
Mini challenge
Design gates for a recommendation model where the primary KPI is CTR uplift and a guardrail is add-to-cart rate. Specify exact thresholds, sample sizes, and how long you will run a canary before deciding.
Next steps
- Add policy-as-code for gates (e.g., declarative checks in CI); see the sketch after this list.
- Integrate feature store lineage into promotion records.
- Use blue/green or multi-region promotion for zero-downtime rollouts.
- Standardize promotion templates across teams for consistency.
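As a starting point for policy-as-code, the gates can be declared as data and evaluated by one generic checker in CI, so thresholds live in a reviewed policy file rather than scattered across scripts. A minimal sketch with assumed gate names and operators:

```python
# Minimal policy-as-code sketch: gates declared as data and evaluated by one
# generic checker in CI. The gate names and operators are assumptions.
import operator

POLICY = {
    "f1":                       (">=", 0.81),
    "f1_delta_vs_baseline":     (">=", 0.015),
    "demographic_parity_ratio": (">=", 0.80),
    "psi":                      ("<=", 0.20),
    "p95_latency_ms":           ("<=", 180),
}
OPS = {">=": operator.ge, "<=": operator.le}

def evaluate(policy: dict, measured: dict) -> dict:
    """Return {gate_name: passed} for every declared gate."""
    return {name: OPS[op](measured[name], threshold)
            for name, (op, threshold) in policy.items()}

measured = {"f1": 0.82, "f1_delta_vs_baseline": 0.03,
            "demographic_parity_ratio": 0.84, "psi": 0.12, "p95_latency_ms": 170}
results = evaluate(POLICY, measured)
print(results, "-> promote" if all(results.values()) else "-> hold")
```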