Why this matters
Stage promotion (Dev → Staging → Prod) makes ML releases repeatable, safe, and auditable. As an MLOps Engineer you will:
- Gate model releases with measurable checks (accuracy, latency, fairness, cost).
- Promote versions in a model registry with approvals and signed artifacts.
- Run canary/shadow deployments and roll back quickly if needed.
- Keep lineage: which data, code, and config produced the model now in Prod.
Who this is for
- Engineers and data scientists deploying models beyond notebooks.
- MLOps practitioners creating reliable, auditable model release pipelines.
Prerequisites
- Basic CI/CD concepts (build, test, deploy).
- Model registry basics: versions, stages, tags, artifacts.
- Ability to read simple YAML/JSON and logs.
Concept explained simply
A registry keeps every model version. A stage is a movable pointer (e.g., "staging", "prod") to the version currently approved for that environment. Promotion moves that pointer after passing gates.
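To make the stage-pointer idea concrete, here is a minimal sketch of a toy in-memory registry. The class and method names are illustrative, not any real registry's API; real registries expose similar operations.
# Toy registry: versions are immutable records, stages are movable pointers.
class ToyModelRegistry:
    def __init__(self):
        self.versions = {}   # version number -> metadata dict
        self.stages = {}     # stage name -> version number (the pointer)
    def register(self, version, metadata):
        """Store an immutable version record; new versions start in dev."""
        self.versions[version] = metadata
        self.stages["dev"] = version
    def promote(self, version, to_stage):
        """Move the stage pointer; the artifact itself never changes."""
        if version not in self.versions:
            raise ValueError(f"unknown version {version}")
        previous = self.stages.get(to_stage)
        self.stages[to_stage] = version
        return previous      # remember this as the rollback target
    def rollback(self, stage, to_version):
        """Rollback is just another pointer move, back to a known-good version."""
        self.stages[stage] = to_version
# Usage: promotion and rollback never copy or rebuild the model.
registry = ToyModelRegistry()
registry.register(6, {"auc": 0.92})
registry.register(7, {"auc": 0.935})
registry.promote(6, "prod")               # prod -> v6
previous = registry.promote(7, "prod")    # prod -> v7; returns 6
registry.rollback("prod", previous)       # revert prod -> v6 if v7 misbehaves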
Mental model
Think of a museum with many paintings (model versions). Only a few are displayed in the main hall (Prod). Curators (gates + approvals) decide which painting gets in. If a display has issues, they quickly swap it with the previous one (rollback).
Typical gates from Dev → Staging
- Reproducible build: artifact checksum and environment lockfile recorded.
- Unit/integration tests pass for feature code and inference service.
- Data/feature schema compatibility with production feature store.
- Offline metrics vs. baseline exceed thresholds (e.g., AUC at least +0.02 over the baseline).
- Security scan: no secrets, dependencies vetted.
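Most of these gates reduce to comparing recorded values against a policy. A minimal sketch of the offline-metric gate, using the +0.02 AUC margin from the worked example below; function and field names are illustrative:
# Gate: the candidate must beat the baseline by a required margin.
def metric_gate(candidate_auc, baseline_auc, min_improvement=0.02):
    improvement = candidate_auc - baseline_auc
    return improvement >= min_improvement, improvement
passed, delta = metric_gate(candidate_auc=0.88, baseline_auc=0.85)
print(f"gate passed: {passed} (improvement {delta:+.3f})")  # True, +0.030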
Typical gates from Staging → Prod
- Human approval from model owner and a reviewer.
- Performance SLO checks on staging traffic (latency, error rate).
- Risk guardrails (fairness and drift within bounds; PII controls).
- Release plan: canary or shadow with rollback steps.
- Audit artifacts attached: dataset snapshot hash, training code commit, signed model.
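Checks like "signed model" and "dataset snapshot hash" only help if they are re-verified at promotion time. A minimal sketch of digest verification, assuming the artifact is a local file and the registry stored a sha256 digest (the path and digest below are placeholders):
import hashlib
def verify_artifact(path, recorded_digest):
    """Recompute the artifact's SHA-256 and compare it to the registry record."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    actual = f"sha256:{h.hexdigest()}"
    if actual != recorded_digest:
        raise RuntimeError(f"digest mismatch: {actual} != {recorded_digest}")
# verify_artifact("model.pkl", "sha256:abc...")  # raises before a tampered file ships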
Workflow: step-by-step
- Register: Push model artifact, metadata, metrics, lineage to the registry (stage=dev).
- Validate: Run automated checks; attach results to the version.
- Promote to Staging: If gates pass, move the stage pointer to this version.
- Evaluate on staging traffic: Shadow/canary; monitor SLOs and drift.
- Approve: Required reviewers sign off inside the registry.
- Promote to Prod: Update prod pointer; rollout plan (e.g., 10%→50%→100%).
- Monitor & Rollback: If SLOs are violated, revert prod pointer to last stable version.
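The last two steps can be combined into one controller. A minimal sketch of the promote-monitor-rollback loop; set_traffic_split and rollback are stand-ins for your serving and registry integrations, and check_slos is stubbed with the staging numbers from the record below:
import time
CANARY_STEPS = [0.10, 0.50, 1.00]   # mirror the 10% -> 50% -> 100% plan
SOAK_SECONDS = 30 * 60              # dwell time before judging each step
def check_slos():
    """Stub: in practice, query your monitoring system for live numbers."""
    return {"latency_p95_ms": 62, "error_rate_pct": 0.2}
def canary_rollout(set_traffic_split, rollback):
    for fraction in CANARY_STEPS:
        set_traffic_split(fraction)   # route this share of traffic to the candidate
        time.sleep(SOAK_SECONDS)      # let enough real traffic accumulate
        slos = check_slos()
        if slos["latency_p95_ms"] > 80 or slos["error_rate_pct"] > 1.0:
            rollback()                # revert the prod pointer to the last stable version
            return False              # promotion aborted
    return True                       # all steps healthy: candidate is fully live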
Example promotion record (generic YAML):
model: fraud_detector
version: 7
from_stage: staging
to_stage: prod
approvers:
  - owner@company
  - reviewer@company
checks:
  auc: 0.935
  latency_p95_ms: 62
  error_rate_pct: 0.2
  drift_psi: 0.07
status: approved
rollback_to: 6
artifacts:
  model_digest: "sha256:abc..."
  data_snapshot: s3://bucket/train-2024-09-01
  code_commit: git:1234abcd
Worked examples
Example 1: Dev → Staging with schema and metric gates
Model: churn_classifier v3
- Offline AUC: 0.88 vs baseline 0.85 (pass threshold +0.02)
- Feature schema diff: +1 optional feature; no removals (compatible)
- Unit/integration tests: pass
- Security scan: pass
Action: Promote to Staging. Attach validation report and dataset hash. Stage pointer moves to v3.
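The schema gate in this example can be automated as a simple diff. A sketch, assuming schemas are dicts mapping feature names to a spec with a required flag (the format is an assumption, not a standard):
def schema_compatible(prod_schema, candidate_schema):
    """Allow optional additions; block removals and new required features."""
    removed = set(prod_schema) - set(candidate_schema)
    added = set(candidate_schema) - set(prod_schema)
    required_added = {name for name in added
                      if candidate_schema[name].get("required", False)}
    return not removed and not required_added
prod = {"age": {"required": True}, "tenure": {"required": True}}
candidate = {**prod, "last_login_days": {"required": False}}  # +1 optional feature
print(schema_compatible(prod, candidate))  # True -> compatible, gate passes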
Example 2: Staging → Prod via canary and rollback
Model: fraud_detector v7 (staging)
- Staging shadow test: latency p95=62 ms (SLO <80 ms), error rate 0.2% (SLO <1%)
- Fairness: TPR difference < 3% across key segments (within policy)
- Approval: Owner + Reviewer signed
Action: Promote to Prod with 10% canary for 30 minutes; then 50% for 2 hours; then 100%. Monitoring detects no regressions. Finalize rollout.
Rollback path: If p95 > 80 ms for 5 minutes, revert prod pointer to v6 automatically.
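A sketch of that trigger, approximating "p95 above 80 ms for 5 minutes" as the p95 of a rolling 5-minute window of samples; the one-sample-per-second rate and the revert_prod_pointer hook are assumptions:
from collections import deque
SLO_P95_MS = 80
samples = deque(maxlen=300)   # ~5 minutes at one latency sample per second
def p95(values):
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]
def on_latency_sample(latency_ms, revert_prod_pointer):
    samples.append(latency_ms)
    if len(samples) == samples.maxlen and p95(samples) > SLO_P95_MS:
        revert_prod_pointer()   # e.g., move prod back to v6, the last stable version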
Example 3: Champion–Challenger in Prod
Champion: recommendation_model v21 (prod). Challenger: v22 (staging).
- Traffic split: A/B 90%/10% for 24 hours.
- KPIs: CTR +1.5% needed with no latency regressions.
- Outcome: Challenger meets KPIs and stability. Promote v22 to Prod; v21 remains as fallback.
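The decision at the end of such an experiment can be a small, auditable function. A sketch using the example's numbers; the input dicts are assumed to come from your A/B analysis job, and a real decision should also check statistical significance:
def promote_challenger(champion, challenger, min_ctr_lift_pct=1.5):
    """Promote only if the CTR lift clears the bar with no latency regression."""
    ctr_lift_pct = (challenger["ctr"] - champion["ctr"]) / champion["ctr"] * 100
    latency_ok = challenger["latency_p95_ms"] <= champion["latency_p95_ms"]
    return ctr_lift_pct >= min_ctr_lift_pct and latency_ok
champion = {"ctr": 0.0400, "latency_p95_ms": 70}    # v21
challenger = {"ctr": 0.0407, "latency_p95_ms": 69}  # v22: ~+1.75% CTR, no regression
print(promote_challenger(champion, challenger))     # True -> promote v22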
Common mistakes and self-check
- Skipping lineage: Fix: record data snapshot, code commit, params, environment lock.
- Only offline metrics: Fix: add online SLOs and at least a short canary.
- Mutable artifacts: Fix: store immutable, content-addressed artifacts; verify digests.
- No rollback plan: Fix: always note previous stable version and reversion criteria.
- Ignoring drift/fairness: Fix: include automated drift and fairness checks in gates.
Self-check questions:
- Can you name the exact gate checks for each stage transition?
- Do you know the immediate rollback target and trigger?
- Is the model artifact verifiably tied to its training data and code?
Practical projects
- Build a promotion pipeline that reads a policy.yaml and decides Dev → Staging automatically; requires manual approval for Staging → Prod.
- Create a drift monitor that blocks promotion when PSI >= 0.2 or KS p-value < 0.01.
- Implement a canary controller that updates stage pointers and writes a promotion log with result metrics.
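For the second project, PSI can be computed from binned fractions and the KS p-value taken from scipy. A minimal sketch with the thresholds above; the bin count and epsilon smoothing are arbitrary choices:
import numpy as np
from scipy.stats import ks_2samp
def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a new sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
def drift_gate(reference, candidate):
    """True -> safe to promote; False -> blocked by the drift policy."""
    blocked = (psi(reference, candidate) >= 0.2
               or ks_2samp(reference, candidate).pvalue < 0.01)
    return not blocked
rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 5000)
print(drift_gate(ref, rng.normal(0, 1, 5000)))  # same distribution: expect True
print(drift_gate(ref, rng.normal(1, 1, 5000)))  # shifted 1 sigma: False, blocked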
Exercises
These exercises mirror the tasks in your workspace. Complete them here, then submit your answers there.
Exercise 1 — Define your promotion policy
Write a minimal policy file that gates Dev → Staging and Staging → Prod. Include metric thresholds, SLOs, approvals, drift, and rollback criteria.
What to include
- Offline metric thresholds vs. a named baseline.
- Latency and error-rate SLOs for online checks.
- Required approvers (roles or emails).
- Drift/fairness limits.
- Rollback trigger and target.
Exercise 2 — Decide promote or block from a log
Given a fictional staging run log, decide whether to promote to Prod. State Promote or Block and list reasons.
Staging run log
{
"model": "claims_risk",
"version": 12,
"baseline_version": 10,
"offline": {"auc": 0.901, "baseline_auc": 0.892},
"online": {"latency_p95_ms": 95, "error_rate": 0.004},
"drift": {"psi": 0.18},
"fairness": {"tpr_gap": 0.05},
"approvals": ["owner@org"],
"required_approvals": 2
}
Checklist: before promoting
- Immutable artifact with digest recorded.
- Data snapshot and code commit attached to the registry version.
- Offline metrics exceed baseline thresholds.
- Staging SLOs met (latency, errors) under shadow/canary.
- Drift and fairness within policy limits.
- Approvals completed and logged.
- Rollback target identified and tested.
Learning path
- Model versioning and metadata →
- Registry stages and approvals →
- Automated validation and drift detection →
- Release strategies: shadow, canary, A/B →
- Observability and rollback automation
Next steps
- Turn your policy into automated gate checks in CI/CD.
- Add monitoring alerts tied to rollback conditions.
- Run a dry-run promotion and practice a rollback.
Mini challenge
You have prod v6 and want to promote v7 with +3% AUC but +12 ms latency at p95 (still under SLO). Draft a short promotion note: gates passed, rollout plan, and rollback criteria if CTR drops > 1% in 30 minutes.
Quick Test
Take the test to confirm your understanding.