Why this matters
Stage promotion (Dev → Staging → Prod) makes ML releases repeatable, safe, and auditable. As an MLOps Engineer you will:
- Gate model releases with measurable checks (accuracy, latency, fairness, cost).
- Promote versions in a model registry with approvals and signed artifacts.
- Run canary/shadow deployments and roll back quickly if needed.
- Keep lineage: which data, code, and config produced the model now in Prod.
Who this is for
- Engineers and data scientists deploying models beyond notebooks.
- MLOps practitioners creating reliable, auditable model release pipelines.
Prerequisites
- Basic CI/CD concepts (build, test, deploy).
- Model registry basics: versions, stages, tags, artifacts.
- Ability to read simple YAML/JSON and logs.
Concept explained simply
A registry keeps every model version. A stage is a movable pointer (e.g., "staging", "prod") to the version currently approved for that environment. Promotion moves that pointer after passing gates.
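To make the stage-pointer idea concrete, here is a minimal sketch of a toy in-memory registry. The class and method names are illustrative, not any real registry's API; real registries expose similar operations.
# Toy registry: versions are immutable records, stages are movable pointers.
class ToyModelRegistry:
    def __init__(self):
        self.versions = {}   # version number -> metadata dict
        self.stages = {}     # stage name -> version number (the pointer)
    def register(self, version, metadata):
        """Store an immutable version record; new versions start in dev."""
        self.versions[version] = metadata
        self.stages["dev"] = version
    def promote(self, version, to_stage):
        """Move the stage pointer; the artifact itself never changes."""
        if version not in self.versions:
            raise ValueError(f"unknown version {version}")
        previous = self.stages.get(to_stage)
        self.stages[to_stage] = version
        return previous      # remember this as the rollback target
    def rollback(self, stage, to_version):
        """Rollback is just another pointer move, back to a known-good version."""
        self.stages[stage] = to_version
# Usage: promotion and rollback never copy or rebuild the model.
registry = ToyModelRegistry()
registry.register(6, {"auc": 0.92})
registry.register(7, {"auc": 0.935})
registry.promote(6, "prod")               # prod -> v6
previous = registry.promote(7, "prod")    # prod -> v7; returns 6
registry.rollback("prod", previous)       # revert prod -> v6 if v7 misbehaves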
Mental model
Think of a museum with many paintings (model versions). Only a few are displayed in the main hall (Prod). Curators (gates + approvals) decide which painting gets in. If a display has issues, they quickly swap it with the previous one (rollback).
Typical gates from Dev → Staging
- Reproducible build: artifact checksum and environment lockfile recorded.
- Unit/integration tests pass for feature code and inference service.
- Data/feature schema compatibility with production feature store.
- Offline metrics vs. baseline exceed thresholds (e.g., AUC at least +0.02 over the baseline).
- Security scan: no secrets, dependencies vetted.
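Most of these gates reduce to comparing recorded values against a policy. A minimal sketch of the offline-metric gate, using the +0.02 AUC margin from the worked example below; function and field names are illustrative:
# Gate: the candidate must beat the baseline by a required margin.
def metric_gate(candidate_auc, baseline_auc, min_improvement=0.02):
    improvement = candidate_auc - baseline_auc
    return improvement >= min_improvement, improvement
passed, delta = metric_gate(candidate_auc=0.88, baseline_auc=0.85)
print(f"gate passed: {passed} (improvement {delta:+.3f})")  # True, +0.030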
Typical gates from Staging → Prod
- Human approval from model owner and a reviewer.
- Performance SLO checks on staging traffic (latency, error rate).
- Risk guardrails (fairness and drift within bounds; PII controls).
- Release plan: canary or shadow with rollback steps.
- Audit artifacts attached: dataset snapshot hash, training code commit, signed model.
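Checks like "signed model" and "dataset snapshot hash" only help if they are re-verified at promotion time. A minimal sketch of digest verification, assuming the artifact is a local file and the registry stored a sha256 digest (the path and digest below are placeholders):
import hashlib
def verify_artifact(path, recorded_digest):
    """Recompute the artifact's SHA-256 and compare it to the registry record."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    actual = f"sha256:{h.hexdigest()}"
    if actual != recorded_digest:
        raise RuntimeError(f"digest mismatch: {actual} != {recorded_digest}")
# verify_artifact("model.pkl", "sha256:abc...")  # raises before a tampered file ships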
Workflow: step-by-step
- Register: Push model artifact, metadata, metrics, lineage to the registry (stage=dev).
- Validate: Run automated checks; attach results to the version.
- Promote to Staging: If gates pass, move the stage pointer to this version.
- Evaluate on staging traffic: Shadow/canary; monitor SLOs and drift.
- Approve: Required reviewers sign off inside the registry.
- Promote to Prod: Update prod pointer; rollout plan (e.g., 10%→50%→100%).
- Monitor & Rollback: If SLOs are violated, revert prod pointer to last stable version.
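The last two steps can be combined into one controller. A minimal sketch of the promote-monitor-rollback loop; set_traffic_split and rollback are stand-ins for your serving and registry integrations, and check_slos is stubbed with the staging numbers from the record below:
import time
CANARY_STEPS = [0.10, 0.50, 1.00]   # mirror the 10% -> 50% -> 100% plan
SOAK_SECONDS = 30 * 60              # dwell time before judging each step
def check_slos():
    """Stub: in practice, query your monitoring system for live numbers."""
    return {"latency_p95_ms": 62, "error_rate_pct": 0.2}
def canary_rollout(set_traffic_split, rollback):
    for fraction in CANARY_STEPS:
        set_traffic_split(fraction)   # route this share of traffic to the candidate
        time.sleep(SOAK_SECONDS)      # let enough real traffic accumulate
        slos = check_slos()
        if slos["latency_p95_ms"] > 80 or slos["error_rate_pct"] > 1.0:
            rollback()                # revert the prod pointer to the last stable version
            return False              # promotion aborted
    return True                       # all steps healthy: candidate is fully live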
Example promotion record (generic YAML):
model: fraud_detector
version: 7
from_stage: staging
to_stage: prod
approvers:
  - owner@company
  - reviewer@company
checks:
  auc: 0.935
  latency_p95_ms: 62
  error_rate_pct: 0.2
  drift_psi: 0.07
status: approved
rollback_to: 6
artifacts:
  model_digest: "sha256:abc..."
  data_snapshot: s3://bucket/train-2024-09-01
  code_commit: git:1234abcd
Worked examples
Example 1: Dev → Staging with schema and metric gates
Model: churn_classifier v3
- Offline AUC: 0.88 vs baseline 0.85 (pass threshold +0.02)
- Feature schema diff: +1 optional feature; no removals (compatible)
- Unit/integration tests: pass
- Security scan: pass
Action: Promote to Staging. Attach validation report and dataset hash. Stage pointer moves to v3.
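The schema gate in this example can be automated as a simple diff. A sketch, assuming schemas are dicts mapping feature names to a spec with a required flag (the format is an assumption, not a standard):
def schema_compatible(prod_schema, candidate_schema):
    """Allow optional additions; block removals and new required features."""
    removed = set(prod_schema) - set(candidate_schema)
    added = set(candidate_schema) - set(prod_schema)
    required_added = {name for name in added
                      if candidate_schema[name].get("required", False)}
    return not removed and not required_added
prod = {"age": {"required": True}, "tenure": {"required": True}}
candidate = {**prod, "last_login_days": {"required": False}}  # +1 optional feature
print(schema_compatible(prod, candidate))  # True -> compatible, gate passes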
Example 2: Staging → Prod via canary and rollback
Model: fraud_detector v7 (staging)
- Staging shadow test: latency p95=62 ms (SLO <80 ms), error rate 0.2% (SLO <1%)
- Fairness: TPR difference < 3% across key segments (within policy)
- Approval: Owner + Reviewer signed
Action: Promote to Prod with 10% canary for 30 minutes; then 50% for 2 hours; then 100%. Monitoring detects no regressions. Finalize rollout.
Rollback path: If p95 > 80 ms for 5 minutes, revert prod pointer to v6 automatically.
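A sketch of that trigger, approximating "p95 above 80 ms for 5 minutes" as the p95 of a rolling 5-minute window of samples; the one-sample-per-second rate and the revert_prod_pointer hook are assumptions:
from collections import deque
SLO_P95_MS = 80
samples = deque(maxlen=300)   # ~5 minutes at one latency sample per second
def p95(values):
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]
def on_latency_sample(latency_ms, revert_prod_pointer):
    samples.append(latency_ms)
    if len(samples) == samples.maxlen and p95(samples) > SLO_P95_MS:
        revert_prod_pointer()   # e.g., move prod back to v6, the last stable version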
Example 3: Champion–Challenger in Prod
Champion: recommendation_model v21 (prod). Challenger: v22 (staging).
- Traffic split: A/B 90%/10% for 24 hours.
- KPIs: CTR +1.5% needed with no latency regressions.
- Outcome: Challenger meets KPIs and stability. Promote v22 to Prod; v21 remains as fallback.
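The decision at the end of such an experiment can be a small, auditable function. A sketch using the example's numbers; the input dicts are assumed to come from your A/B analysis job, and a real decision should also check statistical significance:
def promote_challenger(champion, challenger, min_ctr_lift_pct=1.5):
    """Promote only if the CTR lift clears the bar with no latency regression."""
    ctr_lift_pct = (challenger["ctr"] - champion["ctr"]) / champion["ctr"] * 100
    latency_ok = challenger["latency_p95_ms"] <= champion["latency_p95_ms"]
    return ctr_lift_pct >= min_ctr_lift_pct and latency_ok
champion = {"ctr": 0.0400, "latency_p95_ms": 70}    # v21
challenger = {"ctr": 0.0407, "latency_p95_ms": 69}  # v22: ~+1.75% CTR, no regression
print(promote_challenger(champion, challenger))     # True -> promote v22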
Common mistakes and self-check
- Skipping lineage: Fix: record data snapshot, code commit, params, environment lock.
- Only offline metrics: Fix: add online SLOs and at least a short canary.
- Mutable artifacts: Fix: store immutable, content-addressed artifacts; verify digests.
- No rollback plan: Fix: always note previous stable version and reversion criteria.
- Ignoring drift/fairness: Fix: include automated drift and fairness checks in gates.
Self-check questions:
- Can you name the exact gate checks for each stage transition?
- Do you know the immediate rollback target and trigger?
- Is the model artifact verifiably tied to its training data and code?
Practical projects
- Build a promotion pipeline that reads a policy.yaml and decides Dev → Staging automatically; requires manual approval for Staging → Prod.
- Create a drift monitor that blocks promotion when PSI >= 0.2 or KS p-value < 0.01.
- Implement a canary controller that updates stage pointers and writes a promotion log with result metrics.
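For the second project, PSI can be computed from binned fractions and the KS p-value taken from scipy. A minimal sketch with the thresholds above; the bin count and epsilon smoothing are arbitrary choices:
import numpy as np
from scipy.stats import ks_2samp
def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a new sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
def drift_gate(reference, candidate):
    """True -> safe to promote; False -> blocked by the drift policy."""
    blocked = (psi(reference, candidate) >= 0.2
               or ks_2samp(reference, candidate).pvalue < 0.01)
    return not blocked
rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 5000)
print(drift_gate(ref, rng.normal(0, 1, 5000)))  # same distribution: expect True
print(drift_gate(ref, rng.normal(1, 1, 5000)))  # shifted 1 sigma: False, blocked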
Exercises
These exercises mirror the tasks in your workspace. Complete them here, then submit your answers there.
Exercise 1 — Define your promotion policy
Write a minimal policy file that gates Dev → Staging and Staging → Prod. Include metric thresholds, SLOs, approvals, drift, and rollback criteria.
What to include
- Offline metric thresholds vs. a named baseline.
- Latency and error-rate SLOs for online checks.
- Required approvers (roles or emails).
- Drift/fairness limits.
- Rollback trigger and target.
Exercise 2 — Decide promote or block from a log
Given a fictional staging run log, decide whether to promote to Prod. State Promote or Block and list reasons.
Staging run log
{
"model": "claims_risk",
"version": 12,
"baseline_version": 10,
"offline": {"auc": 0.901, "baseline_auc": 0.892},
"online": {"latency_p95_ms": 95, "error_rate": 0.004},
"drift": {"psi": 0.18},
"fairness": {"tpr_gap": 0.05},
"approvals": ["owner@org"],
"required_approvals": 2
}
Checklist: before promoting
- Immutable artifact with digest recorded.
- Data snapshot and code commit attached to the registry version.
- Offline metrics exceed baseline thresholds.
- Staging SLOs met (latency, errors) under shadow/canary.
- Drift and fairness within policy limits.
- Approvals completed and logged.
- Rollback target identified and tested.
Learning path
- Model versioning and metadata →
- Registry stages and approvals →
- Automated validation and drift detection →
- Release strategies: shadow, canary, A/B →
- Observability and rollback automation
Next steps
- Turn your policy into automated gate checks in CI/CD.
- Add monitoring alerts tied to rollback conditions.
- Run a dry-run promotion and practice a rollback.
Mini challenge
You have prod v6 and want to promote v7 with +3% AUC but +12 ms latency at p95 (still under SLO). Draft a short promotion note: gates passed, rollout plan, and rollback criteria if CTR drops > 1% in 30 minutes.
Quick Test
Take the test to confirm your understanding.