Why this matters
As an Applied Scientist, your models only create value when they run reliably in production. Working smoothly with engineering ensures deployments are safe, fast, and maintainable. In this lesson you will learn to:
- Translate research into an engineering-ready spec (API, latency, throughput, hardware).
- Package models for CI/CD and environment parity.
- Plan rollouts, guardrails, monitoring, and rollback paths.
- Define data and feature contracts to prevent silent failures.
Concept explained simply
Deployment is turning your model into a dependable service or batch job that fits the company’s systems and reliability standards. You and engineering co-own the path from artifact to customer impact.
Mental model
- Contract: What the service does (inputs/outputs, SLOs, failure modes).
- Container: How it runs (dependencies, resources, reproducibility).
- Controls: How it’s observed and changed (metrics, alerts, config, rollback).
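To make the "Contract" and "Controls" pieces concrete, here is a minimal Python sketch; the field names, SLO numbers, and the moderation-style payload are illustrative assumptions, not a prescribed schema.

from typing import TypedDict

# Contract: what the service accepts and returns (illustrative field names).
class ScoreRequest(TypedDict):
    request_id: str
    text: str               # payload to score, e.g. a user comment

class ScoreResponse(TypedDict):
    request_id: str
    score: float             # model output in [0, 1]
    model_version: str       # lets callers audit which artifact served them

# Controls: targets the container is sized, monitored, and alerted against (assumed values).
SLO = {"p95_latency_ms": 80, "availability": "99.9%", "max_error_rate": 0.001}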
Key collaboration touchpoints
- Scoping & SLOs: Agree on latency/throughput/error targets, cost budgets, and privacy constraints.
- API & data contracts: Define inputs, outputs, versioning, and schema/stability for features and labels.
- Packaging: Container/image, model artifact format, dependency pinning, hardware needs (CPU/GPU).
- Testing: Unit, integration, load, cold-start; golden datasets and acceptance criteria.
- Release strategy: Canary, shadow, A/B; feature flags; rollback plans.
- Observability: Metrics (latency, errors, throughput), model metrics (drift, quality), logs, traces, alerts.
- Operations: On-call expectations, runbooks, retraining cadence, version deprecation policy.
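The data-contract touchpoint is the one most often skipped, so here is a hedged sketch of what "prevent silent failures" can look like in code; the column names and dtypes are assumptions standing in for your real feature contract.

import pandas as pd

# Hypothetical feature contract: column name -> expected pandas dtype.
FEATURE_CONTRACT = {"user_id": "int64", "price": "float64", "category": "object"}

def validate_features(df: pd.DataFrame) -> None:
    """Fail fast instead of silently scoring malformed data."""
    missing = set(FEATURE_CONTRACT) - set(df.columns)
    if missing:
        raise ValueError(f"Feature contract violation, missing columns: {sorted(missing)}")
    for col, expected_dtype in FEATURE_CONTRACT.items():
        if str(df[col].dtype) != expected_dtype:
            raise TypeError(f"Column {col!r} is {df[col].dtype}, contract expects {expected_dtype}")

In practice the same checks run in CI against a sample and in production behind an alert, so a schema change surfaces as a contract failure rather than as quietly degraded scores.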
Worked examples
Example 1: Real-time toxicity classifier under 80 ms p95
- Contract: POST /moderate, input: text up to 300 chars, output: score 0–1 + label; p95 latency ≤ 80 ms at 200 RPS; 99.9% availability.
- Packaging: ONNX model, Python runtime; dependencies pinned; CPU-optimized inference with quantization.
- Testing: Golden set of 1,000 comments; acceptance: AUC ≥ 0.92, latency ≤ 80 ms p95, error rate ≤ 0.1%.
- Release: Shadow for 3 days → Canary 5% traffic → 50% → 100% if no regressions.
- Observability: Metrics: latency, errors; model: score distribution, drift vs. baseline; alert if p95 > 80 ms for 15 min.
- Runbook: If latency spikes, auto-scale; if quality degrades, flip feature flag to previous version.
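A minimal serving-side sketch for Example 1, assuming an ONNX classifier with a single input named "input"; the file name, preprocessing, and 0.5 threshold are placeholders, and a real service would sit behind an HTTP framework that exposes POST /moderate.

import time
import numpy as np
import onnxruntime as ort

# "toxicity.onnx" and its input/output names stand in for your exported model.
session = ort.InferenceSession("toxicity.onnx", providers=["CPUExecutionProvider"])

def preprocess(text: str) -> np.ndarray:
    # Stand-in featurization; a real service would tokenize/embed the text.
    vec = np.zeros((1, 128), dtype=np.float32)
    for i, ch in enumerate(text[:128]):
        vec[0, i] = ord(ch) / 255.0
    return vec

def moderate(text: str) -> dict:
    start = time.perf_counter()
    outputs = session.run(None, {"input": preprocess(text)})
    score = float(outputs[0].ravel()[0])
    latency_ms = (time.perf_counter() - start) * 1000.0
    # Latency is returned so the "p95 > 80 ms for 15 min" alert can be driven from request logs.
    return {"score": score, "label": "toxic" if score >= 0.5 else "ok",
            "latency_ms": round(latency_ms, 2)}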
Example 2: Batch churn prediction nightly
- Contract: Daily job 02:00 UTC; input: yesterday’s features table; output: scores table by 04:00 UTC.
- Packaging: Spark job container; model in Parquet artifact; schema versioned.
- Testing: Integration test on 1% sample; SLA: finish in 90 min; checkpointing enabled.
- Release: Backfill last 7 days in staging; dry-run in prod writing to shadow table for 3 days.
- Observability: Duration, records processed, null rate, drift; alert on SLA miss or null rate > 0.5%.
- Runbook: On SLA miss, auto-increase executors; on schema change, trigger contract failure and pause job.
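A sketch of the contract checks the nightly job can run before scoring, assuming PySpark; the path, column names, and thresholds below are illustrative, not the actual pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("churn-nightly-scoring").getOrCreate()

EXPECTED_COLUMNS = {"customer_id", "tenure_days", "sessions_7d", "support_tickets_30d"}
features = spark.read.parquet("/warehouse/churn_features/latest")  # placeholder path

missing = EXPECTED_COLUMNS - set(features.columns)
if missing:
    # Runbook step: a schema change is a contract failure, so pause rather than score.
    raise RuntimeError(f"Feature contract broken, missing columns: {sorted(missing)}")

total = features.count()
nulls = features.filter(F.col("tenure_days").isNull()).count()
null_rate = nulls / total if total else 1.0
if null_rate > 0.005:  # the 0.5% alert threshold from the observability plan
    raise RuntimeError(f"Null rate {null_rate:.2%} exceeds contract threshold")

# Scoring (e.g. a pandas UDF applying the model) would follow here, writing to the
# shadow or production scores table depending on the rollout stage.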
Example 3: Embedding search with A/B experiment
- Contract: gRPC service getSimilar(items, k); p95 ≤ 120 ms at 100 RPS; index refresh every 6 hours.
- Packaging: GPU-enabled container for embedding generation; FAISS index persisted to object storage.
- Testing: Offline NDCG uplift ≥ +3%; online guardrail: CTR no more than 1% below baseline.
- Release: Canary 10% traffic behind flag; automatic rollback if error rate > 0.5% or latency p95 > 120 ms.
- Observability: Index freshness, recall on probes, GPU utilization, latency, CTR by variant.
- Runbook: If recall drops, rebuild index; if GPU saturation > 90%, scale replicas or switch to CPU fallbacks.
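A minimal FAISS sketch of the index side of Example 3, assuming embeddings are already computed; dimensions, data, and file names are placeholders.

import numpy as np
import faiss

dim, k = 64, 5
item_vectors = np.random.rand(10_000, dim).astype("float32")   # stand-in embeddings
faiss.normalize_L2(item_vectors)            # normalize so inner product = cosine similarity
index = faiss.IndexFlatIP(dim)
index.add(item_vectors)

query = item_vectors[:3]                    # pretend these are the request items
scores, neighbors = index.search(query, k)  # the core of getSimilar(items, k)

faiss.write_index(index, "items.faiss")     # artifact pushed to object storage on each refresh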
Engineering-ready spec template
- Purpose: What decision/action this model informs.
- API Contract: endpoint, request schema, response schema, error codes, versioning.
- SLOs/SLA: latency p95/p99, throughput, availability, cost budget.
- Model Artifact: format, size, checksum, training code ref, data snapshot ID, license/PII notes.
- Runtime: language, dependencies, hardware, concurrency model, warmup behavior.
- Testing: datasets, acceptance thresholds, load profile, failure injections.
- Observability: metrics, logs, traces, dashboards, alerts.
- Release Plan: environments, gates, canary criteria, rollback triggers.
- Operations: retraining cadence, feature store contracts, on-call, runbooks.
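As a hedged illustration only, a filled-in skeleton of this template for a hypothetical ranking scorer might look like the Python dict below; every value is an assumption to replace with your own numbers and links.

DEPLOYMENT_SPEC = {
    "purpose": "Rank candidate items shown on the home feed",
    "api": {"endpoint": "POST /v1/rank:score", "version": "1.2.0"},
    "slos": {"p95_latency_ms": 120, "throughput_rps": 150, "availability": "99.9%"},
    "artifact": {"format": "onnx", "sha256": "<checksum>", "training_commit": "<git ref>"},
    "runtime": {"python": "3.11", "hardware": "4 vCPU / 8 GiB", "warmup_requests": 50},
    "testing": {"golden_set": "ranking_golden_v3", "min_auc": 0.90, "load_rps": 200},
    "release": {"strategy": "shadow -> 5% canary -> 100%", "rollback": "feature flag"},
    "operations": {"retraining": "weekly", "oncall": "ml-platform", "runbook": "<link>"},
}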
Exercises
Work through these to practice deployment collaboration, then compare your answers with the sample solutions.
Exercise 1: Draft an API contract for a real-time ranking scorer
Scenario: The model scores up to 50 candidate items for a user, returning a score per item. Target latency: ≤ 120 ms p95 at 150 RPS. Include input/output schemas, error handling, and versioning.
- Deliverable: A concise API contract (request/response JSON), SLOs, and error codes.
- Timebox: 15 minutes.
Sample solution
{
  "endpoint": "POST /v1/rank:score",
  "request": {
    "user_id": "string",
    "context": {"locale": "string", "ts": "iso8601"},
    "candidates": [
      {"item_id": "string", "features": {"price": "number", "category": "string"}}
    ],
    "request_id": "string",
    "model_version": "semver"
  },
  "response": {
    "scores": [{"item_id": "string", "score": "number"}],
    "served_model_version": "semver",
    "latency_ms": "number",
    "request_id": "string"
  },
  "limits": {"max_candidates": 50, "payload_kb": 64},
  "slos": {"p95_latency_ms": 120, "availability": "99.9%", "error_rate": "<0.1%"},
  "errors": [
    {"code": "INVALID_ARGUMENT", "http": 400, "reason": "schema mismatch or too many candidates"},
    {"code": "RESOURCE_EXHAUSTED", "http": 429, "reason": "rate limit"},
    {"code": "INTERNAL", "http": 500, "reason": "unexpected error"}
  ],
  "versioning": {"request": "backward-compatible for 6 months", "deprecation": "warning header + docs"}
}
Exercise 2: Write a safe rollout plan
Scenario: You’re replacing a v1 fraud model with v2. List canary steps, metrics to watch, thresholds, and rollback triggers. Include shadow testing if useful.
- Deliverable: A 6–8 step rollout with clear gates.
- Timebox: 15 minutes.
Sample solution
- Shadow test v2 for 3 days on 100% traffic, no user impact; compare precision/recall and latency to v1 on matched requests.
- Gate 1: Shadow precision no more than 1% below v1, recall ≥ v1, p95 latency ≤ v1 + 10 ms.
- Canary 5% traffic under feature flag for 24 hours; monitor error rate, p95 latency, fraud capture rate, false positive rate.
- Gate 2: Error rate < 0.2%, p95 latency < 120 ms, capture rate ≥ v1, FPR ≤ v1 + 0.2%.
- Ramp to 25% for 24 hours with the same guardrails plus business KPIs (chargeback rate).
- Gate 3: No negative drift; alert-free window 12 hours.
- Ramp to 100%; keep flag for 7 days for emergency rollback.
- Post-deploy review and tag v1 as deprecated; schedule removal in 30 days.
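The gates above are easiest to enforce when automated; here is a sketch of Gate 2 as a simple check, assuming the metric names below are pulled from your monitoring system.

# Thresholds mirror Gate 2 in the plan above; the metric dicts are illustrative.
def gate_2_passes(canary: dict, baseline: dict) -> bool:
    return (
        canary["error_rate"] < 0.002
        and canary["p95_latency_ms"] < 120
        and canary["capture_rate"] >= baseline["capture_rate"]
        and canary["false_positive_rate"] <= baseline["false_positive_rate"] + 0.002
    )

# Promote only when the gate holds; otherwise flip the feature flag back to v1.
canary_metrics = {"error_rate": 0.001, "p95_latency_ms": 95,
                  "capture_rate": 0.83, "false_positive_rate": 0.011}
v1_metrics = {"capture_rate": 0.81, "false_positive_rate": 0.010}
action = "ramp to 25%" if gate_2_passes(canary_metrics, v1_metrics) else "rollback to v1"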
Deployment readiness checklist
- API contract reviewed and approved by engineering
- Dependencies pinned and container built deterministically
- Golden tests pass (functional + load)
- Canary + rollback plan documented
- Metrics, dashboards, and alerts configured
- Data/feature contracts validated in staging
- Runbook and on-call ownership confirmed
Common mistakes and self-check
- Vague SLOs: Without p95/p99 targets, capacity planning fails. Self-check: Are latency and throughput numbers explicit?
- Unpinned dependencies: Leads to different behavior across environments. Self-check: Are versions locked and reproducible?
- No rollback path: Slows incident response. Self-check: Can you revert with a flag or version flip within minutes?
- Missing feature/data contracts: Schema drift causes silent failures. Self-check: Are schema checks and alerts in place?
- Observability gaps: You can’t fix what you can’t see. Self-check: Can you view latency, errors, and model metrics in one dashboard?
- Overfitting to offline metrics: Online behavior differs. Self-check: Is there a controlled rollout with guardrails?
Who this is for
- Applied Scientists moving models from research to production.
- Data Scientists collaborating with platform/ML engineers.
- Engineers seeking clearer model deployment specs.
Prerequisites
- Basic containerization knowledge (images, dependencies).
- Familiarity with metrics (latency percentiles, error rates).
- Understanding of your model’s inputs/outputs and performance metrics.
Learning path
- Draft a minimal API + SLO spec for your current model.
- Package the model with pinned dependencies; verify reproducible builds.
- Add golden tests and a small load test plan.
- Define canary steps and rollback triggers.
- Implement metrics and alerts; run a staging dry-run.
Practical projects
- Create a toy recommendation API that serves scores for 10 items; deploy locally with a canary switch and latency logging.
- Convert a notebook model into a container with a reproducible build; add a golden test suite.
- Set up a drift detector for a batch model using population stats and alert when PSI exceeds a threshold (a PSI sketch follows this list).
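A compact PSI implementation you could start the drift-detector project from; the bin count and the 0.2 alert threshold are common rules of thumb, not requirements.

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and a new sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                 # catch out-of-range values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)                    # avoid log(0) and division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Common rule of thumb (an assumption, tune for your data): alert when PSI > 0.2.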
Next steps
- Apply the spec template to your real project.
- Review it with an engineer; refine based on feedback.
- Run a staging deployment rehearsal using the checklist.
Mini challenge
Your model’s p95 latency in staging is 2x higher than expected. List three hypotheses and the fastest validation for each.
Example approaches
- Cold start/warmup missing → Add warmup endpoint; re-measure steady-state.
- Dependency causing overhead → Profile with timing logs; remove/replace hotspot.
- Batch size mismatch → Tune concurrency/batching; compare p95 under target RPS.
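Whichever hypothesis you test first, you need a consistent way to re-measure p95; a rough harness, assuming a callable client, might look like this.

import statistics
import time

def measure_p95(call, n: int = 200, warmup: int = 20) -> float:
    """Rough p95 latency in ms for a callable, with warmup requests excluded."""
    for _ in range(warmup):
        call()
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.quantiles(samples, n=20)[18]  # 95th percentile cut point

# Example (hypothetical client): measure_p95(lambda: client.score(sample_payload))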