Why this skill matters for MLOps Engineers
Latency and throughput optimization
- Concurrency: run multiple workers or threads; make sure the model is thread-safe, or use process-based workers.
- Batching: group small requests to amortize per-request overhead (timer-based micro-batching; see the sketch after this list).
- Warmup: load the model at startup and run a dummy prediction to trigger JIT compilation and initialize kernels.
- Numerics: use inference-optimized runtimes (e.g., ONNX Runtime, TensorRT) or quantized weights when accuracy allows.
- I/O: validate inputs efficiently; avoid redundant JSON parsing; prefer gRPC for high-throughput internal calls.
- Resource sizing: right-size CPU/GPU and autoscaling policies around p95/p99 SLOs.
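A minimal sketch of the timer-based micro-batching idea from the list above; `model` is a placeholder for any object with a list-in, list-out `predict`, and the 20 ms window and batch size are illustrative choices.

```python
import queue
import threading
import time
from concurrent.futures import Future

WINDOW_S = 0.02   # collect requests for up to 20 ms
MAX_BATCH = 32

_requests = queue.Queue()  # items are (features, Future) pairs

def _batch_worker(model):
    """Drain the queue in small time windows and score one batch per window."""
    while True:
        batch = [_requests.get()]                  # block until the first request arrives
        deadline = time.monotonic() + WINDOW_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(_requests.get(timeout=remaining))
            except queue.Empty:
                break
        rows = [features for features, _ in batch]
        predictions = model.predict(rows)          # one model call amortizes per-request overhead
        for (_, fut), pred in zip(batch, predictions):
            fut.set_result(pred)

def start_batcher(model):
    threading.Thread(target=_batch_worker, args=(model,), daemon=True).start()

def predict(features):
    """Called from each request handler; blocks until the batched result is ready."""
    fut = Future()
    _requests.put((features, fut))
    return fut.result(timeout=1.0)
```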
Debugging slow p95 latency
- Check for cold starts (an autoscaler that scales down too aggressively forces requests onto freshly started replicas that are still loading the model).
- Profile CPU vs I/O wait; add connection pooling.
- Confirm batching window not too large.
- Pin BLAS/OpenMP thread counts to avoid thrashing.
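One way to pin those thread counts, assuming a NumPy/scikit-learn stack; which variables actually matter depends on the BLAS build your image ships.

```python
# Pin math-library thread pools before importing numpy/scipy/sklearn, so each
# worker process does not spawn its own full set of BLAS/OpenMP threads.
import os

for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ.setdefault(var, "1")   # one thread per worker; scale with processes instead

import numpy as np  # imported after the env vars so the limits take effect
```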
Monitoring, canary, A/B, and rollback
- Versioned endpoints: keep /v1 stable while developing /v2; deprecate with a timetable.
- Canary: send 1–10% of traffic to the new version; compare error rate and latency to the baseline (see the routing sketch after this list).
- A/B: holdout-based experimentation; measure business metrics with controlled cohorts.
- Rollback strategy: one switch (config/flag) to route all traffic back; keep last N model artifacts.
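A sketch of the single-switch idea: one config value controls the canary fraction, and rollback is just setting it back to zero. The constant name and the opt-in header are assumptions.

```python
import random

CANARY_FRACTION = 0.05   # assumption: read from config or an env var in a real service

def route(headers: dict) -> str:
    """Decide which model version serves this request."""
    if headers.get("X-Canary") == "v2":                    # explicit opt-in for testing
        return "v2"
    return "v2" if random.random() < CANARY_FRACTION else "v1"

# Rollback: set CANARY_FRACTION to 0.0 (one config change) and all traffic returns to v1.
```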
Operational checklist
- Health probes: /health for readiness plus startup checks (see the readiness sketch after this checklist).
- Idempotency: safe retries with request IDs.
- Observability: logs with correlation IDs; metrics for QPS, p95, error rates; traces for hotspots.
- Security: input validation, auth where needed, and resource limits.
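A readiness sketch using FastAPI (the framework choice, `load_model`, and `dummy_input` are assumptions): the probe fails until the model is loaded and warmed up.

```python
from fastapi import FastAPI, Response

app = FastAPI()
_model = None  # loaded once at startup, never per request

@app.on_event("startup")
def load_and_warm() -> None:
    global _model
    model = load_model("model.joblib")   # placeholder artifact loader
    model.predict(dummy_input())         # warmup call initializes lazy state/kernels
    _model = model                       # publish only after warmup succeeds

@app.get("/health")
def health(response: Response):
    if _model is None:
        response.status_code = 503       # readiness probe should fail while loading
        return {"status": "loading"}
    return {"status": "ok"}
```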
Drills and exercises
- Create a Docker image under 300 MB for a small model by trimming dependencies.
- Add a /metadata endpoint that returns the model version, created_at, and git commit (see the sketch after these drills).
- Implement request validation that rejects malformed arrays with clear messages.
- Convert a REST endpoint to gRPC and compare average latency under load.
- Add a simple dynamic batching layer: queue requests for up to 20 ms, then score together.
- Simulate a canary: route 5% of traffic to v2 using a header or config.
- Practice rollback: switch all traffic back and verify no errors or stale caches.
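One possible shape for the /metadata drill, again using FastAPI; the env var names are illustrative, with a git fallback for local runs.

```python
import os
import subprocess
from fastapi import FastAPI

app = FastAPI()

def _git_commit() -> str:
    # Prefer a value baked into the image at build time; fall back to git locally.
    commit = os.environ.get("GIT_COMMIT")
    if commit:
        return commit
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
    except Exception:
        return "unknown"

@app.get("/metadata")
def metadata():
    return {
        "model_version": os.environ.get("MODEL_VERSION", "v1"),
        "created_at": os.environ.get("MODEL_CREATED_AT", "unknown"),
        "git_commit": _git_commit(),
    }
```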
Mini project: Online + Batch serving with safe rollout
Build a minimal ML service that supports both online predictions (REST) and a batch scoring job, with versioned endpoints and a canary switch.
- Train a small classifier and save a pipeline artifact.
- REST service: /health, /predict (JSON input), /metadata.
- Batch job: read CSV, write predictions CSV, emit run logs.
- Versioning: expose /v1 and /v2; v2 changes the model or thresholds (see the router sketch after this list).
- Canary: route 5% of requests to /v2 via a config flag/header.
- Rollback: single config toggle to send 100% back to /v1.
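A sketch of the /v1 and /v2 layout with FastAPI routers; `load_model` and the artifact names are placeholders, and predictions from numpy-backed models usually need casting to plain Python types before JSON serialization.

```python
from fastapi import APIRouter, FastAPI

app = FastAPI()

def make_router(version: str, model) -> APIRouter:
    router = APIRouter(prefix=f"/{version}")

    @router.post("/predict")
    def predict(payload: dict):
        pred = model.predict([payload["features"]])[0]
        return {"version": version, "prediction": float(pred)}  # cast numpy scalar for JSON

    return router

app.include_router(make_router("v1", load_model("model_v1.joblib")))  # placeholder loader
app.include_router(make_router("v2", load_model("model_v2.joblib")))
```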
Acceptance criteria
- p95 latency under 150 ms on a small test set (local run).
- On failure injection, rollback restores normal error rate within 1 minute.
- Batch job handles 1M rows with chunking and constant memory.
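For the last criterion, a pandas-based sketch of constant-memory chunked scoring; paths, the chunk size, and the assumption that the pipeline accepts a raw DataFrame are illustrative.

```python
import pandas as pd

CHUNK_ROWS = 50_000   # tune so one chunk fits comfortably in memory

def score_file(model, in_path: str, out_path: str) -> None:
    first = True
    for chunk in pd.read_csv(in_path, chunksize=CHUNK_ROWS):
        chunk["prediction"] = model.predict(chunk)
        chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False
```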
Common mistakes and debugging tips
- Training-serving skew: fix by serializing the exact preprocessing pipeline together with the model (see the pipeline sketch after this list).
- Unpinned dependencies: always pin versions; rebuild images when upgrading.
- Single-thread bottlenecks: increase workers; avoid heavy global locks.
- No health/readiness checks: add startup warmup and readiness gates.
- Ignoring backpressure: configure timeouts, queues, and autoscaling policies.
- Canary without metrics: define success metrics before sending real traffic.
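A common way to address the skew item above: ship preprocessing and model as one serialized scikit-learn Pipeline, so serving cannot drift from training. The estimator and scaler below are just examples.

```python
from joblib import dump, load
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),        # fitted on training data only
    ("clf", LogisticRegression()),
])
# pipeline.fit(X_train, y_train)         # training step elided in this sketch
dump(pipeline, "model.joblib")           # one artifact carries transforms + model

served = load("model.joblib")            # at inference: same transforms, same model
# served.predict(X_new)
```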
Quick debug playbook
- 503/timeout spikes: check readiness gates and dependency services.
- High CPU: profile preprocessing (see the profiling sketch after this list); consider vectorization or compiled libraries.
- Memory leaks: ensure model objects aren’t reloaded per request; watch object pools.
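For the high-CPU case, a standard-library profiling sketch; `preprocess`, `model`, and `sample_payload` are placeholders for your own code.

```python
import cProfile
import pstats

def handle_request(payload):
    features = preprocess(payload)       # placeholder preprocessing
    return model.predict([features])     # placeholder model call

profiler = cProfile.Profile()
profiler.enable()
handle_request(sample_payload)           # run one representative request
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)  # top 15 hotspots
```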
Subskills
- Model Artifact Packaging — create reproducible, portable model bundles and images.
- Serving Patterns: Online and Batch — choose REST/gRPC for real-time serving or containers/jobs for offline scoring.
- REST and gRPC API Design — define contracts, validation, and backward compatibility.
- Preprocessing at Inference — guarantee the same transforms as in training.
- Latency and Throughput Optimization — tune workers, batching, warmup, and sizing.
- Versioned Endpoints — run /v1 and /v2 safely and deprecate with a plan.
- Canary and A/B Releases — measure impact with controlled traffic splits.
- Rollback Strategy — revert quickly with a single change.
Next steps
- Harden observability: add structured logs, metrics (QPS, p95, errors), and traces.
- Add CI for images and smoke tests against /health and /predict (see the sketch after this list).
- Introduce load testing to validate SLOs before production.
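A minimal CI smoke test sketch with pytest and requests, assuming the service is already running locally on port 8000; the URL and payload shape are assumptions.

```python
import requests

BASE = "http://localhost:8000"

def test_health():
    assert requests.get(f"{BASE}/health", timeout=5).status_code == 200

def test_predict():
    resp = requests.post(f"{BASE}/predict", json={"features": [0.1, 0.2, 0.3]}, timeout=5)
    assert resp.status_code == 200
    assert "prediction" in resp.json()
```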
Tip: Keep artifacts audit-ready
Store model, code commit, training data hash, and parameters together. This speeds incident response and compliance reviews.
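One lightweight way to do this is to write a manifest next to the artifact; the file names, fields, and parameters below are illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

manifest = {
    "model_artifact": "model.joblib",
    "model_sha256": sha256_of("model.joblib"),
    "training_data_sha256": sha256_of("train.csv"),
    "git_commit": "<fill from CI>",
    "params": {"C": 1.0, "max_iter": 200},     # example hyperparameters
    "created_at": datetime.now(timezone.utc).isoformat(),
}

with open("model_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```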