Why this matters
As an MLOps Engineer, you will repeatedly choose between online (real-time) and batch (offline) inference, and hybrids of the two; the serving pattern you pick determines reliability, cost, and user experience. These choices affect SLAs, infrastructure, data freshness, monitoring, and how fast you can ship improvements.
- Product impact: low-latency recommendations, fraud checks, search ranking.
- Operations impact: nightly scoring jobs, weekly reports, backfills.
- Risk control: blue/green, canary, shadow, rollbacks.
Who this is for
- MLOps Engineers and ML Engineers moving models to production.
- Data Engineers supporting inference pipelines.
- Software Engineers integrating ML endpoints.
Prerequisites
- Know how to package a model (e.g., in a container) and expose a simple API or job.
- Basic understanding of message queues, schedulers, and autoscaling.
- Familiarity with metrics and logging (latency, throughput, errors).
Concept explained simply
Serving patterns decide when, how fast, and how your model makes predictions.
- Online (real-time): request comes in, model predicts now. Typical latency target: milliseconds to seconds.
- Async online: request is queued, prediction delivered later (notifications, callback, or polling).
- Streaming/micro-batch: continuous data, predictions processed in small batches (e.g., every few seconds).
- Batch: large dataset processed on a schedule (hourly, nightly, weekly), results stored for later use.
- Hybrid: bulk precompute (batch) + quick online re-ranking or filtering.
Mental model: Think of a restaurant.
- Online = à la carte: cook per order, low wait time, higher per-plate cost.
- Batch = catering: prepare many meals at once, cheaper per portion, not instant.
- Hybrid = cater most, then finish à la minute for freshness.
How to choose a serving pattern
- Latency/SLA: Do users need the answer within 100 ms? Choose online synchronous. If minutes are fine, async or batch.
- Throughput and cost: High QPS with moderate latency tolerance can use request batching, caches, or async.
- Feature freshness: If features require expensive joins, precompute with batch and serve final adjustments online.
- Error tolerance: If retries are fine and exact timing isn’t critical, batch or async fits.
- Data delivery mode: Event streams suggest streaming/micro-batch; static tables suggest batch.
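To make these trade-offs concrete, here is a toy decision helper. The thresholds (200 ms, 10 s, 300 s) are illustrative assumptions, not industry rules:

```python
def suggest_pattern(latency_tolerance_s: float,
                    user_is_waiting: bool,
                    data_arrives_as_stream: bool) -> str:
    """Illustrative heuristic; tune the thresholds for your own system."""
    if user_is_waiting and latency_tolerance_s <= 0.2:
        return "online synchronous"
    if data_arrives_as_stream and latency_tolerance_s <= 10:
        return "streaming / micro-batch"
    if latency_tolerance_s <= 300:           # seconds-to-minutes tolerance
        return "asynchronous online (queue)"
    return "batch (scheduled)"

print(suggest_pattern(0.1, True, False))     # online synchronous
print(suggest_pattern(3600, False, False))   # batch (scheduled)
```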
Key patterns (with triggers)
Synchronous online (request-response)
- Trigger: sub-200 ms latency target; user-facing action (search, checkout risk).
- Shape: Client -> API Gateway -> Model Service -> Feature Store/Cache -> Response.
- Notes: Add request batching (e.g., combine small requests), caching, and autoscaling.
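A minimal sketch of this shape, assuming FastAPI; the feature lookup and model are stubs you would replace with a feature-store client and a loaded model:

```python
import logging
import time
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
app = FastAPI()

def get_features(user_id: str) -> list[float]:
    # Stub for a feature-store/cache lookup (hypothetical).
    return [0.1, 0.5, 0.9]

def predict_score(features: list[float]) -> float:
    # Stub for the real model; replace with your loaded model's predict call.
    return sum(features) / len(features)

class PredictRequest(BaseModel):
    user_id: str

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    start = time.perf_counter()
    request_id = str(uuid.uuid4())            # correlation ID for logs/traces
    features = get_features(req.user_id)
    score = predict_score(features)
    latency_ms = (time.perf_counter() - start) * 1000
    logging.info("request_id=%s latency_ms=%.2f", request_id, latency_ms)
    return {"request_id": request_id, "score": score}
```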
Asynchronous online (queue/callback)
- Trigger: seconds to minutes acceptable; work may spike; users can be notified later.
- Shape: Client -> Queue -> Workers -> Storage -> Notification/Polling.
- Notes: Use idempotency keys and retries; great for heavy models.
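A sketch of the queue/worker flow with an idempotency key; an in-process queue stands in for a real broker (SQS, RabbitMQ, Pub/Sub), and the prediction is a stub:

```python
import hashlib
import queue

jobs: queue.Queue = queue.Queue()
seen: set[str] = set()               # keys already accepted
results: dict[str, float] = {}       # stand-in for durable result storage

def idempotency_key(payload: str) -> str:
    # Deterministic key: resubmitting the same payload maps to the same job.
    return hashlib.sha256(payload.encode()).hexdigest()

def submit(payload: str) -> str:
    key = idempotency_key(payload)
    if key not in seen:              # duplicates are acknowledged, not re-enqueued
        seen.add(key)
        jobs.put((key, payload))
    return key                        # client polls for results with this key

def worker() -> None:
    while not jobs.empty():
        key, payload = jobs.get()
        results[key] = float(len(payload))   # stub prediction
        jobs.task_done()

key = submit("ticket: cannot log in")
submit("ticket: cannot log in")      # duplicate request, enqueued only once
worker()
print(results[key])
```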
Streaming / micro-batch
- Trigger: continuous events; need near-real-time (seconds) updates.
- Shape: Stream -> Micro-batch processors -> Feature aggregation -> Model scoring -> Sink.
- Notes: Windowing (e.g., 5s, 1m); exactly-once or at-least-once semantics.
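The windowing logic in miniature; a real pipeline would run on Flink, Spark Structured Streaming, or a Kafka consumer, and the anomaly model here is a stub with assumed thresholds:

```python
import time
from collections.abc import Iterable

WINDOW_S = 1.0        # flush on time...
MAX_BATCH = 3         # ...or on size, whichever comes first (assumed values)

def score_batch(readings: list[float]) -> list[bool]:
    # Stub anomaly model: flag readings above a fixed threshold.
    return [r > 80.0 for r in readings]

def run(stream: Iterable[float]) -> None:
    buffer: list[float] = []
    window_start = time.monotonic()
    for reading in stream:
        buffer.append(reading)
        timed_out = time.monotonic() - window_start >= WINDOW_S
        if timed_out or len(buffer) >= MAX_BATCH:
            flags = score_batch(buffer)      # one micro-batch scored together
            print(f"{len(buffer)} readings, {sum(flags)} anomalies")
            buffer, window_start = [], time.monotonic()
    if buffer:                               # flush the tail on shutdown
        print(f"{len(buffer)} readings, {sum(score_batch(buffer))} anomalies")

run([72.0, 85.5, 79.0, 91.2])                # simulated sensor readings
```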
Batch scoring
- Trigger: non-urgent predictions at scale; nightly/weekly jobs.
- Shape: Scheduler -> Job -> Read dataset -> Predict -> Write to table/storage.
- Notes: Great for cost control and complex feature joins.
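The batch shape reduces to read, score in chunks, write with version and timestamp. A sketch with in-memory rows standing in for tables; the model and version tag are placeholders:

```python
from datetime import datetime, timezone

MODEL_VERSION = "churn-2024-05-01"           # placeholder version tag

def predict(rows: list[dict]) -> list[float]:
    # Stub model: replace with a real batch predict call.
    return [min(1.0, row["days_inactive"] / 90) for row in rows]

def score_partition(rows: list[dict], chunk_size: int = 2) -> list[dict]:
    run_at = datetime.now(timezone.utc).isoformat()
    out: list[dict] = []
    for i in range(0, len(rows), chunk_size):    # chunks can run in parallel
        chunk = rows[i:i + chunk_size]
        for row, score in zip(chunk, predict(chunk)):
            out.append({"user_id": row["user_id"], "score": round(score, 3),
                        "model_version": MODEL_VERSION, "scored_at": run_at})
    return out

rows = [{"user_id": "u1", "days_inactive": 10},
        {"user_id": "u2", "days_inactive": 75}]
for record in score_partition(rows):
    print(record)
```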
Hybrid (batch precompute + online re-rank)
- Trigger: combine cheap bulk candidates with fast personalized reranking.
- Shape: Batch produce candidates -> Store -> Online service reranks for each user.
- Notes: Common for recommendations and search.
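The same split in code: candidates come from a nightly batch table (a dict stands in for the key-value store), and only a cheap rerank runs per request. The session-match scoring is illustrative:

```python
# Nightly batch output: per-user candidate lists, loaded from a store
# populated by the batch job (hypothetical data).
candidates = {"u1": ["laptop", "mouse", "desk", "monitor", "lamp"]}

def rerank(user_id: str, session_terms: set[str], top_k: int = 3) -> list[str]:
    # Cheap online step: boost candidates that match the live session context.
    items = candidates.get(user_id, [])
    scored = [(int(item in session_terms), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep batch order
    return [item for _, item in scored[:top_k]]

print(rerank("u1", session_terms={"monitor", "lamp"}))
# ['monitor', 'lamp', 'laptop']
```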
Architecture steps (text sketches)
Online synchronous:
- Receive request with correlation ID.
- Fetch features from cache/feature store.
- Run model, return response; log latency and features used.
- Emit metrics and traces; enable autoscaling on CPU/GPU/QPS.
Asynchronous online:
- Validate request; enqueue with idempotency key.
- Worker consumes, fetches inputs, predicts, stores result.
- Notify client or allow polling.
- Retry on failure; dead-letter queue for poison messages (see the sketch after these steps).
Batch:
- Schedule the job.
- Read snapshot/partitioned data.
- Predict in parallel batches.
- Write outputs with version and timestamp; produce a run report.
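The retry/dead-letter step from the asynchronous list, sketched with an attempt counter; the limit of 3 attempts is an assumption:

```python
import queue

MAX_ATTEMPTS = 3                         # assumed retry budget
jobs: queue.Queue = queue.Queue()
dead_letter: list[dict] = []

def handle(msg: dict) -> None:
    raise ValueError("bad payload")      # simulated poison message

def consume() -> None:
    while not jobs.empty():
        msg = jobs.get()
        try:
            handle(msg)
        except Exception:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dead_letter.append(msg)  # park for manual inspection
            else:
                jobs.put(msg)            # re-enqueue for another attempt
        jobs.task_done()

jobs.put({"payload": "corrupt"})
consume()
print(len(dead_letter))                  # 1: gave up after MAX_ATTEMPTS
```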
Worked examples
1) Fraud check at checkout (online synchronous)
- Latency target: 100 ms P95. SLA breach stalls checkout.
- Pattern: Online sync with feature cache; warm containers; circuit breaker to fallback rules.
- Extras: Request batching disabled (adds latency); shadow new model for comparison.
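The circuit breaker mentioned above can be as simple as counting consecutive failures and switching to a rules fallback; the thresholds and the rule itself are assumptions:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; retry the model after a cooldown."""
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.failures, self.opened_at = 0, 0.0

    def call(self, model_fn, fallback_fn, *args):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback_fn(*args)    # circuit open: use rules fallback
            self.failures = 0                # cooldown over: try the model again
        try:
            result = model_fn(*args)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            return fallback_fn(*args)

def model_score(amount: float) -> float:
    raise TimeoutError("model overloaded")   # simulated outage

def rules_score(amount: float) -> float:
    return 1.0 if amount > 10_000 else 0.1   # simple fallback rule

breaker = CircuitBreaker()
print(breaker.call(model_score, rules_score, 12_500.0))  # model fails -> 1.0
```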
2) Monthly churn scores (batch)
- Latency target: results within 24 hours of the run. Marketing consumes a table of scores.
- Pattern: Batch job with partitioned input by month; backfill-friendly.
- Extras: Write to a versioned table; include data lineage and model hash.
3) Recommendations (hybrid)
- Batch: Generate 200 candidate items per user nightly.
- Online: Re-rank top 20 based on session context in 50 ms.
- Extras: Canary deploy reranker; log impressions for retraining.
Quick sizing and SLOs
- Latency budget: Network + feature fetch + model + serialization. If budget is 120 ms, aim for model compute < 60 ms.
- Throughput: QPS × avg work per request. Use autoscaling and request batching (if latency allows).
- Batch window ≈ (input size × per-record cost) / parallelism. Leave margin for retries.
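The same arithmetic as a short sanity-check script; every number is an illustrative assumption:

```python
# Latency budget: subtract fixed overheads to find the model's share.
budget_ms = 120
network_ms, feature_fetch_ms, serialization_ms = 20, 30, 10
model_ms = budget_ms - network_ms - feature_fetch_ms - serialization_ms
print(f"model compute budget: {model_ms} ms")            # 60 ms

# Batch window: (records x per-record cost) / parallelism, plus retry margin.
records = 10_000_000
per_record_s = 0.002
workers = 64
runtime_min = records * per_record_s / workers / 60
print(f"estimated runtime: {runtime_min:.1f} min (add ~20% for retries)")
```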
Implementation checklist
- Define SLA: latency target (P95), error budget, throughput range.
- Decide pattern: sync, async, streaming, batch, or hybrid.
- Inputs/outputs: schemas, idempotency key, versioning.
- Scaling: min/max replicas; GPU/CPU; concurrency per worker.
- Resilience: retries, timeouts, circuit breakers, dead-letter queues.
- Observability: request IDs, structured logs, metrics, traces, model/feature version tags.
- Rollouts: canary, shadow, rollback plan.
- Data management: feature freshness, training-serving skew checks, drift alerts.
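One way to make the observability items concrete: emit one structured log line per prediction carrying the request ID and version tags. A minimal standard-library sketch; the version strings are placeholders:

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("serving")

MODEL_VERSION = "fraud-v7"          # placeholder tags
FEATURE_VERSION = "features-v3"

def log_prediction(request_id: str, latency_ms: float, score: float) -> None:
    # One JSON object per line: easy to parse, filter, and alert on.
    logger.info(json.dumps({
        "request_id": request_id,
        "model_version": MODEL_VERSION,
        "feature_version": FEATURE_VERSION,
        "latency_ms": round(latency_ms, 2),
        "score": score,
    }))

log_prediction(str(uuid.uuid4()), latency_ms=42.7, score=0.93)
```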
Exercises
Exercise 1 (ex1): Pick the right serving pattern
For each scenario, choose a serving pattern and justify with SLA, cost, and data freshness.
- A) Mobile app displays risk score during signup (target P95 200 ms).
- B) Marketing team wants new lead scores by 8 AM daily.
- C) IoT sensors stream temperature data; alert if anomaly within 10 seconds.
Deliverable: a short table or bullets mapping scenario → pattern → 2–3 reasons.
Exercise 2 (ex2): Design minimal interfaces
Design both an online endpoint and a batch job for the same churn model.
- Online: define request/response JSON with idempotency key and version.
- Batch: define input table fields, output table fields, and daily schedule.
- Include scaling triggers and a retry policy.
Common mistakes and self-check
- Mistake: Using online sync for workloads with minute-level tolerance. Self-check: Is user waiting? If not, consider async/batch.
- Mistake: Ignoring feature freshness. Self-check: Document max acceptable data staleness per feature.
- Mistake: No idempotency for async/batch. Self-check: Can you safely retry the same request/job?
- Mistake: Unbounded latency due to cold starts. Self-check: Min replicas, warm-up probes, and caches configured?
- Mistake: No rollout strategy. Self-check: Canary/shadow and rollback documented?
Practical projects
- Build a REST model service with a 150 ms P95 target and autoscaling. Add structured logs and P95 metrics.
- Create a nightly batch scoring pipeline that writes to a versioned table and generates a run report with success/fail counts.
- Implement a hybrid recommender: batch candidate generation + online reranker with canary deployment.
- Set up an async queue-based endpoint for long-running image classification with callback notification.
Learning path
- Start: Understand SLAs and traffic patterns; pick the serving pattern.
- Next: Define interfaces and schemas; add idempotency and versioning.
- Then: Add observability and autoscaling; define rollout strategy.
- Finally: Implement drift monitoring and data-quality checks.
Next steps
- Complete the exercises; compare with the provided solutions.
- Take the quick test below to confirm understanding.
- Apply a chosen pattern to a small internal project this week.
Mini challenge
You need to classify support tickets into categories. Response time within 2 minutes is fine. Volume spikes during the day. Propose a serving pattern, minimal architecture, and two resilience mechanisms. Keep it to 5–7 bullet points.