
Serving Patterns: Online and Batch

Learn online and batch serving patterns for free, with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Why this matters

As an MLOps Engineer, choosing the right serving pattern determines reliability, cost, and user experience. You will repeatedly decide between online (real-time) and batch (offline) inference, or a hybrid of the two. These choices affect SLAs, infrastructure, data freshness, monitoring, and how fast you can ship improvements.

  • Product impact: low-latency recommendations, fraud checks, search ranking.
  • Operations impact: nightly scoring jobs, weekly reports, backfills.
  • Risk control: blue/green, canary, shadow, rollbacks.

Who this is for

  • MLOps Engineers and ML Engineers moving models to production.
  • Data Engineers supporting inference pipelines.
  • Software Engineers integrating ML endpoints.

Prerequisites

  • Know how to package a model (e.g., in a container) and expose a simple API or job.
  • Basic understanding of message queues, schedulers, and autoscaling.
  • Familiarity with metrics and logging (latency, throughput, errors).

Concept explained simply

Serving patterns decide when, how fast, and how your model makes predictions.

  • Online (real-time): request comes in, model predicts now. Typical latency target: milliseconds to seconds.
  • Async online: the request is queued and the prediction is delivered later (via notification, callback, or polling).
  • Streaming/micro-batch: continuous data, predictions processed in small batches (e.g., every few seconds).
  • Batch: large dataset processed on a schedule (hourly, nightly, weekly), results stored for later use.
  • Hybrid: bulk precompute (batch) + quick online re-ranking or filtering.

Mental model: Think of a restaurant.

  • Online = Ă  la carte: cook per order, low wait time, higher per-plate cost.
  • Batch = catering: prepare many meals at once, cheaper per portion, not instant.
  • Hybrid = cater most, then finish Ă  la minute for freshness.

How to choose a serving pattern

  • Latency/SLA: Do users need the answer within 100 ms? Choose online synchronous. If minutes are fine, async or batch.
  • Throughput and cost: High QPS with moderate latency tolerance can use request batching, caches, or async.
  • Feature freshness: If features require expensive joins, precompute with batch and serve final adjustments online.
  • Error tolerance: If retries are fine and exact timing isn’t critical, batch or async fits.
  • Data delivery mode: Event streams suggest streaming/micro-batch; static tables suggest batch.

Key patterns (with triggers)

Synchronous online (request-response)
  • Trigger: sub-200 ms latency target; user-facing action (search, checkout risk).
  • Shape: Client -> API Gateway -> Model Service -> Feature Store/Cache -> Response.
  • Notes: Add request batching (combining several small requests into one model call), caching, and autoscaling.
Asynchronous online (queue/callback)
  • Trigger: seconds to minutes acceptable; work may spike; users can be notified later.
  • Shape: Client -> Queue -> Workers -> Storage -> Notification/Polling.
  • Notes: Use idempotency keys and retries; great for heavy models.
Streaming / micro-batch
  • Trigger: continuous events; need near-real-time (seconds) updates.
  • Shape: Stream -> Micro-batch processors -> Feature aggregation -> Model scoring -> Sink.
  • Notes: Windowing (e.g., 5s, 1m); exactly-once or at-least-once semantics.
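
A minimal sketch of time-windowed micro-batching, using only the Python standard library; read_events and score_batch are hypothetical stand-ins for your stream consumer and model.

# Minimal micro-batch loop: collect events for a fixed window, then score them together.
# read_events and score_batch are hypothetical placeholders for a stream consumer and model.
import time
from typing import Iterable

WINDOW_SECONDS = 5.0

def read_events() -> Iterable[dict]:
    """Stand-in for a stream consumer (Kafka, Kinesis, Pub/Sub, ...)."""
    while True:
        yield {"sensor_id": "s-1", "temperature": 21.7, "ts": time.time()}
        time.sleep(0.5)

def score_batch(events: list[dict]) -> list[float]:
    """Stand-in for model scoring over one micro-batch."""
    return [0.0 for _ in events]

def run_micro_batches() -> None:
    window: list[dict] = []
    window_start = time.monotonic()
    for event in read_events():
        window.append(event)
        if time.monotonic() - window_start >= WINDOW_SECONDS:
            scores = score_batch(window)  # score the whole window at once
            print(f"scored {len(scores)} events in one micro-batch")
            window, window_start = [], time.monotonic()

if __name__ == "__main__":
    run_micro_batches()
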
Batch scoring
  • Trigger: non-urgent predictions at scale; nightly/weekly jobs.
  • Shape: Scheduler -> Job -> Read dataset -> Predict -> Write to table/storage.
  • Notes: Great for cost control and complex feature joins.
Hybrid (batch precompute + online re-rank)
  • Trigger: combine cheap bulk candidates with fast personalized reranking.
  • Shape: Batch produce candidates -> Store -> Online service reranks for each user.
  • Notes: Common for recommendations and search.
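
A minimal sketch of the hybrid pattern; the in-memory dict stands in for a real key-value store (Redis, DynamoDB, ...), and the scoring logic is a hypothetical placeholder.

# Hybrid sketch: a batch job precomputes candidates per user;
# the online service reranks a small slice of them per request.
candidate_store: dict[str, list[dict]] = {}

def batch_generate_candidates(user_ids: list[str]) -> None:
    """Batch step: precompute ~200 candidates per user and store them."""
    for user_id in user_ids:
        candidate_store[user_id] = [
            {"item_id": f"item-{i}", "base_score": 1.0 / (i + 1)} for i in range(200)
        ]

def online_rerank(user_id: str, session_boost: dict[str, float], top_k: int = 20) -> list[str]:
    """Online step: cheap per-request rerank of the precomputed candidates."""
    candidates = candidate_store.get(user_id, [])
    reranked = sorted(
        candidates,
        key=lambda c: c["base_score"] + session_boost.get(c["item_id"], 0.0),
        reverse=True,
    )
    return [c["item_id"] for c in reranked[:top_k]]

batch_generate_candidates(["user-42"])
print(online_rerank("user-42", {"item-7": 2.0}))  # session context promotes item-7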

Architecture steps (text sketches)

Online sync:
  1. Receive request with correlation ID.
  2. Fetch features from cache/feature store.
  3. Run model, return response; log latency and features used.
  4. Emit metrics and traces; enable autoscaling on CPU/GPU/QPS.
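
A minimal sketch of these steps, assuming FastAPI and Pydantic are installed; fetch_features and model_predict are hypothetical stand-ins for your feature-store client and loaded model.

# Synchronous online endpoint: correlation ID, feature fetch, predict, latency log.
import logging
import time
import uuid

from fastapi import FastAPI, Header
from pydantic import BaseModel

app = FastAPI()
logger = logging.getLogger("model-service")

class PredictRequest(BaseModel):
    user_id: str

class PredictResponse(BaseModel):
    score: float
    model_version: str
    correlation_id: str

def fetch_features(user_id: str) -> dict:
    return {"tenure_days": 120, "events_7d": 14}  # cache/feature-store lookup goes here

def model_predict(features: dict) -> float:
    return 0.42  # real model call goes here

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest, x_correlation_id: str | None = Header(default=None)):
    correlation_id = x_correlation_id or str(uuid.uuid4())
    start = time.perf_counter()
    features = fetch_features(req.user_id)
    score = model_predict(features)
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info("correlation_id=%s latency_ms=%.1f", correlation_id, latency_ms)
    return PredictResponse(score=score, model_version="v1", correlation_id=correlation_id)
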
Async:
  1. Validate request; enqueue with idempotency key.
  2. Worker consumes, fetches inputs, predicts, stores result.
  3. Notify client or allow polling.
  4. Retry on failure; dead-letter queue for poison messages.
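
A minimal sketch of the same flow using only the Python standard library; a production system would use a managed queue (e.g., SQS, Pub/Sub, RabbitMQ) and durable storage, and predict is a hypothetical stand-in for the model call.

# Async pattern: enqueue with idempotency key, worker retries, dead-letter on failure.
import queue
import threading

task_queue: queue.Queue = queue.Queue()
results: dict[str, float] = {}   # stands in for durable result storage
dead_letter: list[dict] = []     # poison messages end up here
MAX_ATTEMPTS = 3

def predict(payload: dict) -> float:
    return 0.42  # hypothetical model call

def submit(idempotency_key: str, payload: dict) -> None:
    if idempotency_key in results:  # duplicate request: result already exists
        return
    task_queue.put({"key": idempotency_key, "payload": payload, "attempts": 0})

def worker() -> None:
    while True:
        task = task_queue.get()
        try:
            results[task["key"]] = predict(task["payload"])
        except Exception:
            task["attempts"] += 1
            if task["attempts"] < MAX_ATTEMPTS:
                task_queue.put(task)      # retry
            else:
                dead_letter.append(task)  # give up: dead-letter queue
        finally:
            task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
submit("req-123", {"user_id": "user-42"})
task_queue.join()
print(results.get("req-123"))  # client polls (or is notified) for the result
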
Batch:
  1. Schedule job.
  2. Read snapshot/partitioned data.
  3. Predict in parallel batches.
  4. Write outputs with version and timestamp; produce run report.
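
A minimal sketch assuming pandas is installed; the file paths, column names, and model_predict are hypothetical placeholders.

# Batch scoring: read in chunks, score, write versioned output, print a run report.
from datetime import datetime, timezone

import pandas as pd

MODEL_VERSION = "churn-v3"
INPUT_PATH = "customers_snapshot.csv"
OUTPUT_PATH = f"churn_scores_{datetime.now(timezone.utc):%Y%m%d}.csv"

def model_predict(batch: pd.DataFrame) -> pd.Series:
    return pd.Series(0.5, index=batch.index)  # real model call goes here

rows_ok, rows_failed = 0, 0
run_ts = datetime.now(timezone.utc).isoformat()

for i, chunk in enumerate(pd.read_csv(INPUT_PATH, chunksize=10_000)):
    try:
        chunk["score"] = model_predict(chunk)
        chunk["model_version"] = MODEL_VERSION
        chunk["scored_at"] = run_ts
        chunk.to_csv(OUTPUT_PATH, mode="a", header=(i == 0), index=False)
        rows_ok += len(chunk)
    except Exception:
        rows_failed += len(chunk)  # in practice: log the failure and retry the chunk

print(f"run report: ok={rows_ok} failed={rows_failed} version={MODEL_VERSION}")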

Worked examples

1) Fraud check at checkout (online synchronous)
  • Latency target: 100 ms P95. SLA breach stalls checkout.
  • Pattern: Online sync with feature cache; warm containers; circuit breaker to fallback rules.
  • Extras: Request batching disabled (it adds latency); shadow the new model for comparison.
2) Monthly churn scores (batch)
  • Latency target: 24 hours. Marketing uses a table of scores.
  • Pattern: Batch job with partitioned input by month; backfill-friendly.
  • Extras: Write to a versioned table; include data lineage and model hash.
3) Recommendations (hybrid)
  • Batch: Generate 200 item candidates per user nightly.
  • Online: Re-rank top 20 based on session context in 50 ms.
  • Extras: Canary deploy reranker; log impressions for retraining.

Quick sizing and SLOs

  • Latency budget: network + feature fetch + model + serialization. If the total budget is 120 ms, aim for model compute < 60 ms.
  • Throughput: required capacity ≈ QPS × average work per request. Use autoscaling and request batching (if latency allows).
  • Batch window: runtime ≈ (record count × per-record time) / parallelism. Leave margin for retries; a back-of-the-envelope sketch follows below.
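
A quick sizing sketch with made-up numbers; replace them with your own measurements.

# Back-of-the-envelope sizing; all numbers are illustrative assumptions.

# Latency budget: 120 ms total, split across the request path.
network_ms, feature_fetch_ms, serialization_ms = 20, 30, 10
model_budget_ms = 120 - (network_ms + feature_fetch_ms + serialization_ms)
print(f"model compute budget: {model_budget_ms} ms")  # 60 ms

# Throughput: replicas needed at peak, given per-request compute time.
peak_qps, per_request_ms, concurrency_per_replica = 300, 40, 4
replicas = (peak_qps * per_request_ms / 1000) / concurrency_per_replica
print(f"replicas needed (no headroom): {replicas:.1f}")

# Batch window: runtime = records * per-record time / parallelism.
records, per_record_ms, workers = 5_000_000, 2, 32
runtime_hours = records * per_record_ms / 1000 / workers / 3600
print(f"estimated batch runtime: {runtime_hours:.2f} h")
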

Implementation checklist

  • Define SLA: latency target (P95), error budget, throughput range.
  • Decide pattern: sync, async, streaming, batch, or hybrid.
  • Inputs/outputs: schemas, idempotency key, versioning.
  • Scaling: min/max replicas; GPU/CPU; concurrency per worker.
  • Resilience: retries, timeouts, circuit breakers, dead-letter queues.
  • Observability: request IDs, structured logs, metrics, traces, model/feature version tags.
  • Rollouts: canary, shadow, rollback plan.
  • Data management: feature freshness, training-serving skew checks, drift alerts.

Exercises


Exercise 1 (ex1): Pick the right serving pattern

For each scenario, choose a serving pattern and justify with SLA, cost, and data freshness.

  • A) Mobile app displays risk score during signup (target P95 200 ms).
  • B) Marketing team wants new lead scores by 8 AM daily.
  • C) IoT sensors stream temperature data; alert if anomaly within 10 seconds.

Deliverable: a short table or bullets mapping scenario → pattern (e.g., online sync, batch, streaming) → 2–3 reasons.

Exercise 2 (ex2): Design minimal interfaces

Design both an online endpoint and a batch job for the same churn model.

  • Online: define request/response JSON with idempotency key and version.
  • Batch: define input table fields, output table fields, and daily schedule.
  • Include scaling triggers and a retry policy.

Common mistakes and self-check

  • Mistake: Using online sync for workloads with minute-level tolerance. Self-check: Is a user actively waiting? If not, consider async or batch.
  • Mistake: Ignoring feature freshness. Self-check: Document max acceptable data staleness per feature.
  • Mistake: No idempotency for async/batch. Self-check: Can you safely retry the same request/job?
  • Mistake: Unbounded latency due to cold starts. Self-check: Min replicas, warm-up probes, and caches configured?
  • Mistake: No rollout strategy. Self-check: Canary/shadow and rollback documented?

Practical projects

  • Build a REST model service with a 150 ms P95 target and autoscaling. Add structured logs and P95 metrics.
  • Create a nightly batch scoring pipeline that writes to a versioned table and generates a run report with success/fail counts.
  • Implement a hybrid recommender: batch candidate generation + online reranker with canary deployment.
  • Set up an async queue-based endpoint for long-running image classification with callback notification.

Learning path

  • Start: Understand SLAs and traffic patterns; pick the serving pattern.
  • Next: Define interfaces and schemas; add idempotency and versioning.
  • Then: Add observability and autoscaling; define rollout strategy.
  • Finally: Implement drift monitoring and data-quality checks.

Next steps

  • Complete the exercises; compare with the provided solutions.
  • Take the quick test below to confirm understanding.
  • Apply a chosen pattern to a small internal project this week.

Mini challenge

You need to classify support tickets into categories. Response time within 2 minutes is fine. Volume spikes during the day. Propose a serving pattern, minimal architecture, and two resilience mechanisms. Keep it to 5–7 bullet points.


Serving Patterns: Online and Batch — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

