Who this is for
This lesson is for Machine Learning Engineers and backend-focused Data Scientists who need to decide how to serve model predictions in production systems and communicate those trade-offs to product and platform teams.
Prerequisites
- Basic understanding of ML model training and evaluation (classification/regression).
- Familiarity with REST/gRPC APIs and message queues or schedulers.
- Comfort with reading simple latency/throughput metrics (p95/p99, RPS, batch size).
Why this matters
Real tasks you will own:
- Choose whether a fraud detection model should run on every transaction (online) or hourly in bulk (batch).
- Design APIs and data flows that meet strict SLAs without overspending on compute.
- Coordinate offline feature pipelines with online features so predictions match training-time data.
- Plan backfills, re-scores, and A/B tests safely without blocking user-facing traffic.
Concept explained simply
Batch inference runs the model on many records at once on a schedule or trigger and stores the results for later use. Online (real-time) inference runs the model per request and returns a prediction immediately to a caller.
- Batch: High throughput, lower cost per prediction, higher latency (minutes to hours), great for periodic decisions and large volumes.
- Online: Low latency (milliseconds), higher cost per prediction, great for user-facing decisions that change the UX.
Mental model
Pick based on how tightly the prediction must couple to a user action:
- Tight coupling (the user waits) -> Online inference
- Loose coupling (the result can wait) -> Batch inference
Then check constraints:
- Latency budget?
- Freshness needs?
- Cost and traffic patterns?
- Consistency with training features?
- Failure modes and fallbacks?
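To make the mental model concrete, here is a minimal Python sketch of a decision helper. The ServingRequirements fields and the thresholds (500 ms, 1 hour of staleness, 100k predictions/day) are illustrative assumptions, not hard rules.

```python
from dataclasses import dataclass

@dataclass
class ServingRequirements:
    user_waits: bool              # does a caller block on this prediction?
    latency_budget_ms: float      # end-to-end p95 budget
    freshness_tolerance_s: float  # how stale a prediction may be
    daily_volume: int             # predictions per day

def suggest_serving_mode(req: ServingRequirements) -> str:
    """Encode the coupling-first mental model; thresholds are illustrative."""
    if req.user_waits and req.latency_budget_ms <= 500:
        # Tight coupling: the caller blocks on the result.
        return "online"
    if req.freshness_tolerance_s >= 3600 and req.daily_volume > 100_000:
        # Loose coupling at scale: schedule it and store the results.
        return "batch"
    # Mixed constraints often point to precompute-offline, assemble-online.
    return "hybrid (precompute offline, finalize online)"

print(suggest_serving_mode(ServingRequirements(True, 120, 60, 2_000_000)))       # online
print(suggest_serving_mode(ServingRequirements(False, 5000, 86_400, 5_000_000)))  # batch
```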
Decision checklist (use this when choosing)
- Latency budget: What is p95 allowed? (e.g., 100 ms end-to-end)
- Freshness: How old can features/predictions be? (e.g., 1 hour, 1 day)
- Throughput: How many predictions per minute/hour/day?
- Cost: Can you batch to lower compute cost? Any idle time to eliminate?
- Consistency: Do offline and online features match definitions?
- Failure handling: Cached fallback? Default score? Queue and retry?
- Auditability: Do you need a full record of inputs/predictions?
Worked examples
1) Nightly churn risk scores (Batch)
- Goal: Email retention offers next morning.
- Constraints: No user is waiting; millions of users.
- Design: Nightly job reads user features, predicts churn in bulk, writes scores to a table/warehouse and downstream CRM.
- Why batch: Highest throughput and lowest cost; freshness is acceptable (24h).
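A minimal sketch of such a nightly job, assuming a scikit-learn-style classifier exported with joblib and user features stored as Parquet; the paths, column names, and run-date partitioning scheme are hypothetical.

```python
import datetime as dt
import os

import joblib      # assumes the trained model was exported with joblib
import pandas as pd

def score_churn_batch(features_path: str, model_path: str, output_dir: str) -> None:
    """Score all users in bulk and write results partitioned by run date."""
    run_date = dt.date.today().isoformat()
    model = joblib.load(model_path)               # hypothetical model artifact
    features = pd.read_parquet(features_path)     # one row per user
    scores = model.predict_proba(features.drop(columns=["user_id"]))[:, 1]

    out = pd.DataFrame({
        "user_id": features["user_id"],
        "churn_score": scores,
        "run_date": run_date,                     # partition key: reruns overwrite, not append
    })
    partition_dir = os.path.join(output_dir, f"run_date={run_date}")
    os.makedirs(partition_dir, exist_ok=True)
    out.to_parquet(os.path.join(partition_dir, "scores.parquet"), index=False)

# Example call (paths are hypothetical):
# score_churn_batch("data/churn_features.parquet", "models/churn.joblib", "output/churn_scores")
```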
2) Checkout fraud decision (Online)
- Goal: Approve/decline within the checkout request.
- Constraints: p95 < 120 ms end-to-end; high correctness; traffic spikes.
- Design: Low-latency API with warm instances, cached features (e.g., account age, velocity), and circuit breaker with safe defaults.
- Why online: Decision blocks user; must be immediate.
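A sketch of what the low-latency endpoint could look like using FastAPI. The feature-cache lookup, model call, 50 ms feature timeout, and approve-by-default fallback are placeholders that illustrate timeouts and safe defaults, not a production risk policy.

```python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
SAFE_DEFAULT = {"decision": "approve", "score": 0.0, "fallback": True}  # business-approved default

class Transaction(BaseModel):
    account_id: str
    amount: float

async def fetch_cached_features(account_id: str) -> dict:
    # Placeholder for a fast feature-cache lookup (e.g., an in-memory or Redis cache).
    return {"account_age_days": 420, "txn_velocity_1h": 3}

def model_score(features: dict, txn: Transaction) -> float:
    # Placeholder for an in-memory model call.
    return 0.02

@app.post("/fraud/score")
async def score(txn: Transaction):
    try:
        # Keep the feature fetch inside the latency budget; fall back if it times out.
        features = await asyncio.wait_for(fetch_cached_features(txn.account_id), timeout=0.05)
    except asyncio.TimeoutError:
        return SAFE_DEFAULT  # fail open (or closed) per your risk policy
    s = model_score(features, txn)
    return {"decision": "decline" if s > 0.9 else "approve", "score": s, "fallback": False}
```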
3) Recommendations (Hybrid)
- Goal: Personalized product lists on page load.
- Design: Batch precomputes user/item embeddings and nearest-neighbor indexes. Online service uses fresh context (page/category) to select from precomputed candidates.
- Why hybrid: Heavy compute done offline; light personalization done online to meet latency and cost targets.
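A toy sketch of the online half of this hybrid: precomputed candidates (as a batch job would have written them to a cache) get a cheap context-based rerank per request. The candidate scores and category boosts are made-up values.

```python
from typing import Dict, List, Tuple

# Assume a nightly batch job has already written top-N candidates per user,
# loaded here from a cache or key-value store at service startup.
PRECOMPUTED: Dict[str, List[Tuple[str, float]]] = {
    "user_42": [("item_a", 0.91), ("item_b", 0.88), ("item_c", 0.80)],
}

# Illustrative per-category boosts; in practice these might come from config or a small model.
CATEGORY_BOOST = {"electronics": {"item_b": 0.10}, "books": {"item_c": 0.15}}

def recommend(user_id: str, page_category: str, k: int = 2) -> List[str]:
    """Online step: cheap rerank of precomputed candidates using fresh page context."""
    candidates = PRECOMPUTED.get(user_id, [])
    boosts = CATEGORY_BOOST.get(page_category, {})
    reranked = sorted(candidates, key=lambda c: c[1] + boosts.get(c[0], 0.0), reverse=True)
    return [item for item, _ in reranked[:k]]

print(recommend("user_42", "electronics"))  # ['item_b', 'item_a']
```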
4) Demand forecasting (Batch) with alerting (Online-ish)
- Goal: Daily replenishment decisions; alert on sudden anomalies.
- Design: Batch daily forecast; separate lightweight streaming rule-based alert for spikes.
- Why: Main decision doesn't need real-time; alerts do, but not full model inference.
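A minimal sketch of the lightweight streaming rule, assuming hourly demand values arrive one at a time; the window size and spike ratio are illustrative, not tuned.

```python
from collections import deque
from statistics import mean

class SpikeAlert:
    """Rule-based alert: fire when the latest value far exceeds the recent average."""

    def __init__(self, window: int = 24, ratio: float = 3.0):
        self.history = deque(maxlen=window)
        self.ratio = ratio

    def observe(self, demand: float) -> bool:
        # Only alert once there is enough history to form a baseline.
        alert = len(self.history) >= 3 and demand > self.ratio * mean(self.history)
        self.history.append(demand)
        return alert

detector = SpikeAlert()
for hour_demand in [100, 110, 95, 105, 400]:
    if detector.observe(hour_demand):
        print(f"Spike alert: demand={hour_demand}")  # fires on 400
```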
How to choose in 5 steps
- Write the latency budget: Include network, auth, feature fetch, model compute, and serialization (see the budget sketch after this list).
- Define freshness: Feature and prediction staleness tolerances.
- Estimate traffic: RPS for online, volume/batch size for batch; plan p95/p99.
- Map failure modes: Timeouts, partial data, model unavailability; pick fallback.
- Run a small load test: Measure real timings and costs before committing.
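As a concrete example of step 1, here is a toy budget breakdown for a hypothetical 120 ms p95 target; the component timings are assumptions you would replace with measured values.

```python
# Hypothetical component timings for a 120 ms p95 end-to-end budget.
BUDGET_MS = 120
components_ms = {
    "network + TLS": 15,
    "auth": 5,
    "feature fetch (cache hit)": 20,
    "model compute": 40,
    "serialization + response": 10,
}

spent = sum(components_ms.values())
headroom = BUDGET_MS - spent
print(f"spent={spent} ms, headroom={headroom} ms")  # spent=90 ms, headroom=30 ms
assert headroom >= 0, "budget exceeded: shrink a component or relax the SLA"
```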
Architecture templates
Batch template
Scheduler -> Feature Store/Warehouse -> Batch Compute (Spark/Dask/DB proc)
-> Write predictions table -> Downstream sync (CRM, cache, DB)
Key controls: idempotent writes, partitioning by date, lineage logs.
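A minimal sketch of idempotent writes: a stable key of (user_id, run_date) plus upsert semantics means reruns overwrite rather than duplicate rows. Shown with SQLite upsert syntax for portability; the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE predictions (
        user_id  TEXT,
        run_date TEXT,
        score    REAL,
        PRIMARY KEY (user_id, run_date)   -- stable key makes reruns idempotent
    )
""")

rows = [("u1", "2024-01-01", 0.37), ("u2", "2024-01-01", 0.81)]
for _ in range(2):  # simulate a rerun of the same batch job
    conn.executemany(
        """INSERT INTO predictions (user_id, run_date, score) VALUES (?, ?, ?)
           ON CONFLICT (user_id, run_date) DO UPDATE SET score = excluded.score""",
        rows,
    )

print(conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0])  # 2, not 4
```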
Online template
Client -> API Gateway -> Inference Service -> Feature Cache/Store
                                          |-> Model (in-memory)
                                          |-> Metrics/Tracing
Key controls: autoscaling, warm starts, timeouts, fallbacks, canary.
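One of those controls, the fallback path, can be a simple circuit breaker. A minimal sketch, assuming a synchronous predict function and a business-approved safe default; the failure threshold and cooldown values are illustrative.

```python
import time

class CircuitBreaker:
    """After N consecutive failures, serve the safe default for a cooldown
    period instead of calling the model."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def call(self, predict_fn, fallback):
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback        # circuit open: skip the model entirely
            self.failures = 0          # cooldown elapsed: try the model again
        try:
            result = predict_fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback
```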
Hybrid template
Batch: build embeddings/candidates -> Cache/Index
Online: fetch candidates + light rerank -> return
Key controls: index refresh schedule, versioning, feature parity tests.
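A sketch of a feature parity test comparing offline (batch) and online feature values for the same entities; the feature names, sample values, and tolerance are assumptions.

```python
import math

def find_parity_mismatches(batch_rows: dict, online_rows: dict, tol: float = 1e-6) -> list:
    """Return (entity, feature, batch_value, online_value) tuples that disagree.
    Run on a sample of entities in CI or on a schedule."""
    mismatches = []
    for entity_id, batch_features in batch_rows.items():
        online_features = online_rows.get(entity_id, {})
        for name, batch_value in batch_features.items():
            online_value = online_features.get(name)
            if online_value is None or not math.isclose(batch_value, online_value, abs_tol=tol):
                mismatches.append((entity_id, name, batch_value, online_value))
    return mismatches

batch = {"u1": {"account_age_days": 420.0, "txn_velocity_1h": 3.0}}
online = {"u1": {"account_age_days": 420.0, "txn_velocity_1h": 4.0}}
print(find_parity_mismatches(batch, online))  # [('u1', 'txn_velocity_1h', 3.0, 4.0)]
```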
Common mistakes and self-check
- Mistake: Using online inference when the user isn’t waiting. Fix: Move to batch and cache results.
- Mistake: Feature drift between training and serving. Fix: Shared feature definitions and parity tests.
- Mistake: Ignoring tail latency. Fix: Measure p95/p99; set timeouts and fallbacks.
- Mistake: No idempotency in batch writes. Fix: Use stable keys and upserts.
- Mistake: Over-provisioning always-on GPUs. Fix: Right-size and batch, or use hybrid.
Self-check:
- Can you state the exact latency and freshness budgets?
- Do you have a fallback for model or feature store outages?
- Can you reproduce a prediction from logs for auditing?
- Did you verify feature parity across batch and online?
Exercises
Try these on your own first, then compare your answers with the provided solutions.
Exercise 1: Classify scenarios
- A) Weekly credit limit reviews
- B) Content moderation for a live chat message
- C) Pricing suggestions shown after the seller clicks “List item”
- D) Personalized email subject lines sent daily
Decide: Batch, Online, or Hybrid. One sentence of justification each.
Exercise 2: Design the I/O
Design inputs/outputs and latency targets for a real-time fraud check endpoint and a nightly risk score batch job. Include fallback behavior. Check your design against this list:
- [ ] Wrote latency and freshness budgets
- [ ] Picked serving mode and justified it
- [ ] Defined failure modes and fallbacks
- [ ] Sketched metrics to monitor
Practical projects
- Batch: Build a scheduled job that scores a million records, writes to a partitioned table, and supports idempotent reruns.
- Online: Deploy a small inference API with request/response logging, p95 latency SLI, and a cached default fallback.
- Hybrid: Precompute embeddings offline; serve a top-K reranker online; measure the cost and latency delta.
Learning path
- Before: Request/response design, feature stores, and model packaging.
- This lesson: Decide serving mode and design around latency, freshness, and cost.
- After: Autoscaling strategies, canary releases, observability (metrics, tracing, drift alarms).
Next steps
- Run a capacity test for your current service and document p95/p99.
- Add a parity test comparing batch vs online feature values for 1000 entities.
- Implement a simple circuit breaker with a safe default prediction.
Mini challenge
Your product team wants “instant” shipping ETA on the order confirmation page. Features include user address, carrier capacity (refreshes every 5 minutes), and historical delivery times. Latency budget is 150 ms p95. Choose Batch, Online, or Hybrid and justify.
Sample answer
Hybrid: precompute base ETAs in batch from historical delivery times, then fetch fresh carrier capacity online and adjust them per request. This meets the 150 ms budget with light online compute while keeping cost low.