Who this is for
This lesson is for Machine Learning Engineers and backend-focused Data Scientists who need to decide how to serve model predictions in production systems and communicate those trade-offs to product and platform teams.
Prerequisites
- Basic understanding of ML model training and evaluation (classification/regression).
- Familiarity with REST/gRPC APIs and message queues or schedulers.
- Comfort with reading simple latency/throughput metrics (p95/p99, RPS, batch size).
Why this matters
Real tasks you will own:
- Choose whether a fraud detection model should run on every transaction (online) or hourly in bulk (batch).
- Design APIs and data flows that meet strict SLAs without overspending on compute.
- Coordinate offline feature pipelines with online features so predictions match training-time data.
- Plan backfills, re-scores, and A/B tests safely without blocking user-facing traffic.
Concept explained simply
Batch inference runs the model on many records at once on a schedule or trigger and stores the results for later use. Online (real-time) inference runs the model per request and returns a prediction immediately to a caller.
- Batch: High throughput, lower cost per prediction, higher latency (minutes to hours), great for periodic decisions and large volumes.
- Online: Low latency (milliseconds), higher cost per prediction, great for user-facing decisions that change the UX.
Mental model
Pick based on how tightly the prediction must couple to a user action:
- Tight coupling (the user waits) -> Online inference
- Loose coupling (the result can wait) -> Batch inference
Then check constraints:
- Latency budget?
- Freshness needs?
- Cost and traffic patterns?
- Consistency with training features?
- Failure modes and fallbacks?
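To make the mental model concrete, here is a minimal Python sketch of a decision helper. The ServingRequirements fields and the thresholds (500 ms, 1 hour of staleness, 100k predictions/day) are illustrative assumptions, not hard rules.

```python
from dataclasses import dataclass

@dataclass
class ServingRequirements:
    user_waits: bool              # does a caller block on this prediction?
    latency_budget_ms: float      # end-to-end p95 budget
    freshness_tolerance_s: float  # how stale a prediction may be
    daily_volume: int             # predictions per day

def suggest_serving_mode(req: ServingRequirements) -> str:
    """Encode the coupling-first mental model; thresholds are illustrative."""
    if req.user_waits and req.latency_budget_ms <= 500:
        # Tight coupling: the caller blocks on the result.
        return "online"
    if req.freshness_tolerance_s >= 3600 and req.daily_volume > 100_000:
        # Loose coupling at scale: schedule it and store the results.
        return "batch"
    # Mixed constraints often point to precompute-offline, assemble-online.
    return "hybrid (precompute offline, finalize online)"

print(suggest_serving_mode(ServingRequirements(True, 120, 60, 2_000_000)))       # online
print(suggest_serving_mode(ServingRequirements(False, 5000, 86_400, 5_000_000)))  # batch
```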
Decision checklist (use this when choosing)
- Latency budget: What is p95 allowed? (e.g., 100 ms end-to-end)
- Freshness: How old can features/predictions be? (e.g., 1 hour, 1 day)
- Throughput: How many predictions per minute/hour/day?
- Cost: Can you batch to lower compute cost? Any idle time to eliminate?
- Consistency: Do offline and online features match definitions?
- Failure handling: Cached fallback? Default score? Queue and retry?
- Auditability: Do you need a full record of inputs/predictions?
Worked examples
1) Nightly churn risk scores (Batch)
- Goal: Email retention offers next morning.
- Constraints: No user is waiting; millions of users.
- Design: Nightly job reads user features, predicts churn in bulk, writes scores to a table/warehouse and downstream CRM.
- Why batch: Highest throughput and lowest cost; freshness is acceptable (24h).
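A minimal sketch of such a nightly job, assuming a scikit-learn-style classifier exported with joblib and user features stored as Parquet; the paths, column names, and run-date partitioning scheme are hypothetical.

```python
import datetime as dt
import os

import joblib      # assumes the trained model was exported with joblib
import pandas as pd

def score_churn_batch(features_path: str, model_path: str, output_dir: str) -> None:
    """Score all users in bulk and write results partitioned by run date."""
    run_date = dt.date.today().isoformat()
    model = joblib.load(model_path)               # hypothetical model artifact
    features = pd.read_parquet(features_path)     # one row per user
    scores = model.predict_proba(features.drop(columns=["user_id"]))[:, 1]

    out = pd.DataFrame({
        "user_id": features["user_id"],
        "churn_score": scores,
        "run_date": run_date,                     # partition key: reruns overwrite, not append
    })
    partition_dir = os.path.join(output_dir, f"run_date={run_date}")
    os.makedirs(partition_dir, exist_ok=True)
    out.to_parquet(os.path.join(partition_dir, "scores.parquet"), index=False)

# Example call (paths are hypothetical):
# score_churn_batch("data/churn_features.parquet", "models/churn.joblib", "output/churn_scores")
```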
2) Checkout fraud decision (Online)
- Goal: Approve/decline within the checkout request.
- Constraints: p95 < 120 ms end-to-end; high correctness; traffic spikes.
- Design: Low-latency API with warm instances, cached features (e.g., account age, velocity), and circuit breaker with safe defaults.
- Why online: Decision blocks user; must be immediate.
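A sketch of what the low-latency endpoint could look like using FastAPI. The feature-cache lookup, model call, 50 ms feature timeout, and approve-by-default fallback are placeholders that illustrate timeouts and safe defaults, not a production risk policy.

```python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
SAFE_DEFAULT = {"decision": "approve", "score": 0.0, "fallback": True}  # business-approved default

class Transaction(BaseModel):
    account_id: str
    amount: float

async def fetch_cached_features(account_id: str) -> dict:
    # Placeholder for a fast feature-cache lookup (e.g., an in-memory or Redis cache).
    return {"account_age_days": 420, "txn_velocity_1h": 3}

def model_score(features: dict, txn: Transaction) -> float:
    # Placeholder for an in-memory model call.
    return 0.02

@app.post("/fraud/score")
async def score(txn: Transaction):
    try:
        # Keep the feature fetch inside the latency budget; fall back if it times out.
        features = await asyncio.wait_for(fetch_cached_features(txn.account_id), timeout=0.05)
    except asyncio.TimeoutError:
        return SAFE_DEFAULT  # fail open (or closed) per your risk policy
    s = model_score(features, txn)
    return {"decision": "decline" if s > 0.9 else "approve", "score": s, "fallback": False}
```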
3) Recommendations (Hybrid)
- Goal: Personalized product lists on page load.
- Design: Batch precomputes user/item embeddings and nearest-neighbor indexes. Online service uses fresh context (page/category) to select from precomputed candidates.
- Why hybrid: Heavy compute done offline; light personalization done online to meet latency and cost targets.
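A toy sketch of the online half of this hybrid: precomputed candidates (as a batch job would have written them to a cache) get a cheap context-based rerank per request. The candidate scores and category boosts are made-up values.

```python
from typing import Dict, List, Tuple

# Assume a nightly batch job has already written top-N candidates per user,
# loaded here from a cache or key-value store at service startup.
PRECOMPUTED: Dict[str, List[Tuple[str, float]]] = {
    "user_42": [("item_a", 0.91), ("item_b", 0.88), ("item_c", 0.80)],
}

# Illustrative per-category boosts; in practice these might come from config or a small model.
CATEGORY_BOOST = {"electronics": {"item_b": 0.10}, "books": {"item_c": 0.15}}

def recommend(user_id: str, page_category: str, k: int = 2) -> List[str]:
    """Online step: cheap rerank of precomputed candidates using fresh page context."""
    candidates = PRECOMPUTED.get(user_id, [])
    boosts = CATEGORY_BOOST.get(page_category, {})
    reranked = sorted(candidates, key=lambda c: c[1] + boosts.get(c[0], 0.0), reverse=True)
    return [item for item, _ in reranked[:k]]

print(recommend("user_42", "electronics"))  # ['item_b', 'item_a']
```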
4) Demand forecasting (Batch) with alerting (Online-ish)
- Goal: Daily replenishment decisions; alert on sudden anomalies.
- Design: Batch daily forecast; separate lightweight streaming rule-based alert for spikes.
- Why: Main decision doesn't need real-time; alerts do, but not full model inference.
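A minimal sketch of the lightweight streaming rule, assuming hourly demand values arrive one at a time; the window size and spike ratio are illustrative, not tuned.

```python
from collections import deque
from statistics import mean

class SpikeAlert:
    """Rule-based alert: fire when the latest value far exceeds the recent average."""

    def __init__(self, window: int = 24, ratio: float = 3.0):
        self.history = deque(maxlen=window)
        self.ratio = ratio

    def observe(self, demand: float) -> bool:
        # Only alert once there is enough history to form a baseline.
        alert = len(self.history) >= 3 and demand > self.ratio * mean(self.history)
        self.history.append(demand)
        return alert

detector = SpikeAlert()
for hour_demand in [100, 110, 95, 105, 400]:
    if detector.observe(hour_demand):
        print(f"Spike alert: demand={hour_demand}")  # fires on 400
```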
How to choose in 5 steps
- Write the latency budget: Include network, auth, feature fetch, model compute, and serialization (see the budget sketch after this list).
- Define freshness: Feature and prediction staleness tolerances.
- Estimate traffic: RPS for online, volume/batch size for batch; plan p95/p99.
- Map failure modes: Timeouts, partial data, model unavailability; pick fallback.
- Run a small load test: Measure real timings and costs before committing.
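As a concrete example of step 1, here is a toy budget breakdown for a hypothetical 120 ms p95 target; the component timings are assumptions you would replace with measured values.

```python
# Hypothetical component timings for a 120 ms p95 end-to-end budget.
BUDGET_MS = 120
components_ms = {
    "network + TLS": 15,
    "auth": 5,
    "feature fetch (cache hit)": 20,
    "model compute": 40,
    "serialization + response": 10,
}

spent = sum(components_ms.values())
headroom = BUDGET_MS - spent
print(f"spent={spent} ms, headroom={headroom} ms")  # spent=90 ms, headroom=30 ms
assert headroom >= 0, "budget exceeded: shrink a component or relax the SLA"
```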
Architecture templates
Batch template
Scheduler -> Feature Store/Warehouse -> Batch Compute (Spark/Dask/DB proc)
-> Write predictions table -> Downstream sync (CRM, cache, DB)
Key controls: idempotent writes, partitioning by date, lineage logs.
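A minimal sketch of idempotent writes: a stable key of (user_id, run_date) plus upsert semantics means reruns overwrite rather than duplicate rows. Shown with SQLite upsert syntax for portability; the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE predictions (
        user_id  TEXT,
        run_date TEXT,
        score    REAL,
        PRIMARY KEY (user_id, run_date)   -- stable key makes reruns idempotent
    )
""")

rows = [("u1", "2024-01-01", 0.37), ("u2", "2024-01-01", 0.81)]
for _ in range(2):  # simulate a rerun of the same batch job
    conn.executemany(
        """INSERT INTO predictions (user_id, run_date, score) VALUES (?, ?, ?)
           ON CONFLICT (user_id, run_date) DO UPDATE SET score = excluded.score""",
        rows,
    )

print(conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0])  # 2, not 4
```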
Online template
Client -> API Gateway -> Inference Service -> Feature Cache/Store
                                          |-> Model (in-memory)
                                          |-> Metrics/Tracing
Key controls: autoscaling, warm starts, timeouts, fallbacks, canary.
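One of those controls, the fallback path, can be a simple circuit breaker. A minimal sketch, assuming a synchronous predict function and a business-approved safe default; the failure threshold and cooldown values are illustrative.

```python
import time

class CircuitBreaker:
    """After N consecutive failures, serve the safe default for a cooldown
    period instead of calling the model."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def call(self, predict_fn, fallback):
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback        # circuit open: skip the model entirely
            self.failures = 0          # cooldown elapsed: try the model again
        try:
            result = predict_fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback
```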
Hybrid template
Batch: build embeddings/candidates -> Cache/Index
Online: fetch candidates + light rerank -> return
Key controls: index refresh schedule, versioning, feature parity tests.
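A sketch of a feature parity test comparing offline (batch) and online feature values for the same entities; the feature names, sample values, and tolerance are assumptions.

```python
import math

def find_parity_mismatches(batch_rows: dict, online_rows: dict, tol: float = 1e-6) -> list:
    """Return (entity, feature, batch_value, online_value) tuples that disagree.
    Run on a sample of entities in CI or on a schedule."""
    mismatches = []
    for entity_id, batch_features in batch_rows.items():
        online_features = online_rows.get(entity_id, {})
        for name, batch_value in batch_features.items():
            online_value = online_features.get(name)
            if online_value is None or not math.isclose(batch_value, online_value, abs_tol=tol):
                mismatches.append((entity_id, name, batch_value, online_value))
    return mismatches

batch = {"u1": {"account_age_days": 420.0, "txn_velocity_1h": 3.0}}
online = {"u1": {"account_age_days": 420.0, "txn_velocity_1h": 4.0}}
print(find_parity_mismatches(batch, online))  # [('u1', 'txn_velocity_1h', 3.0, 4.0)]
```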
Common mistakes and self-check
- Mistake: Using online inference when the user isn’t waiting. Fix: Move to batch and cache results.
- Mistake: Feature drift between training and serving. Fix: Shared feature definitions and parity tests.
- Mistake: Ignoring tail latency. Fix: Measure p95/p99; set timeouts and fallbacks.
- Mistake: No idempotency in batch writes. Fix: Use stable keys and upserts.
- Mistake: Over-provisioning always-on GPUs. Fix: Right-size and batch, or use hybrid.
Self-check:
- Can you state the exact latency and freshness budgets?
- Do you have a fallback for model or feature store outages?
- Can you reproduce a prediction from logs for auditing?
- Did you verify feature parity across batch and online?
Exercises
Try these on your own first, then compare your answers with the provided solutions.
Exercise 1: Classify scenarios
- A) Weekly credit limit reviews
- B) Content moderation for a live chat message
- C) Pricing suggestions shown after the seller clicks “List item”
- D) Personalized email subject lines sent daily
Decide: Batch, Online, or Hybrid. One sentence of justification each.
Exercise 2: Design the I/O
Design inputs/outputs and latency targets for a real-time fraud check endpoint and a nightly risk score batch job. Include fallback behavior. Check your design against this list:
- [ ] Wrote latency and freshness budgets
- [ ] Picked serving mode and justified it
- [ ] Defined failure modes and fallbacks
- [ ] Sketched metrics to monitor
Practical projects
- Batch: Build a scheduled job that scores a million records, writes to a partitioned table, and supports idempotent reruns.
- Online: Deploy a small inference API with request/response logging, p95 latency SLI, and a cached default fallback.
- Hybrid: Precompute embeddings offline; serve a top-K reranker online; measure the cost and latency delta.
Learning path
- Before: Request/response design, feature stores, and model packaging.
- This lesson: Decide serving mode and design around latency, freshness, and cost.
- After: Autoscaling strategies, canary releases, observability (metrics, tracing, drift alarms).
Next steps
- Run a capacity test for your current service and document p95/p99.
- Add a parity test comparing batch vs online feature values for 1000 entities.
- Implement a simple circuit breaker with a safe default prediction.
Mini challenge
Your product team wants “instant” shipping ETA on the order confirmation page. Features include user address, carrier capacity (refreshes every 5 minutes), and historical delivery times. Latency budget is 150 ms p95. Choose Batch, Online, or Hybrid and justify.
Sample answer
Hybrid: precompute base ETAs in batch from historical delivery times, then fetch fresh carrier capacity online and adjust them per request. This meets the 150 ms budget with light online compute while keeping cost low.