Who this is for
NLP engineers, MLEs, and data engineers who need to decide how to run model inference at scale: in scheduled batches or in real time.
Prerequisites
- Basic understanding of NLP model inputs/outputs (e.g., text in, label or embedding out)
- Familiarity with HTTP APIs, queues, or job schedulers
- Comfort with latency, throughput, and cost trade-offs
Why this matters
Choosing batch vs online inference affects latency, user experience, infrastructure cost, and team operations. As an NLP engineer you will:
- Moderate user content in real time (online)
- Reprocess historical data for new features (batch)
- Generate embeddings for search and recommenders (often batch, queries online)
- Personalize experiences under strict SLAs (usually online)
Concept explained simply
Batch inference runs many items together on a schedule or trigger, optimizing for throughput and cost. Online inference handles each request as it arrives, optimizing for latency and user experience.
Key differences at a glance
- Latency: Batch (minutes-hours) vs Online (milliseconds-seconds)
- Throughput: Batch excels at large volumes; Online optimized for concurrency
- Freshness: Batch is as fresh as the last run; Online is up-to-the-request
- Cost: Batch packs GPUs/CPUs efficiently; Online pays for readiness and spikes
- Complexity: Batch uses schedulers and retries; Online needs autoscaling and tail-latency control
How to choose: decision checklist
- Is an immediate user-visible response required? Choose online.
- Can results wait minutes or hours? Choose batch.
- Do you need to process a backlog or full historical dataset? Choose batch.
- Is your traffic spiky with strict P95/P99 SLAs? Online with autoscaling.
- Do you have tight cost targets with predictable demand? Batch windows.
- Are features only available offline (e.g., data warehouse joins)? Batch.
- Do you need idempotency and dedup guarantees? Both modes can provide them, but the design differs.
- Can you micro-batch (tiny windows) to get the best of both? Consider streaming or micro-batching.
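If it helps to make the checklist concrete, here is a tiny, illustrative helper; the flag names and thresholds are assumptions, not a standard:

```python
def choose_inference_mode(user_waiting: bool,
                          max_delay_s: float,
                          has_backlog: bool,
                          offline_only_features: bool) -> str:
    """Toy encoding of the checklist; thresholds are illustrative."""
    needs_online = user_waiting or max_delay_s < 1
    needs_batch = has_backlog or offline_only_features or max_delay_s >= 60
    if needs_online and needs_batch:
        return "hybrid"        # e.g., batch backfill plus an online path for new items
    if needs_online:
        return "online"
    if needs_batch:
        return "batch"
    return "micro-batch"       # seconds of tolerance: streaming territory

# Example: live chat moderation with a 300 ms budget
print(choose_inference_mode(user_waiting=True, max_delay_s=0.3,
                            has_backlog=False, offline_only_features=False))
# -> online
```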
Micro-batching in practice
Accumulate small groups (e.g., 100-1000 items) for 100-1000 ms, run a vectorized forward pass, and return results with near-real-time latency. This works well for streaming logs and chat moderation where a small delay is tolerable.
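A minimal sketch of that accumulate-then-infer loop, assuming requests arrive on a queue as dicts carrying the text plus a future-like handle for the reply, and that `model.predict` accepts a list of texts (both are assumptions for illustration):

```python
import queue
import time

def microbatch_worker(requests: queue.Queue, model, max_items=256, max_wait_s=0.2):
    """Accumulate requests for up to max_wait_s (or until max_items), then run
    one vectorized forward pass over the whole batch."""
    while True:
        batch = [requests.get()]                      # block until something arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_items:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        preds = model.predict([item["text"] for item in batch])   # one vectorized call
        for item, pred in zip(batch, preds):
            item["reply"].set_result(pred)            # hand each result back to its caller
```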
Worked examples
Example 1: Nightly product title normalization
Goal: Clean and standardize 5 million product titles daily.
- Mode: Batch (no immediate UX impact).
- Window: Nightly 2-hour window.
- Throughput planning: 5,000,000 items / 7,200 s ≈ 694 QPS. If one worker sustains 60 QPS, you need ≈ 12 workers; round up for headroom to 14-16 (see the sizing sketch below).
- Ops: Use a job queue, chunk into ~5,000-item batches, checkpoint progress, and write idempotently.
Why not online?
Online would keep capacity hot 24/7, costing more with no UX benefit.
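The sizing math from the throughput bullet, spelled out; the 60 QPS per worker and 25% headroom figures are the assumptions of this example:

```python
import math

items = 5_000_000
window_s = 2 * 60 * 60               # 2-hour nightly window
per_worker_qps = 60                  # measured single-worker throughput (assumption)

required_qps = items / window_s                          # ≈ 694 QPS
min_workers = math.ceil(required_qps / per_worker_qps)   # 12
with_headroom = math.ceil(min_workers * 1.25)            # 15, inside the 14-16 range

print(f"{required_qps:.0f} QPS -> {min_workers} workers, {with_headroom} with headroom")
```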
Example 2: Real-time toxic comment moderation
Goal: Block abusive comments within 300 ms at P95.
- Mode: Online.
- Budget: End-to-end P95 of 300 ms. Allocate 120-180 ms to the model and the rest to network and business logic.
- Scaling: Estimate peak RPS, set min instances to avoid cold starts, autoscale on concurrency.
- Resilience: Time out at 250 ms and fall back to stricter rules if the model misses its SLA (see the sketch below).
Tail latency tactics
- Batch size of 1, or a tiny micro-batch if latency stays stable
- Warm instances or reserved concurrency
- Circuit breaker to degrade gracefully
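A sketch of the timeout-plus-fallback path, assuming an async `model.is_toxic` call; the blocklist and function names are illustrative only:

```python
import asyncio

BLOCKLIST = {"spam-link", "buy followers"}     # placeholder rules layer, not a real list

def rules_fallback(text: str) -> bool:
    """Cheap deterministic check used when the model misses its latency budget."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

async def moderate(text: str, model, timeout_s: float = 0.25) -> bool:
    """Return True if the comment should be blocked."""
    try:
        return await asyncio.wait_for(model.is_toxic(text), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Degrade gracefully: stricter rules beat blowing the 300 ms SLA.
        return rules_fallback(text)
```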
Example 3: Semantic search embeddings
Goal: Keep document embeddings fresh and support low-latency queries.
- Indexing: Batch-generate embeddings for new/updated documents every hour.
- Queries: Online retrieval with P95 < 200 ms, using the precomputed vectors.
- Backfill: Batch re-embed full corpus after model upgrades.
- Cost: Batch leverages large GPU nodes efficiently; queries use CPU/GPU based on target latency.
Why hybrid wins
Index updates are bulk and predictable (batch). Searches are user-triggered and latency-sensitive (online).
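A minimal sketch of the hybrid split, assuming an `embed` function that maps a list of texts to a NumPy array; persistence and the vector store are simplified to a saved matrix:

```python
import numpy as np

def index_documents(docs, embed, path="doc_vectors.npy"):
    """Hourly batch job: embed new/updated documents and persist the matrix."""
    vectors = np.asarray(embed([d["text"] for d in docs]), dtype=np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    np.save(path, vectors)

def search(query, embed, doc_vectors, top_k=10):
    """Online path: embed one query and score it against the precomputed vectors."""
    q = np.asarray(embed([query]), dtype=np.float32)[0]
    q /= np.linalg.norm(q)
    scores = doc_vectors @ q                  # cosine similarity (rows are unit-normalized)
    return np.argsort(-scores)[:top_k]        # indices of the best-matching documents
```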
Reference patterns
Batch pipeline (step-by-step)
- Trigger: Schedule or file arrival
- Chunking: Split items into batches (e.g., 1024)
- Workers: Pull from the queue and run vectorized inference
- Checkpoint: Record offsets, retries with backoff
- Write: Idempotent upserts
- Report: Throughput, error rate, cost per 1k items
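A skeleton that ties these steps together; `store`, `checkpoint`, and `model` are assumed components, and the upsert is keyed on stable IDs so retries stay idempotent:

```python
import time

def chunks(items, size=1024):
    for start in range(0, len(items), size):
        yield start, items[start:start + size]

def run_batch_job(items, model, store, checkpoint, max_retries=3):
    """Resume from the last checkpoint, process fixed-size chunks, retry with
    exponential backoff, and upsert so reruns do not create duplicates."""
    resume_at = checkpoint.load() or 0
    for start, chunk in chunks(items[resume_at:], size=1024):
        for attempt in range(max_retries):
            try:
                preds = model.predict([x["text"] for x in chunk])          # vectorized inference
                store.upsert({x["id"]: p for x, p in zip(chunk, preds)})   # idempotent write
                checkpoint.save(resume_at + start + len(chunk))            # record progress
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise                          # surface to the scheduler / dead-letter queue
                time.sleep(2 ** attempt)           # backoff before retrying this chunk
```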
Online service (step-by-step)
- Endpoint: Low-latency HTTP/gRPC
- Autoscaling: Based on RPS/concurrency
- Warm pool: Min instances to avoid cold start
- Observability: P50/P95/P99, timeouts
- Fallbacks: Rules or cached responses
- Rate limits: Protect from overload
Key metrics and SLOs
- Latency: P50/P95/P99 (online), window completion time (batch)
- Throughput: QPS/TPS, items per batch, utilization
- Freshness: Age of last processed item, lag
- Reliability: Success rate, retry rate, DLQ size
- Cost: Cost per 1k tokens/items; GPU-hours per run
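Averages hide tail problems, so compute percentiles from raw latencies. A quick check, using synthetic data in place of values collected from your serving layer:

```python
import numpy as np

# latencies_ms would come from your serving layer; synthetic values here.
latencies_ms = np.random.lognormal(mean=4.0, sigma=0.5, size=10_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f} ms  P95={p95:.0f} ms  P99={p99:.0f} ms")
```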
Little’s Law intuition
Average concurrency ≈ Arrival rate × Average service time (L ≈ λW). Helpful for sizing online instances.
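A worked example with assumed numbers (200 RPS peak, 150 ms average service time, one replica comfortable at 8 concurrent requests):

```python
import math

arrival_rate_rps = 200        # expected peak requests per second (assumption)
avg_service_time_s = 0.15     # average time to serve one request (assumption)

concurrency = arrival_rate_rps * avg_service_time_s     # ≈ 30 requests in flight
per_instance = 8                                        # concurrency one replica handles well
instances = math.ceil(concurrency / per_instance)       # 4, before adding headroom

print(concurrency, instances)
```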
Common mistakes and self-check
- Mistake: Forcing online when batch is fine. Self-check: Is there a user waiting?
- Mistake: Ignoring tail latency. Self-check: Monitor P95/P99, not only averages.
- Mistake: No idempotency. Self-check: Can retries create duplicates? Use stable IDs.
- Mistake: Oversized batch windows. Self-check: Freshness acceptable? Reduce window or micro-batch.
- Mistake: No fallback plan. Self-check: What happens on timeout?
Hands-on exercises
Do these now; a suggested solution follows each exercise.
Exercise 1: Pick the right mode
For each scenario, decide Batch, Online, or Hybrid. Justify in one sentence.
- 1) Flag hate speech in live chat with P95 ≤ 200 ms
- 2) Recompute topic labels for 12 months of forum posts
- 3) Generate weekly email subject lines for campaigns
- 4) Update product attribute extraction when schema changes
- 5) Auto-translate support tickets as they arrive; agents wait for result
Solution
Suggested answers: 1) Online; 2) Batch; 3) Batch; 4) Batch (backfill) + Online for new items (Hybrid); 5) Online.
Exercise 2: Throughput sizing
You must process 1.2M documents in 2 hours. One worker sustains 50 QPS. What is the minimum number of workers, and how much safety headroom would you add?
Solution
Required QPS = 1,200,000 / 7,200 ≈ 166.7 QPS. Workers = 166.7 / 50 = 3.33 → 4. Add 20-30% headroom: 5 workers.
Checklist: before you ship
- Chosen mode documented with SLA and cost rationale
- Retry + idempotency strategy defined
- Metrics and alerts for P95/P99 or job lag
- Capacity plan with headroom
- Fallback behavior on timeouts
Practical projects
- Build a nightly sentiment pipeline over the past day's reviews with a progress dashboard.
- Serve a real-time toxicity API with autoscaling and a rules fallback under 300 ms P95.
- Hybrid: Batch-generate embeddings hourly; serve semantic search queries online.
Learning path
- Start: Understand SLAs and traffic patterns
- Then: Prototype both batch and online for a small subset
- Next: Add observability and idempotent writes
- Finally: Optimize cost (batch size, autoscaling, warm pools)
Next steps
- Decide modes for your top 3 NLP use cases
- Implement one batch and one online pipeline
- Add monitoring and run a load test
Mini challenge
You launch a summarization feature in a dashboard. Users click “Summarize” and expect results in under 2 seconds, but you also need daily summaries emailed every morning. Design your mix:
Suggested approach
- Online: On-click summarization with tight timeout and fallback to partial extractive summary
- Batch: Nightly summaries generated and cached for next-day email
- Share model weights; tune batch size and caching separately