
Batch Versus Online Inference

Learn Batch Versus Online Inference for free with explanations, exercises, and a quick test, written for NLP engineers.

Published: January 5, 2026 | Updated: January 5, 2026

Who this is for

NLP engineers, MLEs, and data engineers who need to decide how to run model inference at scale: in scheduled batches or in real time.

Prerequisites

  • Basic understanding of NLP model inputs/outputs (e.g., text in, label or embedding out)
  • Familiarity with HTTP APIs, queues, or job schedulers
  • Comfort with latency, throughput, and cost trade-offs

Why this matters

Choosing batch vs online inference affects latency, user experience, infrastructure cost, and team operations. As an NLP Engineer you will:

  • Moderate user content in real time (online)
  • Reprocess historical data for new features (batch)
  • Generate embeddings for search and recommenders (often batch, queries online)
  • Personalize experiences under strict SLAs (usually online)

Concept explained simply

Batch inference runs many items together on a schedule or trigger, optimizing throughput and cost. Online inference handles each request as it arrives, optimizing latency and user experience.

Mental model: Batch is a conveyor belt that moves when full or at set times; online is a dedicated cashier serving you immediately. Batch is cheap per item and steady; online is responsive but needs always-on capacity.
Key differences at a glance
  • Latency: Batch (minutes-hours) vs Online (milliseconds-seconds)
  • Throughput: Batch excels at large volumes; Online scales via concurrency and replicas
  • Freshness: Batch is as fresh as the last run; Online is up-to-the-request
  • Cost: Batch packs GPUs/CPUs efficiently; Online pays for readiness and spikes
  • Complexity: Batch uses schedulers and retries; Online needs autoscaling and tail-latency control

How to choose: decision checklist

  • Is an immediate user-visible response required? Choose online.
  • Can results wait minutes or hours? Choose batch.
  • Do you need to process a backlog or full historical dataset? Choose batch.
  • Is your traffic spiky with strict P95/P99 SLAs? Online with autoscaling.
  • Do you have tight cost targets with predictable demand? Batch windows.
  • Are features only available offline (e.g., data warehouse joins)? Batch.
  • Do you need idempotency and dedup guarantees? Both can provide them, but the design differs.
  • Can you micro-batch (tiny windows) to get the best of both? Consider streaming/micro-batch.
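
The checklist can be folded into a rough decision helper. This is only an illustration of the priority order, with hypothetical flag names, and not a substitute for the cost and SLA judgment calls above.

    def choose_inference_mode(
        user_waiting: bool,          # is someone blocked on the result right now?
        near_real_time_ok: bool,     # would a 100-1000 ms micro-batch window be acceptable?
        backlog_or_backfill: bool,   # full historical dataset or large backlog to process?
        latency_tolerance_s: float,  # how long results may wait
    ) -> str:
        """Rough priority order from the checklist; real decisions also weigh cost and traffic shape."""
        if user_waiting and near_real_time_ok:
            return "micro-batch / streaming"
        if user_waiting:
            return "online"
        if backlog_or_backfill or latency_tolerance_s >= 60:
            return "batch"
        return "online"
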
Micro-batching in practice

Accumulate small groups (e.g., 100-1000 items) for 100-1000 ms, run a vectorized forward pass, and return results with near-real-time latency. Great for streaming logs and chat moderation with a small latency tolerance, as in the sketch below.
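
A minimal sketch of the accumulate-then-flush loop, using only the Python standard library; predict_batch and the (text, reply) item shape are illustrative assumptions rather than any specific framework's API.

    import queue
    import time

    def predict_batch(texts):
        """Hypothetical stand-in for a vectorized model forward pass (not a real library call)."""
        return [{"text": t, "toxic": False} for t in texts]

    def micro_batch_loop(requests: queue.Queue, max_items: int = 256, max_wait_s: float = 0.2):
        """Collect up to max_items, or wait at most max_wait_s, then run one vectorized pass."""
        while True:
            text, reply = requests.get()           # block until the first item arrives
            batch = [(text, reply)]
            deadline = time.monotonic() + max_wait_s
            while len(batch) < max_items:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(requests.get(timeout=remaining))
                except queue.Empty:
                    break
            results = predict_batch([t for t, _ in batch])
            for (_, reply), result in zip(batch, results):
                reply(result)                      # e.g., resolve a future or write to a response queue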

Worked examples

Example 1: Nightly product title normalization

Goal: Clean and standardize 5 million product titles daily.

  1. Mode: Batch (no immediate UX impact).
  2. Window: Nightly 2-hour window.
  3. Throughput planning: 5,000,000 items / 7,200 s ≈ 694 QPS. If one worker handles 60 QPS, you need ≈ 12 workers (round up for headroom: 14-16); see the sizing sketch after this list.
  4. Ops: Use a job queue, chunk into 5k-item batches, checkpoint progress, and write idempotently.
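
A few lines make that arithmetic and the headroom explicit; the 60 QPS-per-worker figure is the example's assumption, not a measured number.

    import math

    items = 5_000_000
    window_s = 2 * 60 * 60             # nightly 2-hour window
    per_worker_qps = 60                # assumed sustained throughput of one worker

    required_qps = items / window_s                          # ≈ 694 items/s
    min_workers = math.ceil(required_qps / per_worker_qps)   # 12
    with_headroom = math.ceil(min_workers * 1.25)            # 15, within the 14-16 range above
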
Why not online?

Online would keep capacity hot 24/7, costing more with no UX benefit.

Example 2: Real-time toxic comment moderation

Goal: Block abusive comments within 300 ms at P95.

  1. Mode: Online.
  2. Budget: End-to-end P95 of 300 ms. Allocate 120-180 ms for the model and the rest for network and business logic.
  3. Scaling: Estimate peak RPS, set min instances to avoid cold starts, autoscale on concurrency.
  4. Resilience: Timeouts at 250 ms; fall back to stricter rules if the model misses its SLA (see the sketch after the tactics list below).
Tail latency tactics
  • Batch size of 1, or a tiny micro-batch if traffic is stable
  • Warm instances or reserved concurrency
  • Circuit breaker to degrade gracefully
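
A sketch of the timeout-plus-fallback shape from step 4, using the standard library; classify_comment and the blocklist fallback are hypothetical placeholders, and a real service would lean on its serving framework's timeout and circuit-breaker support.

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    executor = ThreadPoolExecutor(max_workers=8)
    BLOCKLIST = {"example_slur"}                  # placeholder terms for the stricter rule-based fallback

    def classify_comment(text: str) -> bool:
        """Hypothetical model call; returns True if the comment should be blocked."""
        return False                              # plug in the real model here

    def rules_fallback(text: str) -> bool:
        return any(term in text.lower() for term in BLOCKLIST)

    def moderate(text: str, timeout_s: float = 0.250) -> bool:
        future = executor.submit(classify_comment, text)
        try:
            return future.result(timeout=timeout_s)   # stay inside the 300 ms end-to-end budget
        except TimeoutError:
            future.cancel()                           # don't keep waiting on the slow call
            return rules_fallback(text)               # degrade gracefully to stricter rules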

Example 3: Semantic search embeddings

Goal: Keep document embeddings fresh and support low-latency queries.

  1. Indexing: Batch-generate embeddings for new/updated documents every hour.
  2. Queries: Online retrieval with P95 < 200 ms, using the precomputed vectors.
  3. Backfill: Batch re-embed full corpus after model upgrades.
  4. Cost: Batch leverages large GPU nodes efficiently; queries use CPU/GPU based on target latency.
Why hybrid wins

Index updates are bulk and predictable (batch). Searches are user-triggered and latency-sensitive (online).
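
A compact sketch of that split; embed() is a hypothetical embedding call, an in-memory NumPy array stands in for a real vector index, and the online path uses brute-force cosine similarity.

    import numpy as np

    def embed(texts):
        """Hypothetical embedding call; returns an (n, d) float32 array."""
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(texts), 384), dtype=np.float32)

    # Batch side: hourly job embeds new/updated documents and refreshes the index.
    doc_ids, doc_texts = ["a", "b"], ["first document", "second document"]
    doc_vecs = embed(doc_texts)
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)   # normalize once, offline

    # Online side: per-query embedding plus lookup against the precomputed vectors.
    def search(query: str, k: int = 5):
        q = embed([query])[0]
        q /= np.linalg.norm(q)
        scores = doc_vecs @ q                     # cosine similarity on normalized vectors
        top = np.argsort(-scores)[:k]
        return [(doc_ids[i], float(scores[i])) for i in top]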

Reference patterns

Batch pipeline (step-by-step)
  1. Trigger: Schedule or file arrival
  2. Chunking: Split items into batches (e.g., 1024)
  3. Workers: Pull from the queue, run vectorized inference
  4. Checkpoint: Record offsets, retries with backoff
  5. Write: Idempotent upserts
  6. Report: Throughput, error rate, cost per 1k items
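
A compressed sketch of steps 2-5; predict_batch and upsert are hypothetical stand-ins for the real model and store, with simple exponential backoff for retries.

    import time

    def predict_batch(texts):                  # hypothetical vectorized model call
        return [len(t) for t in texts]

    def upsert(rows):                          # hypothetical idempotent write keyed by stable item IDs
        pass

    def run_batch(items, chunk_size=1024, max_retries=3):
        """items: list of (stable_id, text) pairs. Processes chunks with retry/backoff and checkpoints."""
        done_offsets = []                                      # step 4: record progress for restarts
        for start in range(0, len(items), chunk_size):
            chunk = items[start:start + chunk_size]
            for attempt in range(max_retries):
                try:
                    preds = predict_batch([text for _, text in chunk])
                    upsert(list(zip((item_id for item_id, _ in chunk), preds)))   # step 5: idempotent upsert
                    done_offsets.append(start)
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        raise                                  # or route the chunk to a dead-letter queue
                    time.sleep(2 ** attempt)                   # backoff: 1 s, 2 s, 4 s, ...
        return done_offsets
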
Online service (step-by-step)
  1. Endpoint: Low-latency HTTP/gRPC
  2. Autoscaling: Based on RPS/concurrency
  3. Warm pool: Min instances to avoid cold start
  4. Observability: P50/P95/P99, timeouts
  5. Fallbacks: Rules or cached responses
  6. Rate limits: Protect from overload
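
Steps 4-6 can be sketched in a framework-agnostic way; model_predict and cached_or_rules_fallback are hypothetical placeholders, and a real deployment would rely on its serving stack for autoscaling, metrics export, and rate limiting.

    import threading
    import time

    MAX_CONCURRENT = 32                          # crude overload protection (step 6)
    _slots = threading.BoundedSemaphore(MAX_CONCURRENT)
    _latencies_ms = []                           # export these to your metrics system (step 4)

    def model_predict(text):                     # hypothetical model call behind the endpoint
        return {"label": "ok", "fallback": False}

    def cached_or_rules_fallback(text):          # hypothetical cheap fallback path (step 5)
        return {"label": "unknown", "fallback": True}

    def handle_request(text):
        if not _slots.acquire(blocking=False):   # shed load instead of queueing unboundedly
            return cached_or_rules_fallback(text)
        start = time.monotonic()
        try:
            return model_predict(text)
        finally:
            _slots.release()
            _latencies_ms.append((time.monotonic() - start) * 1000)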

Key metrics and SLOs

  • Latency: P50/P95/P99 (online), window completion time (batch)
  • Throughput: QPS/TPS, items per batch, utilization
  • Freshness: Age of last processed item, lag
  • Reliability: Success rate, retry rate, DLQ size
  • Cost: Cost per 1k tokens/items; GPU-hours per run
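
One way to turn raw measurements into these numbers with the standard library; the latency samples and cost figures below are made-up illustrations.

    import statistics

    latencies_ms = [42, 51, 48, 200, 55, 63, 47, 950, 52, 49]   # assumed per-request samples
    cuts = statistics.quantiles(latencies_ms, n=100)            # 99 cut points
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]

    run_cost_usd = 12.40                # assumed GPU-hours x hourly rate for one batch run
    items_processed = 5_000_000
    cost_per_1k_items = run_cost_usd / items_processed * 1000   # ≈ $0.0025 per 1k items
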
Little’s Law intuition

Average concurrency ≈ Arrival rate × Average service time. Helpful for sizing online instances.
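
A worked instance of that rule with assumed numbers:

    import math

    # Little's Law: average concurrency ≈ arrival rate × average service time.
    arrival_rps = 400                  # assumed peak requests per second
    service_time_s = 0.120             # assumed average per-request latency

    avg_concurrency = arrival_rps * service_time_s                     # 48 requests in flight on average
    per_replica_concurrency = 8                                        # assumed safe concurrency per replica
    replicas = math.ceil(avg_concurrency / per_replica_concurrency)    # 6 replicas before headroom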

Common mistakes and self-check

  • Mistake: Forcing online when batch is fine. Self-check: Is there a user waiting?
  • Mistake: Ignoring tail latency. Self-check: Monitor P95/P99, not only averages.
  • Mistake: No idempotency. Self-check: Can retries create duplicates? Use stable IDs.
  • Mistake: Oversized batch windows. Self-check: Freshness acceptable? Reduce window or micro-batch.
  • Mistake: No fallback plan. Self-check: What happens on timeout?

Hands-on exercises

Do these now; a suggested solution follows each exercise. They mirror the practice exercises at the end of the article.

Exercise 1: Pick the right mode

For each scenario, decide Batch, Online, or Hybrid. Justify in one sentence.

  • 1) Flag hate speech in live chat with P95 ≤ 200 ms
  • 2) Recompute topic labels for 12 months of forum posts
  • 3) Generate weekly email subject lines for campaigns
  • 4) Update product attribute extraction when schema changes
  • 5) Auto-translate support tickets as they arrive; agents wait for result
Solution

Suggested answers: 1) Online; 2) Batch; 3) Batch; 4) Batch (backfill) + Online for new items (Hybrid); 5) Online.

Exercise 2: Throughput sizing

You must process 1.2M documents in 2 hours. One worker sustains 50 QPS. What is the minimum number of workers? How much safety headroom would you add?

Solution

Required QPS = 1,200,000 / 7,200 ≈ 166.7 QPS. Workers = 166.7 / 50 = 3.33 → 4. Add 20-30% headroom: 5 workers.

Checklist: before you ship
  • Chosen mode documented with SLA and cost rationale
  • Retry + idempotency strategy defined
  • Metrics and alerts for P95/P99 or job lag
  • Capacity plan with headroom
  • Fallback behavior on timeouts

Practical projects

  • Build a nightly sentiment pipeline over the past day's reviews with a progress dashboard.
  • Serve a real-time toxicity API with autoscaling and a rules fallback under 300 ms P95.
  • Hybrid: Batch-generate embeddings hourly; serve semantic search queries online.

Learning path

  • Start: Understand SLAs and traffic patterns
  • Then: Prototype both batch and online for a small subset
  • Next: Add observability and idempotent writes
  • Finally: Optimize cost (batch size, autoscaling, warm pools)

Next steps

  • Decide modes for your top 3 NLP use cases
  • Implement one batch and one online pipeline
  • Add monitoring and run a load test

Mini challenge

You launch a summarization feature in a dashboard. Users click “Summarize” and expect results in under 2 seconds, but you also need daily summaries emailed every morning. Design your mix:

Suggested approach
  • Online: On-click summarization with tight timeout and fallback to partial extractive summary
  • Batch: Nightly summaries generated and cached for next-day email
  • Share model weights; tune batch size and caching separately


Practice Exercises

2 exercises to complete

Instructions

Decide the best inference mode for each scenario and justify briefly.

  • 1) Live chat abuse detection (P95 ≤ 200 ms)
  • 2) Backfill language detection for 2 years of logs
  • 3) Weekly topic digest for newsletters
  • 4) Re-embed entire document corpus after model upgrade
  • 5) Translate support tickets as agents wait
Expected Output
A list of 5 decisions (Batch/Online/Hybrid) with one-sentence justifications.

Batch Versus Online Inference — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.

