Who this is for
NLP engineers, MLEs, and data engineers who need to decide how to run model inference at scale: in scheduled batches or in real time.
Prerequisites
- Basic understanding of NLP model inputs/outputs (e.g., text in, label or embedding out)
- Familiarity with HTTP APIs, queues, or job schedulers
- Comfort with latency, throughput, and cost trade-offs
Why this matters
Choosing batch vs online inference affects latency, user experience, infrastructure cost, and team operations. As an NLP engineer you will:
- Moderate user content in real time (online)
- Reprocess historical data for new features (batch)
- Generate embeddings for search and recommenders (often batch, queries online)
- Personalize experiences under strict SLAs (usually online)
Concept explained simply
Batch inference runs many items together on a schedule or trigger, optimizing for throughput and cost. Online inference handles each request as it arrives, optimizing for latency and user experience.
Key differences at a glance
- Latency: Batch (minutes-hours) vs Online (milliseconds-seconds)
- Throughput: Batch excels at large volumes; Online optimized for concurrency
- Freshness: Batch is as fresh as the last run; Online is up-to-the-request
- Cost: Batch packs GPUs/CPUs efficiently; Online pays for readiness and spikes
- Complexity: Batch uses schedulers and retries; Online needs autoscaling and tail-latency control
How to choose: decision checklist
- Is an immediate user-visible response required? Choose online.
- Can results wait minutes or hours? Choose batch.
- Do you need to process a backlog or full historical dataset? Choose batch.
- Is your traffic spiky with strict P95/P99 SLAs? Online with autoscaling.
- Do you have tight cost targets with predictable demand? Batch windows.
- Are features only available offline (e.g., data warehouse joins)? Batch.
- Do you need idempotency and dedup guarantees? Both modes can provide them, but the design differs.
- Can you micro-batch (tiny windows) to get the best of both? Consider streaming or micro-batching.
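If it helps to make the checklist concrete, here is a tiny, illustrative helper; the flag names and thresholds are assumptions, not a standard:

```python
def choose_inference_mode(user_waiting: bool,
                          max_delay_s: float,
                          has_backlog: bool,
                          offline_only_features: bool) -> str:
    """Toy encoding of the checklist; thresholds are illustrative."""
    needs_online = user_waiting or max_delay_s < 1
    needs_batch = has_backlog or offline_only_features or max_delay_s >= 60
    if needs_online and needs_batch:
        return "hybrid"        # e.g., batch backfill plus an online path for new items
    if needs_online:
        return "online"
    if needs_batch:
        return "batch"
    return "micro-batch"       # seconds of tolerance: streaming territory

# Example: live chat moderation with a 300 ms budget
print(choose_inference_mode(user_waiting=True, max_delay_s=0.3,
                            has_backlog=False, offline_only_features=False))
# -> online
```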
Micro-batching in practice
Accumulate small groups (e.g., 100-1000 items) for 100-1000 ms, run a vectorized forward pass, and return results with near-real-time latency. This works well for streaming logs and chat moderation where a small delay is tolerable.
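A minimal sketch of that accumulate-then-infer loop, assuming requests arrive on a queue as dicts carrying the text plus a future-like handle for the reply, and that `model.predict` accepts a list of texts (both are assumptions for illustration):

```python
import queue
import time

def microbatch_worker(requests: queue.Queue, model, max_items=256, max_wait_s=0.2):
    """Accumulate requests for up to max_wait_s (or until max_items), then run
    one vectorized forward pass over the whole batch."""
    while True:
        batch = [requests.get()]                      # block until something arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_items:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        preds = model.predict([item["text"] for item in batch])   # one vectorized call
        for item, pred in zip(batch, preds):
            item["reply"].set_result(pred)            # hand each result back to its caller
```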
Worked examples
Example 1: Nightly product title normalization
Goal: Clean and standardize 5 million product titles daily.
- Mode: Batch (no immediate UX impact).
- Window: Nightly 2-hour window.
- Throughput planning: 5,000,000 items / 7,200 s ≈ 694 QPS. If one worker sustains 60 QPS, you need ≈ 12 workers; round up for headroom to 14-16 (see the sizing sketch below).
- Ops: Use a job queue, chunk into ~5,000-item batches, checkpoint progress, and write idempotently.
Why not online?
Online would keep capacity hot 24/7, costing more with no UX benefit.
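The sizing math from the throughput bullet, spelled out; the 60 QPS per worker and 25% headroom figures are the assumptions of this example:

```python
import math

items = 5_000_000
window_s = 2 * 60 * 60               # 2-hour nightly window
per_worker_qps = 60                  # measured single-worker throughput (assumption)

required_qps = items / window_s                          # ≈ 694 QPS
min_workers = math.ceil(required_qps / per_worker_qps)   # 12
with_headroom = math.ceil(min_workers * 1.25)            # 15, inside the 14-16 range

print(f"{required_qps:.0f} QPS -> {min_workers} workers, {with_headroom} with headroom")
```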
Example 2: Real-time toxic comment moderation
Goal: Block abusive comments within 300 ms at P95.
- Mode: Online.
- Budget: End-to-end P95 of 300 ms. Allocate 120-180 ms to the model and the rest to network and business logic.
- Scaling: Estimate peak RPS, set min instances to avoid cold starts, autoscale on concurrency.
- Resilience: Time out at 250 ms and fall back to stricter rules if the model misses its SLA (see the sketch below).
Tail latency tactics
- Batch size of 1, or a tiny micro-batch if latency stays stable
- Warm instances or reserved concurrency
- Circuit breaker to degrade gracefully
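A sketch of the timeout-plus-fallback path, assuming an async `model.is_toxic` call; the blocklist and function names are illustrative only:

```python
import asyncio

BLOCKLIST = {"spam-link", "buy followers"}     # placeholder rules layer, not a real list

def rules_fallback(text: str) -> bool:
    """Cheap deterministic check used when the model misses its latency budget."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

async def moderate(text: str, model, timeout_s: float = 0.25) -> bool:
    """Return True if the comment should be blocked."""
    try:
        return await asyncio.wait_for(model.is_toxic(text), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Degrade gracefully: stricter rules beat blowing the 300 ms SLA.
        return rules_fallback(text)
```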
Example 3: Semantic search embeddings
Goal: Keep document embeddings fresh and support low-latency queries.
- Indexing: Batch-generate embeddings for new/updated documents every hour.
- Queries: Online retrieval with P95 < 200 ms, using the precomputed vectors.
- Backfill: Batch re-embed full corpus after model upgrades.
- Cost: Batch leverages large GPU nodes efficiently; queries use CPU/GPU based on target latency.
Why hybrid wins
Index updates are bulk and predictable (batch). Searches are user-triggered and latency-sensitive (online).
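A minimal sketch of the hybrid split, assuming an `embed` function that maps a list of texts to a NumPy array; persistence and the vector store are simplified to a saved matrix:

```python
import numpy as np

def index_documents(docs, embed, path="doc_vectors.npy"):
    """Hourly batch job: embed new/updated documents and persist the matrix."""
    vectors = np.asarray(embed([d["text"] for d in docs]), dtype=np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    np.save(path, vectors)

def search(query, embed, doc_vectors, top_k=10):
    """Online path: embed one query and score it against the precomputed vectors."""
    q = np.asarray(embed([query]), dtype=np.float32)[0]
    q /= np.linalg.norm(q)
    scores = doc_vectors @ q                  # cosine similarity (rows are unit-normalized)
    return np.argsort(-scores)[:top_k]        # indices of the best-matching documents
```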
Reference patterns
Batch pipeline (step-by-step)
- Trigger: Schedule or file arrival
- Chunking: Split items into batches (e.g., 1024)
- Workers: Pull from the queue and run vectorized inference
- Checkpoint: Record offsets, retries with backoff
- Write: Idempotent upserts
- Report: Throughput, error rate, cost per 1k items
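A skeleton that ties these steps together; `store`, `checkpoint`, and `model` are assumed components, and the upsert is keyed on stable IDs so retries stay idempotent:

```python
import time

def chunks(items, size=1024):
    for start in range(0, len(items), size):
        yield start, items[start:start + size]

def run_batch_job(items, model, store, checkpoint, max_retries=3):
    """Resume from the last checkpoint, process fixed-size chunks, retry with
    exponential backoff, and upsert so reruns do not create duplicates."""
    resume_at = checkpoint.load() or 0
    for start, chunk in chunks(items[resume_at:], size=1024):
        for attempt in range(max_retries):
            try:
                preds = model.predict([x["text"] for x in chunk])          # vectorized inference
                store.upsert({x["id"]: p for x, p in zip(chunk, preds)})   # idempotent write
                checkpoint.save(resume_at + start + len(chunk))            # record progress
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise                          # surface to the scheduler / dead-letter queue
                time.sleep(2 ** attempt)           # backoff before retrying this chunk
```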
Online service (step-by-step)
- Endpoint: Low-latency HTTP/gRPC
- Autoscaling: Based on RPS/concurrency
- Warm pool: Min instances to avoid cold start
- Observability: P50/P95/P99, timeouts
- Fallbacks: Rules or cached responses
- Rate limits: Protect from overload
Key metrics and SLOs
- Latency: P50/P95/P99 (online), window completion time (batch)
- Throughput: QPS/TPS, items per batch, utilization
- Freshness: Age of last processed item, lag
- Reliability: Success rate, retry rate, DLQ size
- Cost: Cost per 1k tokens/items; GPU-hours per run
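Averages hide tail problems, so compute percentiles from raw latencies. A quick check, using synthetic data in place of values collected from your serving layer:

```python
import numpy as np

# latencies_ms would come from your serving layer; synthetic values here.
latencies_ms = np.random.lognormal(mean=4.0, sigma=0.5, size=10_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f} ms  P95={p95:.0f} ms  P99={p99:.0f} ms")
```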
Little’s Law intuition
Average concurrency ≈ Arrival rate × Average service time (L ≈ λW). Helpful for sizing online instances.
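A worked example with assumed numbers (200 RPS peak, 150 ms average service time, one replica comfortable at 8 concurrent requests):

```python
import math

arrival_rate_rps = 200        # expected peak requests per second (assumption)
avg_service_time_s = 0.15     # average time to serve one request (assumption)

concurrency = arrival_rate_rps * avg_service_time_s     # ≈ 30 requests in flight
per_instance = 8                                        # concurrency one replica handles well
instances = math.ceil(concurrency / per_instance)       # 4, before adding headroom

print(concurrency, instances)
```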
Common mistakes and self-check
- Mistake: Forcing online when batch is fine. Self-check: Is there a user waiting?
- Mistake: Ignoring tail latency. Self-check: Monitor P95/P99, not only averages.
- Mistake: No idempotency. Self-check: Can retries create duplicates? Use stable IDs.
- Mistake: Oversized batch windows. Self-check: Freshness acceptable? Reduce window or micro-batch.
- Mistake: No fallback plan. Self-check: What happens on timeout?
Hands-on exercises
Do these now; a suggested solution follows each exercise.
Exercise 1: Pick the right mode
For each scenario, decide Batch, Online, or Hybrid. Justify in one sentence.
- 1) Flag hate speech in live chat with P95 ≤ 200 ms
- 2) Recompute topic labels for 12 months of forum posts
- 3) Generate weekly email subject lines for campaigns
- 4) Update product attribute extraction when schema changes
- 5) Auto-translate support tickets as they arrive; agents wait for result
Solution
Suggested answers: 1) Online; 2) Batch; 3) Batch; 4) Batch (backfill) + Online for new items (Hybrid); 5) Online.
Exercise 2: Throughput sizing
You must process 1.2M documents in 2 hours. One worker sustains 50 QPS. What is the minimum number of workers, and how much safety headroom would you add?
Solution
Required QPS = 1,200,000 / 7,200 ≈ 166.7 QPS. Workers = 166.7 / 50 = 3.33 → 4. Add 20-30% headroom: 5 workers.
Checklist: before you ship
- Chosen mode documented with SLA and cost rationale
- Retry + idempotency strategy defined
- Metrics and alerts for P95/P99 or job lag
- Capacity plan with headroom
- Fallback behavior on timeouts
Practical projects
- Build a nightly sentiment pipeline over the past day's reviews with a progress dashboard.
- Serve a real-time toxicity API with autoscaling and a rules fallback under 300 ms P95.
- Hybrid: Batch-generate embeddings hourly; serve semantic search queries online.
Learning path
- Start: Understand SLAs and traffic patterns
- Then: Prototype both batch and online for a small subset
- Next: Add observability and idempotent writes
- Finally: Optimize cost (batch size, autoscaling, warm pools)
Next steps
- Decide modes for your top 3 NLP use cases
- Implement one batch and one online pipeline
- Add monitoring and run a load test
Mini challenge
You launch a summarization feature in a dashboard. Users click “Summarize” and expect results in under 2 seconds, but you also need daily summaries emailed every morning. Design your mix:
Suggested approach
- Online: On-click summarization with tight timeout and fallback to partial extractive summary
- Batch: Nightly summaries generated and cached for next-day email
- Share model weights; tune batch size and caching separately