
Batch And Streaming Basics

Learn Batch And Streaming Basics for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Machine Learning Engineer, you will move data from sources to features, models, and dashboards. Choosing batch or streaming affects cost, complexity, freshness, and reliability. You will design training data generation, feature stores, alerting systems, and real-time inference. Get this right early to avoid rework and outages.

  • Daily model training: batch
  • Real-time fraud scoring: streaming
  • User behavior dashboards with a 5–10 minute delay: micro-batch or streaming with windows

Who this is for

  • New ML Engineers learning data pipelines
  • Data Scientists preparing features and training sets
  • Backend engineers moving into ML systems

Prerequisites

  • Basic Python or SQL familiarity
  • Understanding of files, tables, and timestamps
  • Comfort with simple ETL concepts (extract, transform, load)

Concept explained simply

Batch processes data in chunks on a schedule (e.g., hourly, daily). Streaming processes events continuously as they arrive.

Quick mental model
  • Batch = take a snapshot, clean it, compute results, write outputs. Great for completeness and heavy transforms.
  • Streaming = watch an event river and update results as water flows. Great for freshness and alerting.

Core terms

  • Latency: how long until data shows up in results.
  • Throughput: how much data you can process per unit time.
  • Micro-batch: tiny scheduled batches (e.g., every 1–5 minutes). A middle ground.
  • Event time vs processing time: event time is when something happened; processing time is when your system saw it.
  • Windows (streaming): group events by time ranges to aggregate in real time.
    • Tumbling: fixed, non-overlapping windows (e.g., [12:00–12:05), [12:05–12:10)).
    • Sliding: overlapping windows (e.g., 5-min window sliding every 1 min).
    • Session: gap-based windows that close after inactivity (e.g., 30s of no events).
  • Watermarks: a system’s guess that events older than a point are unlikely to arrive, used to close windows.
  • Delivery semantics: at-most-once (may drop), at-least-once (may duplicate), exactly-once (effects are as-if once).
  • Backpressure: when downstream can’t keep up, upstream must slow down or buffer.
  • Idempotency: repeating the same write has the same final effect (important for retries).
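The window terms above can be made concrete in a few lines of plain Python. This is a sketch, not any streaming engine's API; the helper name `tumbling_window_start` is ours:

```python
from datetime import datetime, timedelta

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def tumbling_window_start(event_time: datetime) -> datetime:
    """Floor an event time to the start of its 5-minute window."""
    offset = (event_time.minute * 60 + event_time.second) % WINDOW_SECONDS
    return event_time.replace(microsecond=0) - timedelta(seconds=offset)

# Event time (when it happened) decides the window, not processing time
# (when the system saw it):
print(tumbling_window_start(datetime(2026, 1, 1, 12, 7, 30)))
# 2026-01-01 12:05:00
```

A sliding window would assign each event to several overlapping starts; a session window has no fixed start at all, which is why it needs a gap timeout instead.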

When to choose batch vs streaming
  • Choose batch if: you need complete data, heavy joins, daily/weekly outputs, cost-sensitive workloads.
  • Choose streaming if: you need freshness (alerts, real-time features), low-latency decisions, continuous metrics.
  • Choose micro-batch if: 1–10 minute freshness is fine and you prefer simpler ops.

Reliability checklist for streaming
  • Define event-time, late data policy, and watermark
  • Use idempotent writes or deduplication keys
  • Plan for backpressure (buffers, rate limits)
  • Define retries and dead-letter handling
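The second checklist item (idempotent writes under at-least-once delivery) can be sketched like this; `store` stands in for a key-value sink, and the function name is illustrative:

```python
store = {}  # stands in for a key-value sink keyed by a dedup key

def write_alert(event_id: str, payload: dict) -> bool:
    """Return True only on the first write; redeliveries are no-ops."""
    if event_id in store:
        return False  # duplicate delivery, effect already applied
    store[event_id] = payload
    return True

assert write_alert("evt-1", {"risk": "high"}) is True
assert write_alert("evt-1", {"risk": "high"}) is False  # retry deduped
```

Because the sink is keyed by `event_id`, a retry after a crash cannot double-fire the alert, which is what turns at-least-once delivery into exactly-once effects.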

Worked examples

Example 1: Daily sales report (Batch)

Goal: compute yesterday’s total sales per store.

  • Trigger: daily at 01:00
  • Inputs: transactions table for the previous day
  • Ops: clean, join with store table, aggregate
  • Output: sales_by_store_date table for BI
  • Why batch: complete, stable data; heavy joins; no urgency

What could go wrong?
  • Late-arriving transactions: include a small lookback (e.g., reprocess last 2 days) and dedup by transaction_id.
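The lookback-plus-dedup fix can be sketched in plain Python. In a real pipeline this would likely be SQL or a dataframe job; the table and column names here are made up:

```python
from datetime import date

def sales_by_store(transactions, start: date, end: date):
    """Aggregate sales per (store, day) in [start, end], deduplicating
    by transaction_id so a 2-day lookback rerun never double-counts."""
    seen, totals = set(), {}
    for t in transactions:
        if not (start <= t["day"] <= end):
            continue
        if t["transaction_id"] in seen:
            continue  # already counted in this run
        seen.add(t["transaction_id"])
        key = (t["store"], t["day"])
        totals[key] = totals.get(key, 0) + t["amount"]
    return totals

rows = [
    {"transaction_id": "t1", "store": "A", "day": date(2026, 1, 1), "amount": 10},
    {"transaction_id": "t1", "store": "A", "day": date(2026, 1, 1), "amount": 10},  # resent row
    {"transaction_id": "t2", "store": "A", "day": date(2026, 1, 1), "amount": 5},
]
print(sales_by_store(rows, date(2026, 1, 1), date(2026, 1, 2)))
```

Reprocessing two days means some rows are read twice, so the dedup key is what keeps the rerun safe.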

Example 2: Fraud alerts within seconds (Streaming)

Goal: flag risky transactions within 2 seconds.

  • Trigger: event-driven (each transaction)
  • Ops: enrich with recent user/device state; score model; emit alerts
  • Windows: last 5 minutes to compute velocity features (e.g., count per card)
  • Why streaming: low-latency decisioning

Key reliability choices
  • Event-time windows + watermark to handle out-of-order events
  • At-least-once processing + idempotent alert writes to avoid duplicates
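A velocity feature like "transactions per card in the last 5 minutes" can be illustrated with an in-memory deque. A production system would use a streaming engine's windowed state instead; all names here are assumptions:

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
events = defaultdict(deque)  # card_id -> deque of recent event times

def velocity(card_id: str, event_time: datetime) -> int:
    """Record the event and return the trailing 5-minute count for the card."""
    q = events[card_id]
    q.append(event_time)
    while q and event_time - q[0] > WINDOW:
        q.popleft()  # evict events older than the window
    return len(q)
```

Eviction on every event keeps state bounded, which matters when traffic spikes and backpressure is a risk.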

Example 3: Product analytics with 5-minute freshness (Micro-batch)

Goal: update dashboard with active users per country every 5 minutes.

  • Trigger: every 5 minutes
  • Ops: read last N minutes of events; dedup; aggregate
  • Why micro-batch: simple, good enough freshness, cost-effective

Edge cases
  • Clock skew: rely on event_time and add a small overlap/lookback (e.g., 2 minutes) with deduplication by event_id.
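The overlap-plus-dedup pattern for micro-batches can be sketched as follows; each run reads a 7-minute slice (5-minute batch plus 2-minute lookback) and drops event_ids already seen. The function and field names are illustrative:

```python
from datetime import datetime, timedelta

seen_ids = set()  # in practice, persisted between runs (e.g., a state table)

def micro_batch(events, batch_end: datetime,
                size=timedelta(minutes=5), overlap=timedelta(minutes=2)):
    """Keep events in [batch_end - size - overlap, batch_end) whose
    event_id has not been seen by an earlier run."""
    fresh = []
    for e in events:
        if batch_end - size - overlap <= e["event_time"] < batch_end:
            if e["event_id"] not in seen_ids:
                seen_ids.add(e["event_id"])
                fresh.append(e)
    return fresh
```

The overlap catches events delayed by clock skew or slow ingestion; the dedup set is what stops the overlap from double-counting them.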

Example 4: Hybrid for ML features

Goal: serve real-time features while keeping a clean offline history.

  • Streaming layer updates the online feature store within seconds.
  • Batch layer backfills and corrects features for training data daily.
  • Both use the same feature definitions to ensure consistency.

Why hybrid?

Streaming for freshness, batch for completeness and correction of late data or historical fixes.
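One way to make "same feature definitions" concrete is to put the feature logic in a single pure function that both layers call. This sketch is not tied to any feature-store API; the function is hypothetical:

```python
def failure_rate(failures: int, attempts: int) -> float:
    """Shared feature definition: both online and offline layers call this."""
    return failures / attempts if attempts else 0.0

# Streaming path: incremental counters feed the shared definition.
online_value = failure_rate(failures=3, attempts=10)
# Batch path: recomputed from the full day's history, same definition.
offline_value = failure_rate(failures=3, attempts=10)
assert online_value == offline_value
```

If the two layers each reimplement the formula, they drift apart and the model trains on features it will never see at serving time (training/serving skew).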

Exercises

Note: Everyone can take the exercises and Quick Test. If you log in, your progress is saved.

Exercise 1: Pick batch, micro-batch, or streaming

For each scenario, choose the best processing mode and justify in one sentence:

  1. Weekly marketing cohort tables.
  2. Updating a recommendation carousel with clicks and purchases under 30 seconds.
  3. CEO dashboard that refreshes every 10 minutes.

Tips
  • Consider freshness target.
  • Consider completeness and heavy joins.
  • Consider operational simplicity.

Exercise 2: Windowing and late data policy

We have a 5-minute tumbling window based on event_time starting at 12:00. Allowed lateness: 2 minutes. Windows close at window_end + 2 minutes.

Events (event_time, arrival_time, amount):

  • e1: 12:00:05, 12:00:06, 10
  • e2: 12:04:59, 12:05:00, 5
  • e3: 12:05:10, 12:05:11, 8
  • e4: 12:04:30, 12:06:00, 7 (late)
  • e5: 12:09:59, 12:10:01, 3

Questions:

  • Q1: What are the sums for windows [12:00–12:05) and [12:05–12:10)?
  • Q2: Is e4 included or dropped? Why?
  • Q3: Write a short SQL-like snippet to compute 5-minute tumbling sums by event_time.

Hint

Window [12:00–12:05) closes at 12:07:00. Window [12:05–12:10) closes at 12:12:00.
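If you want to check your Q1 and Q2 answers mechanically, a plain-Python simulation of the windows (not any particular engine's API) looks like this:

```python
from datetime import datetime, timedelta

base = datetime(2026, 1, 1)
def t(h, m, s):
    return base.replace(hour=h, minute=m, second=s)

# (event_time, arrival_time, amount) from the exercise
events = [
    (t(12, 0, 5),  t(12, 0, 6),  10),  # e1
    (t(12, 4, 59), t(12, 5, 0),   5),  # e2
    (t(12, 5, 10), t(12, 5, 11),  8),  # e3
    (t(12, 4, 30), t(12, 6, 0),   7),  # e4 (late)
    (t(12, 9, 59), t(12, 10, 1),  3),  # e5
]

WINDOW, LATENESS = timedelta(minutes=5), timedelta(minutes=2)
sums = {}
for event_time, arrival, amount in events:
    offset = (event_time.minute * 60 + event_time.second) % 300
    start = event_time - timedelta(seconds=offset)
    if arrival <= start + WINDOW + LATENESS:  # window not yet closed
        sums[start] = sums.get(start, 0) + amount

print({k.strftime("%H:%M"): v for k, v in sums.items()})
# {'12:00': 22, '12:05': 11}
```

Note that e4 lands in [12:00–12:05) and arrives at 12:06:00, before that window closes at 12:07:00, so it is included rather than dropped.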

Common mistakes (and how to self-check)

  • Mistake: Using processing_time for user analytics where clocks differ. Self-check: Do results shift if ingestion delays change? If yes, switch to event_time.
  • Mistake: No dedup keys with at-least-once. Self-check: Do occasional duplicates appear after retries? Add idempotent keys (e.g., event_id or business key).
  • Mistake: Closing windows too early. Self-check: Compare late-data rate; if >1–2%, increase allowed lateness or add correction logic.
  • Mistake: Ignoring backpressure. Self-check: Does lag grow during peaks? Add buffering, rate limits, or scale consumers.
  • Mistake: Reprocessing batch outputs without overwrite policies. Self-check: Do you see double counts after reruns? Use partition overwrite + dedup.

Practical projects

  • Build a daily feature generation batch job: load clean events for the prior day, join with dimensions, produce a training table with stable IDs and dedup logic.
  • Create a streaming counter: maintain per-user 5-minute rolling counts with event_time windows and a 2-minute watermark; write idempotently to a key-value store.
  • Hybrid correction: stream writes near-real-time features; nightly batch reprocesses last 3 days and corrects counts for late events.

Learning path

  • Now: Batch vs streaming basics (this page)
  • Next: Data ingestion patterns (files, topics, CDC), schema evolution
  • Then: Feature stores and consistency between offline/online
  • Advanced: Exactly-once effects, state management, backfills, and reprocessing

Mini challenge

You must power a churn-risk widget on the website showing "+1 risk" when a user has 3+ failures in the last 10 minutes. Average traffic is moderate; peak traffic spikes 3x. You also need daily aggregates for model training.

  • Pick a mode for the widget (streaming vs micro-batch) and why.
  • Define the windowing (type and size) and lateness policy.
  • Explain how you will correct counts the next day for late events.

One possible direction

Streaming with 10-minute tumbling or sliding windows using event_time and a small allowed lateness; idempotent writes. Nightly batch reprocesses last day to correct and backfill training tables.

Next steps

  1. Complete the exercises above and compare with the solutions.
  2. Take the Quick Test to check retention (progress saved if logged in).
  3. Draft a simple pipeline design doc for your use case using the checklist.

Expected output for Exercise 1

A mapping like 1) Batch, 2) Streaming, 3) Micro-batch, plus a one-sentence justification for each in terms of freshness, completeness, or operational complexity.

Batch And Streaming Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
