Why this matters
As a Machine Learning Engineer, you will move data from sources into features, models, and dashboards. Choosing between batch and streaming affects cost, complexity, freshness, and reliability. You will design training-data generation, feature stores, alerting systems, and real-time inference, so getting this choice right early avoids rework and outages. A few typical decisions:
- Daily model training: batch
- Real-time fraud scoring: streaming
- User behavior dashboards with a 5–10 minute delay: micro-batch or streaming with windows
Who this is for
- New ML Engineers learning data pipelines
- Data Scientists preparing features and training sets
- Backend engineers moving into ML systems
Prerequisites
- Basic Python or SQL familiarity
- Understanding of files, tables, and timestamps
- Comfort with simple ETL concepts (extract, transform, load)
Concept explained simply
Batch processes data in chunks on a schedule (e.g., hourly, daily). Streaming processes events continuously as they arrive.
Quick mental model
- Batch = take a snapshot, clean it, compute results, write outputs. Great for completeness and heavy transforms.
- Streaming = watch an event river and update results as water flows. Great for freshness and alerting.
Core terms
- Latency: how long until data shows up in results.
- Throughput: how much data you can process per unit time.
- Micro-batch: tiny scheduled batches (e.g., every 1–5 minutes). A middle ground.
- Event time vs processing time: event time is when something happened; processing time is when your system saw it.
- Windows (streaming): group events by time ranges to aggregate in real time (illustrated in the sketch after this list).
- Tumbling: fixed, non-overlapping windows (e.g., [12:00–12:05), [12:05–12:10)).
- Sliding: overlapping windows (e.g., 5-min window sliding every 1 min).
- Session: gap-based windows that close after inactivity (e.g., 30s of no events).
- Watermarks: the system’s estimate that events older than a given event time are unlikely to arrive; used to decide when a window can close.
- Delivery semantics: at-most-once (events may be dropped), at-least-once (events may be duplicated), exactly-once (effects are applied as if each event were processed exactly once).
- Backpressure: when downstream can’t keep up, upstream must slow down or buffer.
- Idempotency: repeating the same write has the same final effect (important for retries).
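To make tumbling windows and event time concrete, here is a minimal Python sketch with no streaming framework: it assigns events to 5-minute windows by event_time and sums an amount per window. The events and the tumbling_window_start helper are made up for illustration; real engines such as Flink or Spark Structured Streaming also manage watermarks and late data, which this sketch ignores.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def tumbling_window_start(event_time: datetime, size: timedelta = WINDOW) -> datetime:
    """Floor event_time to the start of its fixed, non-overlapping window."""
    midnight = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    offset = (event_time - midnight).total_seconds() % size.total_seconds()
    return event_time - timedelta(seconds=offset)

# Hypothetical events: (event_time, amount). Grouping is by event_time (when it
# happened), not processing time (when the system saw it), so out-of-order
# arrival does not change the result.
events = [
    (datetime(2024, 1, 1, 12, 0, 5), 10),
    (datetime(2024, 1, 1, 12, 4, 59), 5),
    (datetime(2024, 1, 1, 12, 5, 10), 8),
]

totals = defaultdict(int)
for event_time, amount in events:
    totals[tumbling_window_start(event_time)] += amount

for start in sorted(totals):
    print(f"[{start:%H:%M}-{start + WINDOW:%H:%M}) sum={totals[start]}")
# -> [12:00-12:05) sum=15
#    [12:05-12:10) sum=8
```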
When to choose batch vs streaming
- Choose batch if: you need complete data, heavy joins, daily/weekly outputs, cost-sensitive workloads.
- Choose streaming if: you need freshness (alerts, real-time features), low-latency decisions, continuous metrics.
- Choose micro-batch if: 1–10 minute freshness is fine and you prefer simpler ops.
Reliability checklist for streaming
- Define event time, a late-data policy, and a watermark
- Use idempotent writes or deduplication keys (a minimal sketch follows this checklist)
- Plan for backpressure (buffers, rate limits)
- Define retries and dead-letter handling
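As a sketch of the idempotent-write point, assume each alert carries a stable key (here a hypothetical event_id). With a keyed upsert, a retry under at-least-once delivery leaves one row instead of two. SQLite is used only to keep the example self-contained; the same idea applies to any sink that supports keyed upserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alerts (event_id TEXT PRIMARY KEY, user_id TEXT, score REAL)")

def write_alert(event_id: str, user_id: str, score: float) -> None:
    # Keyed upsert: replaying the same event after a retry does not add a second row.
    conn.execute(
        "INSERT INTO alerts (event_id, user_id, score) VALUES (?, ?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET score = excluded.score",
        (event_id, user_id, score),
    )
    conn.commit()

# At-least-once delivery means the same event may be processed twice.
write_alert("evt-123", "user-9", 0.91)
write_alert("evt-123", "user-9", 0.91)  # duplicate delivery / retry

print(conn.execute("SELECT COUNT(*) FROM alerts").fetchone()[0])  # -> 1
```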
Worked examples
Example 1: Daily sales report (Batch)
Goal: compute yesterday’s total sales per store.
- Trigger: daily at 01:00
- Inputs: transactions table for the previous day
- Ops: clean, join with store table, aggregate
- Output: sales_by_store_date table for BI
- Why batch: complete, stable data; heavy joins; no urgency
What could go wrong?
- Late-arriving transactions: include a small lookback (e.g., reprocess the last 2 days) and dedup by transaction_id, as in the sketch below.
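A minimal sketch of the daily job, using an in-memory SQLite database as a stand-in for the warehouse. The table and column names (transactions, stores, sales_by_store_date) are hypothetical; the points to notice are the 2-day lookback, dedup by transaction_id, and deleting the affected partitions before rewriting them so reruns do not double count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (transaction_id TEXT, store_id TEXT, sale_date TEXT, amount REAL);
CREATE TABLE stores (store_id TEXT PRIMARY KEY, store_name TEXT);
CREATE TABLE sales_by_store_date (sale_date TEXT, store_name TEXT, total_sales REAL);
""")

def run_daily_job(run_date: str, lookback_days: int = 2) -> None:
    lookback = f"-{lookback_days} day"
    # Overwrite policy: recompute the last `lookback_days` partitions to pick up late rows.
    conn.execute("DELETE FROM sales_by_store_date WHERE sale_date >= date(?, ?)",
                 (run_date, lookback))
    conn.execute("""
        INSERT INTO sales_by_store_date (sale_date, store_name, total_sales)
        SELECT t.sale_date, s.store_name, SUM(t.amount)
        FROM (
            -- Dedup: keep one row per transaction_id inside the lookback window.
            SELECT transaction_id, MAX(store_id) AS store_id,
                   MAX(sale_date) AS sale_date, MAX(amount) AS amount
            FROM transactions
            WHERE sale_date >= date(?, ?)
            GROUP BY transaction_id
        ) AS t
        JOIN stores s ON s.store_id = t.store_id
        GROUP BY t.sale_date, s.store_name
    """, (run_date, lookback))
    conn.commit()

conn.execute("INSERT INTO stores VALUES ('s1', 'Store One')")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", [
    ("t1", "s1", "2024-01-02", 20.0),
    ("t1", "s1", "2024-01-02", 20.0),  # duplicate delivery of the same transaction
    ("t2", "s1", "2024-01-02", 15.0),
])
run_daily_job("2024-01-03")
print(conn.execute("SELECT * FROM sales_by_store_date").fetchall())
# -> [('2024-01-02', 'Store One', 35.0)]
```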
Example 2: Fraud alerts within seconds (Streaming)
Goal: flag risky transactions within 2 seconds.
- Trigger: event-driven (each transaction)
- Ops: enrich with recent user/device state; score with the model; emit alerts
- Windows: last 5 minutes to compute velocity features (e.g., transaction count per card; see the sketch after this example)
- Why streaming: low-latency decisioning
Key reliability choices
- Event-time windows + watermark to handle out-of-order events
- At-least-once processing + idempotent alert writes to avoid duplicates
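A minimal sketch of the 5-minute velocity feature, assuming a single process with in-memory state; a production deployment would use a stream processor (e.g., Flink or Spark Structured Streaming) with durable state. The card_velocity function, its event shape, and the 2-minute allowed lateness are illustrative assumptions.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
ALLOWED_LATENESS = timedelta(minutes=2)

recent: dict[str, deque] = defaultdict(deque)  # per-card event times (in-memory state)
watermark = datetime.min                       # highest event_time seen minus allowed lateness

def card_velocity(card_id: str, event_time: datetime) -> int | None:
    """Count of this card's transactions in the last 5 minutes, or None if the event is too late."""
    global watermark
    watermark = max(watermark, event_time - ALLOWED_LATENESS)
    if event_time < watermark:
        return None  # too late to score; send to a dead-letter/correction path instead

    times = recent[card_id]
    times.append(event_time)
    while times and times[0] < event_time - WINDOW:  # evict events outside the window
        times.popleft()
    return len(times)

t0 = datetime(2024, 1, 1, 12, 0, 0)
for offset_s in (0, 60, 200):
    print(card_velocity("card-42", t0 + timedelta(seconds=offset_s)))
# -> 1, 2, 3
```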
Example 3: Product analytics with 5-minute freshness (Micro-batch)
Goal: update dashboard with active users per country every 5 minutes.
- Trigger: every 5 minutes
- Ops: read last N minutes of events; dedup; aggregate
- Why micro-batch: simple, good enough freshness, cost-effective
Edge cases
- Clock skew: rely on event_time and add a small overlap/lookback (e.g., 2 minutes) with deduplication by event_id, as in the sketch below.
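A minimal sketch of one micro-batch tick under these assumptions: a hypothetical read_events(since) returns events with event_id, user_id, and country fields (already filtered to event_time >= since), each run reads an extra 2-minute overlap, and dedup by event_id makes late or re-read events harmless. The resulting counts should still be written idempotently, keyed by window.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(minutes=5)
OVERLAP = timedelta(minutes=2)  # lookback to absorb clock skew and late arrivals

def read_events(since: datetime) -> list[dict]:
    # Stand-in for reading a topic or table; returns events with event_time >= since.
    return [
        {"event_id": "e1", "user_id": "u1", "country": "DE"},
        {"event_id": "e1", "user_id": "u1", "country": "DE"},  # duplicate read due to overlap
        {"event_id": "e2", "user_id": "u2", "country": "DE"},
    ]

def run_tick(now: datetime) -> dict[str, int]:
    since = now - (INTERVAL + OVERLAP)
    seen: set[str] = set()
    users_by_country: dict[str, set] = defaultdict(set)
    for event in read_events(since):
        if event["event_id"] in seen:
            continue  # dedup by event_id within the widened read
        seen.add(event["event_id"])
        users_by_country[event["country"]].add(event["user_id"])
    return {country: len(users) for country, users in users_by_country.items()}

print(run_tick(datetime.now(timezone.utc)))  # -> {'DE': 2}
```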
Example 4: Hybrid for ML features
Goal: serve real-time features while keeping a clean offline history.
- Streaming layer updates the online feature store within seconds.
- Batch layer backfills and corrects features for training data daily.
- Both use the same feature definitions to ensure consistency.
Why hybrid?
Streaming gives freshness; batch gives completeness and a clean way to correct late data or apply historical fixes (the sketch below shows how both layers can share one feature definition).
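A minimal sketch of the shared-definition idea: one pure function computes the feature, and both the streaming update and the nightly batch backfill call it, so online and offline values can only drift because of data, never because of logic. All names here (failure_rate_feature, the stores) are hypothetical.

```python
def failure_rate_feature(failures: int, attempts: int) -> float:
    """Single source of truth for the feature logic, shared by online and offline paths."""
    return failures / attempts if attempts else 0.0

# Online path: a streaming consumer updates the feature within seconds of each event.
online_store: dict[str, float] = {}

def on_event(user_id: str, failures: int, attempts: int) -> None:
    online_store[user_id] = failure_rate_feature(failures, attempts)

# Offline path: a nightly batch recomputes the same feature from complete history
# (including late events) and writes corrected values to the training table.
def backfill(rows: list[tuple[str, int, int]]) -> dict[str, float]:
    return {user_id: failure_rate_feature(f, a) for user_id, f, a in rows}

on_event("u1", 2, 10)                       # near-real-time value: 0.2
corrected = backfill([("u1", 3, 12)])       # late events arrived overnight: 0.25
print(online_store["u1"], corrected["u1"])  # -> 0.2 0.25
```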
Exercises
Note: Everyone can take the exercises and Quick Test. If you log in, your progress is saved.
Exercise 1: Pick batch, micro-batch, or streaming
For each scenario, choose the best processing mode and justify in one sentence:
- Weekly marketing cohort tables.
- Updating a recommendation carousel from clicks and purchases within 30 seconds.
- CEO dashboard that refreshes every 10 minutes.
Tips
- Consider freshness target.
- Consider completeness and heavy joins.
- Consider operational simplicity.
Exercise 2: Windowing and late data policy
We have a 5-minute tumbling window based on event_time starting at 12:00. Allowed lateness: 2 minutes. Windows close at window_end + 2 minutes.
Events (event_time, arrival_time, amount):
- e1: 12:00:05, 12:00:06, 10
- e2: 12:04:59, 12:05:00, 5
- e3: 12:05:10, 12:05:11, 8
- e4: 12:04:30, 12:06:00, 7 (late)
- e5: 12:09:59, 12:10:01, 3
Questions:
- Q1: What are the sums for windows [12:00–12:05) and [12:05–12:10)?
- Q2: Is e4 included or dropped? Why?
- Q3: Write a short SQL-like snippet to compute 5-minute tumbling sums by event_time.
Hint
Window [12:00–12:05) closes at 12:07:00. Window [12:05–12:10) closes at 12:12:00.
Common mistakes (and how to self-check)
- Mistake: Using processing_time for user analytics where clocks and ingestion delays differ. Self-check: Do results shift if ingestion delays change? If yes, switch to event_time (see the sketch after this list).
- Mistake: No dedup keys with at-least-once. Self-check: Do occasional duplicates appear after retries? Add idempotent keys (e.g., event_id or business key).
- Mistake: Closing windows too early. Self-check: Measure your late-data rate; if it exceeds 1–2%, increase allowed lateness or add correction logic.
- Mistake: Ignoring backpressure. Self-check: Does lag grow during peaks? Add buffering, rate limits, or scale consumers.
- Mistake: Reprocessing batch outputs without overwrite policies. Self-check: Do you see double counts after reruns? Use partition overwrite + dedup.
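To see the event-time mistake concretely, the sketch below aggregates the same two made-up events per hour twice, once by processing_time and once by event_time, after an ingestion delay. Only the event_time counts stay stable.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_counts(events: list[dict], time_key: str) -> Counter:
    """Count events per hour using the given timestamp field."""
    return Counter(e[time_key].replace(minute=0, second=0, microsecond=0) for e in events)

base = datetime(2024, 1, 1, 10, 50)
events = [
    # Happened at 10:50, but an ingestion delay means it was processed at 11:05.
    {"event_time": base, "processing_time": base + timedelta(minutes=15)},
    {"event_time": base + timedelta(minutes=5), "processing_time": base + timedelta(minutes=6)},
]

print(hourly_counts(events, "event_time"))       # both events fall in the 10:00 hour
print(hourly_counts(events, "processing_time"))  # the delayed event shifts into 11:00
```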
Practical projects
- Build a daily feature generation batch job: load clean events for the prior day, join with dimensions, produce a training table with stable IDs and dedup logic.
- Create a streaming counter: maintain per-user 5-minute rolling counts with event_time windows and a 2-minute watermark; write idempotently to a key-value store.
- Hybrid correction: stream writes near-real-time features; nightly batch reprocesses last 3 days and corrects counts for late events.
Learning path
- Now: Batch vs streaming basics (this page)
- Next: Data ingestion patterns (files, topics, CDC), schema evolution
- Then: Feature stores and consistency between offline/online
- Advanced: Exactly-once effects, state management, backfills, and reprocessing
Mini challenge
You must power a churn-risk widget on the website showing "+1 risk" when a user has 3+ failures in the last 10 minutes. Average traffic is moderate; peak traffic spikes 3x. You also need daily aggregates for model training.
- Pick a mode for the widget (streaming vs micro-batch) and why.
- Define the windowing (type and size) and lateness policy.
- Explain how you will correct counts the next day for late events.
One possible direction
Streaming with 10-minute tumbling or sliding windows using event_time and a small allowed lateness; idempotent writes. Nightly batch reprocesses last day to correct and backfill training tables.
Next steps
- Complete the exercises above and compare with the solutions.
- Take the Quick Test to check retention (progress saved if logged in).
- Draft a simple pipeline design doc for your use case using the checklist.