Why this matters
As a Machine Learning Engineer, you will move data from sources into features, models, and dashboards. Choosing between batch and streaming affects cost, complexity, freshness, and reliability. You will design training-data generation, feature stores, alerting systems, and real-time inference, so getting this choice right early avoids rework and outages. A few typical decisions:
- Daily model training: batch
- Real-time fraud scoring: streaming
- User behavior dashboards with a 5–10 minute delay: micro-batch or streaming with windows
Who this is for
- New ML Engineers learning data pipelines
- Data Scientists preparing features and training sets
- Backend engineers moving into ML systems
Prerequisites
- Basic Python or SQL familiarity
- Understanding of files, tables, and timestamps
- Comfort with simple ETL concepts (extract, transform, load)
Concept explained simply
Batch processes data in chunks on a schedule (e.g., hourly, daily). Streaming processes events continuously as they arrive.
Quick mental model
- Batch = take a snapshot, clean it, compute results, write outputs. Great for completeness and heavy transforms.
- Streaming = watch an event river and update results as water flows. Great for freshness and alerting.
Core terms
- Latency: how long until data shows up in results.
- Throughput: how much data you can process per unit time.
- Micro-batch: tiny scheduled batches (e.g., every 1–5 minutes). A middle ground.
- Event time vs processing time: event time is when something happened; processing time is when your system saw it.
- Windows (streaming): group events by time ranges to aggregate in real time (illustrated in the sketch after this list).
- Tumbling: fixed, non-overlapping windows (e.g., [12:00–12:05), [12:05–12:10)).
- Sliding: overlapping windows (e.g., 5-min window sliding every 1 min).
- Session: gap-based windows that close after inactivity (e.g., 30s of no events).
- Watermarks: the system’s estimate that events older than a given event time are unlikely to arrive; used to decide when a window can close.
- Delivery semantics: at-most-once (events may be dropped), at-least-once (events may be duplicated), exactly-once (effects are applied as if each event were processed exactly once).
- Backpressure: when downstream can’t keep up, upstream must slow down or buffer.
- Idempotency: repeating the same write has the same final effect (important for retries).
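To make tumbling windows and event time concrete, here is a minimal Python sketch with no streaming framework: it assigns events to 5-minute windows by event_time and sums an amount per window. The events and the tumbling_window_start helper are made up for illustration; real engines such as Flink or Spark Structured Streaming also manage watermarks and late data, which this sketch ignores.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def tumbling_window_start(event_time: datetime, size: timedelta = WINDOW) -> datetime:
    """Floor event_time to the start of its fixed, non-overlapping window."""
    midnight = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    offset = (event_time - midnight).total_seconds() % size.total_seconds()
    return event_time - timedelta(seconds=offset)

# Hypothetical events: (event_time, amount). Grouping is by event_time (when it
# happened), not processing time (when the system saw it), so out-of-order
# arrival does not change the result.
events = [
    (datetime(2024, 1, 1, 12, 0, 5), 10),
    (datetime(2024, 1, 1, 12, 4, 59), 5),
    (datetime(2024, 1, 1, 12, 5, 10), 8),
]

totals = defaultdict(int)
for event_time, amount in events:
    totals[tumbling_window_start(event_time)] += amount

for start in sorted(totals):
    print(f"[{start:%H:%M}-{start + WINDOW:%H:%M}) sum={totals[start]}")
# -> [12:00-12:05) sum=15
#    [12:05-12:10) sum=8
```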
When to choose batch vs streaming
- Choose batch if: you need complete data, heavy joins, daily/weekly outputs, cost-sensitive workloads.
- Choose streaming if: you need freshness (alerts, real-time features), low-latency decisions, continuous metrics.
- Choose micro-batch if: 1–10 minute freshness is fine and you prefer simpler ops.
Reliability checklist for streaming
- Define event time, a late-data policy, and a watermark
- Use idempotent writes or deduplication keys (a minimal sketch follows this checklist)
- Plan for backpressure (buffers, rate limits)
- Define retries and dead-letter handling
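As a sketch of the idempotent-write point, assume each alert carries a stable key (here a hypothetical event_id). With a keyed upsert, a retry under at-least-once delivery leaves one row instead of two. SQLite is used only to keep the example self-contained; the same idea applies to any sink that supports keyed upserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alerts (event_id TEXT PRIMARY KEY, user_id TEXT, score REAL)")

def write_alert(event_id: str, user_id: str, score: float) -> None:
    # Keyed upsert: replaying the same event after a retry does not add a second row.
    conn.execute(
        "INSERT INTO alerts (event_id, user_id, score) VALUES (?, ?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET score = excluded.score",
        (event_id, user_id, score),
    )
    conn.commit()

# At-least-once delivery means the same event may be processed twice.
write_alert("evt-123", "user-9", 0.91)
write_alert("evt-123", "user-9", 0.91)  # duplicate delivery / retry

print(conn.execute("SELECT COUNT(*) FROM alerts").fetchone()[0])  # -> 1
```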
Worked examples
Example 1: Daily sales report (Batch)
Goal: compute yesterday’s total sales per store.
- Trigger: daily at 01:00
- Inputs: transactions table for the previous day
- Ops: clean, join with store table, aggregate
- Output: sales_by_store_date table for BI
- Why batch: complete, stable data; heavy joins; no urgency
What could go wrong?
- Late-arriving transactions: include a small lookback (e.g., reprocess the last 2 days) and dedup by transaction_id, as in the sketch below.
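A minimal sketch of the daily job, using an in-memory SQLite database as a stand-in for the warehouse. The table and column names (transactions, stores, sales_by_store_date) are hypothetical; the points to notice are the 2-day lookback, dedup by transaction_id, and deleting the affected partitions before rewriting them so reruns do not double count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (transaction_id TEXT, store_id TEXT, sale_date TEXT, amount REAL);
CREATE TABLE stores (store_id TEXT PRIMARY KEY, store_name TEXT);
CREATE TABLE sales_by_store_date (sale_date TEXT, store_name TEXT, total_sales REAL);
""")

def run_daily_job(run_date: str, lookback_days: int = 2) -> None:
    lookback = f"-{lookback_days} day"
    # Overwrite policy: recompute the last `lookback_days` partitions to pick up late rows.
    conn.execute("DELETE FROM sales_by_store_date WHERE sale_date >= date(?, ?)",
                 (run_date, lookback))
    conn.execute("""
        INSERT INTO sales_by_store_date (sale_date, store_name, total_sales)
        SELECT t.sale_date, s.store_name, SUM(t.amount)
        FROM (
            -- Dedup: keep one row per transaction_id inside the lookback window.
            SELECT transaction_id, MAX(store_id) AS store_id,
                   MAX(sale_date) AS sale_date, MAX(amount) AS amount
            FROM transactions
            WHERE sale_date >= date(?, ?)
            GROUP BY transaction_id
        ) AS t
        JOIN stores s ON s.store_id = t.store_id
        GROUP BY t.sale_date, s.store_name
    """, (run_date, lookback))
    conn.commit()

conn.execute("INSERT INTO stores VALUES ('s1', 'Store One')")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", [
    ("t1", "s1", "2024-01-02", 20.0),
    ("t1", "s1", "2024-01-02", 20.0),  # duplicate delivery of the same transaction
    ("t2", "s1", "2024-01-02", 15.0),
])
run_daily_job("2024-01-03")
print(conn.execute("SELECT * FROM sales_by_store_date").fetchall())
# -> [('2024-01-02', 'Store One', 35.0)]
```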
Example 2: Fraud alerts within seconds (Streaming)
Goal: flag risky transactions within 2 seconds.
- Trigger: event-driven (each transaction)
- Ops: enrich with recent user/device state; score with the model; emit alerts
- Windows: last 5 minutes to compute velocity features (e.g., transaction count per card; see the sketch after this example)
- Why streaming: low-latency decisioning
Key reliability choices
- Event-time windows + watermark to handle out-of-order events
- At-least-once processing + idempotent alert writes to avoid duplicates
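A minimal sketch of the 5-minute velocity feature, assuming a single process with in-memory state; a production deployment would use a stream processor (e.g., Flink or Spark Structured Streaming) with durable state. The card_velocity function, its event shape, and the 2-minute allowed lateness are illustrative assumptions.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
ALLOWED_LATENESS = timedelta(minutes=2)

recent: dict[str, deque] = defaultdict(deque)  # per-card event times (in-memory state)
watermark = datetime.min                       # highest event_time seen minus allowed lateness

def card_velocity(card_id: str, event_time: datetime) -> int | None:
    """Count of this card's transactions in the last 5 minutes, or None if the event is too late."""
    global watermark
    watermark = max(watermark, event_time - ALLOWED_LATENESS)
    if event_time < watermark:
        return None  # too late to score; send to a dead-letter/correction path instead

    times = recent[card_id]
    times.append(event_time)
    while times and times[0] < event_time - WINDOW:  # evict events outside the window
        times.popleft()
    return len(times)

t0 = datetime(2024, 1, 1, 12, 0, 0)
for offset_s in (0, 60, 200):
    print(card_velocity("card-42", t0 + timedelta(seconds=offset_s)))
# -> 1, 2, 3
```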
Example 3: Product analytics with 5-minute freshness (Micro-batch)
Goal: update dashboard with active users per country every 5 minutes.
- Trigger: every 5 minutes
- Ops: read last N minutes of events; dedup; aggregate
- Why micro-batch: simple, good enough freshness, cost-effective
Edge cases
- Clock skew: rely on event_time and add a small overlap/lookback (e.g., 2 minutes) with deduplication by event_id, as in the sketch below.
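A minimal sketch of one micro-batch tick under these assumptions: a hypothetical read_events(since) returns events with event_id, user_id, and country fields (already filtered to event_time >= since), each run reads an extra 2-minute overlap, and dedup by event_id makes late or re-read events harmless. The resulting counts should still be written idempotently, keyed by window.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(minutes=5)
OVERLAP = timedelta(minutes=2)  # lookback to absorb clock skew and late arrivals

def read_events(since: datetime) -> list[dict]:
    # Stand-in for reading a topic or table; returns events with event_time >= since.
    return [
        {"event_id": "e1", "user_id": "u1", "country": "DE"},
        {"event_id": "e1", "user_id": "u1", "country": "DE"},  # duplicate read due to overlap
        {"event_id": "e2", "user_id": "u2", "country": "DE"},
    ]

def run_tick(now: datetime) -> dict[str, int]:
    since = now - (INTERVAL + OVERLAP)
    seen: set[str] = set()
    users_by_country: dict[str, set] = defaultdict(set)
    for event in read_events(since):
        if event["event_id"] in seen:
            continue  # dedup by event_id within the widened read
        seen.add(event["event_id"])
        users_by_country[event["country"]].add(event["user_id"])
    return {country: len(users) for country, users in users_by_country.items()}

print(run_tick(datetime.now(timezone.utc)))  # -> {'DE': 2}
```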
Example 4: Hybrid for ML features
Goal: serve real-time features while keeping a clean offline history.
- Streaming layer updates the online feature store within seconds.
- Batch layer backfills and corrects features for training data daily.
- Both use the same feature definitions to ensure consistency.
Why hybrid?
Streaming gives freshness; batch gives completeness and a clean way to correct late data or apply historical fixes (the sketch below shows how both layers can share one feature definition).
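A minimal sketch of the shared-definition idea: one pure function computes the feature, and both the streaming update and the nightly batch backfill call it, so online and offline values can only drift because of data, never because of logic. All names here (failure_rate_feature, the stores) are hypothetical.

```python
def failure_rate_feature(failures: int, attempts: int) -> float:
    """Single source of truth for the feature logic, shared by online and offline paths."""
    return failures / attempts if attempts else 0.0

# Online path: a streaming consumer updates the feature within seconds of each event.
online_store: dict[str, float] = {}

def on_event(user_id: str, failures: int, attempts: int) -> None:
    online_store[user_id] = failure_rate_feature(failures, attempts)

# Offline path: a nightly batch recomputes the same feature from complete history
# (including late events) and writes corrected values to the training table.
def backfill(rows: list[tuple[str, int, int]]) -> dict[str, float]:
    return {user_id: failure_rate_feature(f, a) for user_id, f, a in rows}

on_event("u1", 2, 10)                       # near-real-time value: 0.2
corrected = backfill([("u1", 3, 12)])       # late events arrived overnight: 0.25
print(online_store["u1"], corrected["u1"])  # -> 0.2 0.25
```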
Exercises
Note: Everyone can take the exercises and Quick Test. If you log in, your progress is saved.
Exercise 1: Pick batch, micro-batch, or streaming
For each scenario, choose the best processing mode and justify in one sentence:
- Weekly marketing cohort tables.
- Updating a recommendation carousel from clicks and purchases within 30 seconds.
- CEO dashboard that refreshes every 10 minutes.
Tips
- Consider freshness target.
- Consider completeness and heavy joins.
- Consider operational simplicity.
Exercise 2: Windowing and late data policy
We have a 5-minute tumbling window based on event_time starting at 12:00. Allowed lateness: 2 minutes. Windows close at window_end + 2 minutes.
Events (event_time, arrival_time, amount):
- e1: 12:00:05, 12:00:06, 10
- e2: 12:04:59, 12:05:00, 5
- e3: 12:05:10, 12:05:11, 8
- e4: 12:04:30, 12:06:00, 7 (late)
- e5: 12:09:59, 12:10:01, 3
Questions:
- Q1: What are the sums for windows [12:00–12:05) and [12:05–12:10)?
- Q2: Is e4 included or dropped? Why?
- Q3: Write a short SQL-like snippet to compute 5-minute tumbling sums by event_time.
Hint
Window [12:00–12:05) closes at 12:07:00. Window [12:05–12:10) closes at 12:12:00.
Common mistakes (and how to self-check)
- Mistake: Using processing_time for user analytics where clocks and ingestion delays differ. Self-check: Do results shift if ingestion delays change? If yes, switch to event_time (see the sketch after this list).
- Mistake: No dedup keys with at-least-once. Self-check: Do occasional duplicates appear after retries? Add idempotent keys (e.g., event_id or business key).
- Mistake: Closing windows too early. Self-check: Measure your late-data rate; if it exceeds 1–2%, increase allowed lateness or add correction logic.
- Mistake: Ignoring backpressure. Self-check: Does lag grow during peaks? Add buffering, rate limits, or scale consumers.
- Mistake: Reprocessing batch outputs without overwrite policies. Self-check: Do you see double counts after reruns? Use partition overwrite + dedup.
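To see the event-time mistake concretely, the sketch below aggregates the same two made-up events per hour twice, once by processing_time and once by event_time, after an ingestion delay. Only the event_time counts stay stable.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_counts(events: list[dict], time_key: str) -> Counter:
    """Count events per hour using the given timestamp field."""
    return Counter(e[time_key].replace(minute=0, second=0, microsecond=0) for e in events)

base = datetime(2024, 1, 1, 10, 50)
events = [
    # Happened at 10:50, but an ingestion delay means it was processed at 11:05.
    {"event_time": base, "processing_time": base + timedelta(minutes=15)},
    {"event_time": base + timedelta(minutes=5), "processing_time": base + timedelta(minutes=6)},
]

print(hourly_counts(events, "event_time"))       # both events fall in the 10:00 hour
print(hourly_counts(events, "processing_time"))  # the delayed event shifts into 11:00
```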
Practical projects
- Build a daily feature generation batch job: load clean events for the prior day, join with dimensions, produce a training table with stable IDs and dedup logic.
- Create a streaming counter: maintain per-user 5-minute rolling counts with event_time windows and a 2-minute watermark; write idempotently to a key-value store.
- Hybrid correction: stream writes near-real-time features; nightly batch reprocesses last 3 days and corrects counts for late events.
Learning path
- Now: Batch vs streaming basics (this page)
- Next: Data ingestion patterns (files, topics, CDC), schema evolution
- Then: Feature stores and consistency between offline/online
- Advanced: Exactly-once effects, state management, backfills, and reprocessing
Mini challenge
You must power a churn-risk widget on the website showing "+1 risk" when a user has 3+ failures in the last 10 minutes. Average traffic is moderate; peak traffic spikes 3x. You also need daily aggregates for model training.
- Pick a mode for the widget (streaming vs micro-batch) and why.
- Define the windowing (type and size) and lateness policy.
- Explain how you will correct counts the next day for late events.
One possible direction
Streaming with 10-minute tumbling or sliding windows using event_time and a small allowed lateness; idempotent writes. Nightly batch reprocesses last day to correct and backfill training tables.
Next steps
- Complete the exercises above and compare with the solutions.
- Take the Quick Test to check retention (progress saved if logged in).
- Draft a simple pipeline design doc for your use case using the checklist.