Who this is for
This lesson is for aspiring and practicing Data Engineers who want to design reliable, low-latency streaming data pipelines for dashboards, ML features, alerts, and automations.
Prerequisites
- Comfort with batch data flows (ETL/ELT)
- Basic understanding of messaging systems (topics, partitions, consumers)
- Familiarity with SQL and one streaming framework conceptually (e.g., Flink, Spark Structured Streaming, Beam)
Why this matters
Real tasks you will face as a Data Engineer:
- Power product dashboards with second-level latency
- Continuously compute aggregates for personalization or pricing
- Feed real-time features to ML models
- Detect anomalies and trigger alerts
- Move data from event streams to serving stores and data lakes safely
Design decisions (windows, semantics, partitioning, backpressure handling) determine reliability, cost, and user experience.
Concept explained simply
A real-time pipeline continuously ingests events, transforms them as they arrive, and outputs results with controlled latency and correctness. Every design trades off latency, cost, and how strict your correctness guarantees need to be.
Mental model
Imagine a factory conveyor:
- Source: events hop onto the belt (e.g., Kafka topic)
- Stations: parse, validate, enrich, aggregate
- Quality gates: schema checks, deduplication, DLQs (dead-letter queues)
- Pack and ship: results go to sinks (caches, databases, lakehouse)
- Safety systems: checkpoints, retries, idempotency
- Speed control: backpressure and autoscaling
Core design steps
- Clarify outcome and latency: metrics, alerts, or features? Target end-to-end latency (e.g., 2s, 30s, 5m).
- Define correctness: at-most-once, at-least-once, or exactly-once. Plan idempotency and dedup.
- Choose event-time strategy: watermarks, allowed lateness, and window type (tumbling, hopping, sliding, session).
- Model data contracts: schema format (Avro/Protobuf/JSON), evolution strategy, validation.
- Partitioning and keys: select keys for ordering and parallelism; estimate partition count from peak throughput (a sizing sketch follows this list).
- State & storage: choose state stores and checkpointing cadence; plan state TTLs and compaction.
- Failure paths: retries, DLQ, poison pill handling, backfills/replays.
- Observability: metrics (lag, watermark, error rate), logs, alerts, lineage.
- Security & governance: PII, access control, encryption, retention.
- Cost & scaling: autoscaling policy, resource quotas, backpressure strategy.
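For the partition-sizing step, here is a minimal back-of-the-envelope sketch in Python; the throughput figures and headroom factor are hypothetical placeholders:

```python
import math

def size_partitions(peak_events_per_sec: float,
                    events_per_consumer_per_sec: float,
                    headroom: float = 1.5) -> int:
    """Estimate partition count from peak throughput.

    Assumes roughly one consumer instance per partition at peak, plus a
    headroom multiplier for bursts and future growth.
    """
    required = peak_events_per_sec * headroom / events_per_consumer_per_sec
    return math.ceil(required)

# Hypothetical numbers: 120k events/s at peak, 5k events/s per consumer.
partitions = size_partitions(120_000, 5_000)
print(partitions)  # 36 -> round up to a convenient count (e.g., 48)
```

Rounding up to a slightly larger count (48 rather than 36) leaves extra headroom and divides evenly across more consumer-group sizes, which makes rebalancing easier.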
Common window types (quick reference)
- Tumbling: fixed, non-overlapping (e.g., 5m)
- Hopping (sliding with a fixed step): overlapping windows of size S that advance by step R, where R < S
- Session: grouped by inactivity gaps, great for user sessions
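As a concrete illustration of these window types, here is a minimal sketch in PySpark Structured Streaming (one of the frameworks named in the prerequisites); the `rate` source and column names are stand-ins for a real event stream:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, session_window, col

spark = SparkSession.builder.appName("window-types").getOrCreate()

# Stand-in stream with an event_time and a synthetic user_id.
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "event_time")
          .withColumn("user_id", col("value") % 100)
          .withWatermark("event_time", "10 minutes"))

# Tumbling: fixed, non-overlapping 5-minute buckets.
tumbling = events.groupBy(window("event_time", "5 minutes")).count()

# Hopping: 1-minute windows that advance every 5 seconds (overlapping).
hopping = events.groupBy(window("event_time", "1 minute", "5 seconds")).count()

# Session: a user's window closes after 30 minutes of inactivity.
sessions = (events
            .groupBy(session_window("event_time", "30 minutes"), col("user_id"))
            .count())
```

Each variable is a streaming DataFrame you would hand to `writeStream` with an appropriate output mode and sink.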
Delivery semantics (quick reference)
- At-most-once: minimal latency, possible data loss
- At-least-once: duplicates possible; combine with idempotent writes
- Exactly-once: strongest guarantee; typically higher cost/complexity
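A tiny, framework-free sketch of why at-least-once plus idempotent handling is usually enough; the message shape and in-memory dedup set are illustrative stand-ins for a real broker and a persistent dedup store:

```python
# Simulated at-least-once delivery: the broker may redeliver a message.
deliveries = [
    {"id": "evt-1", "amount": 10},
    {"id": "evt-2", "amount": 5},
    {"id": "evt-1", "amount": 10},  # duplicate redelivery
]

processed_ids = set()  # dedup table; in production, a persistent store with a TTL
total = 0

for msg in deliveries:
    if msg["id"] in processed_ids:
        continue  # idempotent: a replayed message changes nothing
    processed_ids.add(msg["id"])
    total += msg["amount"]

assert total == 15  # duplicates did not inflate the result
```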
Worked examples
Example 1: Clickstream dashboard (latency & windows)
Goal: show active users per minute with updates every 5 seconds.
- Latency target: under 10s
- Windows: hopping 1m window, slide 5s
- Event-time: watermark at 2m to tolerate late events; allowed lateness 1m to update late counts
- Semantics: at-least-once + idempotent sink (upserts by window_start, user_country)
- Partitions: key by user_id for even spread; 48 partitions for peak load
- Sinks: cache (for UI) + object storage (for backfill)
Why hopping windows?
Hopping windows give near-continuous updates for a longer aggregate (1 minute) while sliding every 5 seconds for freshness. Users see trends without waiting for a full minute.
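A minimal sketch of this pipeline in PySpark Structured Streaming; the broker address, topic name, and schema are assumptions, and the console sink stands in for the cache and object-storage sinks. Note that Spark folds allowed lateness into the watermark rather than exposing it as a separate setting:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, from_json, approx_count_distinct
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-dashboard").getOrCreate()

# Assumed event schema for the "clicks" topic.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("user_country", StringType()),
    StructField("event_time", TimestampType()),
])

clicks = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

active_users = (clicks
    # Watermark bounds how long state is kept while waiting for late events.
    .withWatermark("event_time", "2 minutes")
    # Hopping window: a 1-minute aggregate refreshed every 5 seconds.
    .groupBy(window("event_time", "1 minute", "5 seconds"), col("user_country"))
    .agg(approx_count_distinct("user_id").alias("active_users")))

# Update mode plus a sink that upserts by (window_start, user_country)
# keeps results idempotent under retries; console is only a stand-in.
query = (active_users.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/chk/clickstream")
         .start())
```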
Example 2: Payment fraud scoring (consistency & idempotency)
Goal: score transactions and trigger alerts with minimal false negatives.
- Latency target: under 2s
- Semantics: at-least-once processing; sink uses idempotent writes keyed by transaction_id
- State: session windows per card_id to compute velocity features
- Watermark: tight (30s) because decisions must be quick; late events still enrich features later
- Failure: retries with exponential backoff, DLQ after N failures
Why not exactly-once?
Operationally, at-least-once with dedup is simpler and fast enough. Alerts tolerate duplicate deliveries as long as the consumer side is idempotent (an alert for an already-seen transaction_id is ignored).
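A minimal sketch of the dedup step from this example; the `rate` source stands in for the parsed payments stream, and including the event-time column in the dedup key lets the watermark expire old dedup state (PySpark):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("fraud-dedup").getOrCreate()

# Stand-in for the parsed payments stream; in practice this comes from Kafka.
transactions = (spark.readStream.format("rate").load()
                .withColumnRenamed("timestamp", "event_time")
                .withColumn("transaction_id", col("value").cast("string"))
                .withColumn("amount", col("value") * 1.0))

deduped = (transactions
    # Tight 30-second watermark: decisions must be quick, and it bounds how
    # long dedup state per transaction_id is kept.
    .withWatermark("event_time", "30 seconds")
    # At-least-once upstream + dropping duplicates here = no double scoring.
    .dropDuplicates(["transaction_id", "event_time"]))

# Scoring, session-window velocity features, and the alert sink would follow.
```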
Example 3: IoT anomaly detection (ordering & backpressure)
Goal: detect temperature spikes per device with 10s latency.
- Keying: device_id to keep order per device
- Window: tumbling 10s, with per-device running stats
- Backpressure plan: cap source read rate; autoscale consumers; shed non-critical enrichments first
- Observability: track consumer lag, watermark delay, anomaly rate; alert when lag grows for 5 minutes
Handling out-of-order events
Use event-time processing and watermarks; buffer small lateness (e.g., 15s). Late events beyond allowed lateness go to a correction path or are logged for audit.
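A minimal sketch of the per-device aggregation with a capped source read rate; the broker address, topic, schema, and the per-trigger cap are assumptions (PySpark):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, from_json, avg, stddev
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("iot-anomaly").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "device-temps")
            # Crude backpressure guard: cap how much is read per micro-batch.
            .option("maxOffsetsPerTrigger", 50000)
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

per_device = (readings
    # Small lateness buffer; events later than this are dropped from the
    # aggregate and should be handled by a separate correction path.
    .withWatermark("event_time", "15 seconds")
    # Tumbling 10-second windows keyed by device_id: per-device stats,
    # computed in parallel across devices.
    .groupBy(window("event_time", "10 seconds"), col("device_id"))
    .agg(avg("temperature").alias("avg_temp"),
         stddev("temperature").alias("stddev_temp")))
```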
Key design choices to get right
- Event-time vs processing-time: prefer event-time for correctness; use processing-time for simple, low-risk counters
- Watermarks & lateness: balance freshness with completeness
- Partitioning: enough partitions for peak; key for affinity and ordering
- State growth: TTLs and compaction prevent unbounded state
- Idempotency: upsert keys, exactly-once sinks, or dedup tables
- Schema evolution: enforce compatibility; reject or route bad payloads to DLQ (see the routing sketch after this list)
- Replay strategy: make reprocessing deterministic; control side effects
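A minimal sketch of the validate-or-route-to-DLQ gate mentioned above; the topic names, broker address, and JSON schema are assumptions (PySpark):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-gate").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "orders")
       .load()
       .withColumn("parsed", from_json(col("value").cast("string"), schema)))

# Payloads that fail to parse (or miss a required field) come back as NULLs.
valid = raw.where(col("parsed.order_id").isNotNull()).select("parsed.*")
invalid = raw.where(col("parsed.order_id").isNull()).select("value")

# Route bad payloads to a dead-letter topic for triage instead of failing the job.
dlq = (invalid.writeStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("topic", "orders-dlq")
       .option("checkpointLocation", "/tmp/chk/orders-dlq")
       .start())
```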
Go-live checklist
- [ ] Defined latency SLO and measured end-to-end
- [ ] Selected semantics and documented idempotency
- [ ] Windowing and watermarks validated with late data
- [ ] Partitions sized for peak with headroom
- [ ] Backpressure plan and autoscaling tested
- [ ] DLQ with sampling alert and triage playbook
- [ ] Schema validation and compatibility tests
- [ ] Observability: lag, watermark, error rate, throughput, costs
- [ ] Security: PII handling, encryption, access controls
Exercises
Complete the tasks below. A worked solution is available in each exercise. Do the work first, then compare.
- ex1 — Surge pricing features (design)
Design a pipeline to compute a real-time demand index per zone for a ride-sharing app. Use event-time, handle late events, and produce both a cache feed and a long-term store. See the full task in the Exercises panel below.
- ex2 — Throughput sizing (math)
Given peak events and per-consumer capacity, compute the partition count and parallelism. See the full task in the Exercises panel below.
A quick self-check for your exercise designs:
- [ ] Wrote down latency SLO and semantics
- [ ] Chose keys and partition count
- [ ] Defined windowing and watermark policy
- [ ] Planned DLQ and retries
- [ ] Picked sinks with idempotency
Common mistakes and self-check
- No event-time or watermarks: results drift when late events arrive. Self-check: compare event-time vs processing-time results over a day.
- Too few partitions: persistent lag at peak. Self-check: peak lag grows even after scaling consumers.
- No idempotency: duplicates inflate metrics. Self-check: simulate replays; totals should not double.
- Unbounded state: memory/RocksDB bloat. Self-check: verify TTLs and compaction; chart state size over time.
- Skipping schema validation: downstream breaks silently. Self-check: introduce a bad payload; ensure it routes to DLQ with an alert.
- Over-tight watermarks: frequent late-data corrections. Self-check: track late ratio; adjust allowed lateness.
Practical projects
- Build a real-time MAU counter with hopping windows; expose metrics via a simple API
- Implement a DLQ triage job that samples messages and summarizes top error reasons hourly
- Create a replay tool to backfill a day of events deterministically without duplicate side effects
Learning path
- Before: event brokers, message formats, and basic streaming primitives
- Now: designing real-time pipelines (this lesson)
- Next: stateful processing patterns, watermark tuning, exactly-once sinks, and serving-store design
Next steps
- Complete the exercises below
- Take the Quick Test to check retention (available to everyone; logged-in users get saved progress)
- Pick one Practical project and implement it with a small dataset
Mini challenge
You must compute per-customer 15-minute spend with updates every 30 seconds, tolerate 2 minutes of lateness, and avoid double-counting on replays. In one paragraph, specify: window type and slide, watermark and allowed lateness, delivery semantics, dedup/idempotency strategy, partition key, and sinks. Keep it concise and actionable.