Who this is for
This lesson is for aspiring and practicing Data Engineers who want to design reliable, low-latency streaming data pipelines for dashboards, ML features, alerts, and automations.
Prerequisites
- Comfort with batch data flows (ETL/ELT)
- Basic understanding of messaging systems (topics, partitions, consumers)
- Familiarity with SQL and one streaming framework conceptually (e.g., Flink, Spark Structured Streaming, Beam)
Why this matters
Real tasks you will face as a Data Engineer:
- Power product dashboards with second-level latency
- Continuously compute aggregates for personalization or pricing
- Feed real-time features to ML models
- Detect anomalies and trigger alerts
- Move data from event streams to serving stores and data lakes safely
Design decisions (windows, semantics, partitioning, backpressure handling) determine reliability, cost, and user experience.
Concept explained simply
A real-time pipeline continuously ingests events, transforms them as they arrive, and outputs results with controlled latency and correctness. Every design trades off latency, cost, and how strict your correctness guarantees need to be.
Mental model
Imagine a factory conveyor:
- Source: events hop onto the belt (e.g., Kafka topic)
- Stations: parse, validate, enrich, aggregate
- Quality gates: schema checks, deduplication, DLQs (dead-letter queues)
- Pack and ship: results go to sinks (caches, databases, lakehouse)
- Safety systems: checkpoints, retries, idempotency
- Speed control: backpressure and autoscaling
Core design steps
- Clarify outcome and latency: metrics, alerts, or features? Target end-to-end latency (e.g., 2s, 30s, 5m).
- Define correctness: at-most-once, at-least-once, or exactly-once. Plan idempotency and dedup.
- Choose event-time strategy: watermarks, allowed lateness, and window type (tumbling, hopping, sliding, session).
- Model data contracts: schema format (Avro/Protobuf/JSON), evolution strategy, validation.
- Partitioning and keys: select keys for ordering and parallelism; estimate partition count from peak throughput (a sizing sketch follows this list).
- State & storage: choose state stores and checkpointing cadence; plan state TTLs and compaction.
- Failure paths: retries, DLQ, poison pill handling, backfills/replays.
- Observability: metrics (lag, watermark, error rate), logs, alerts, lineage.
- Security & governance: PII, access control, encryption, retention.
- Cost & scaling: autoscaling policy, resource quotas, backpressure strategy.
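For the partition-sizing step, here is a minimal back-of-the-envelope sketch in Python; the throughput figures and headroom factor are hypothetical placeholders:

```python
import math

def size_partitions(peak_events_per_sec: float,
                    events_per_consumer_per_sec: float,
                    headroom: float = 1.5) -> int:
    """Estimate partition count from peak throughput.

    Assumes roughly one consumer instance per partition at peak, plus a
    headroom multiplier for bursts and future growth.
    """
    required = peak_events_per_sec * headroom / events_per_consumer_per_sec
    return math.ceil(required)

# Hypothetical numbers: 120k events/s at peak, 5k events/s per consumer.
partitions = size_partitions(120_000, 5_000)
print(partitions)  # 36 -> round up to a convenient count (e.g., 48)
```

Rounding up to a slightly larger count (48 rather than 36) leaves extra headroom and divides evenly across more consumer-group sizes, which makes rebalancing easier.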
Common window types (quick reference)
- Tumbling: fixed, non-overlapping (e.g., 5m)
- Hopping (sliding with a fixed step): overlapping windows of size S that advance by step R, where R < S
- Session: grouped by inactivity gaps, great for user sessions
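As a concrete illustration of these window types, here is a minimal sketch in PySpark Structured Streaming (one of the frameworks named in the prerequisites); the `rate` source and column names are stand-ins for a real event stream:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, session_window, col

spark = SparkSession.builder.appName("window-types").getOrCreate()

# Stand-in stream with an event_time and a synthetic user_id.
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "event_time")
          .withColumn("user_id", col("value") % 100)
          .withWatermark("event_time", "10 minutes"))

# Tumbling: fixed, non-overlapping 5-minute buckets.
tumbling = events.groupBy(window("event_time", "5 minutes")).count()

# Hopping: 1-minute windows that advance every 5 seconds (overlapping).
hopping = events.groupBy(window("event_time", "1 minute", "5 seconds")).count()

# Session: a user's window closes after 30 minutes of inactivity.
sessions = (events
            .groupBy(session_window("event_time", "30 minutes"), col("user_id"))
            .count())
```

Each variable is a streaming DataFrame you would hand to `writeStream` with an appropriate output mode and sink.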
Delivery semantics (quick reference)
- At-most-once: minimal latency, possible data loss
- At-least-once: duplicates possible; combine with idempotent writes
- Exactly-once: strongest guarantee; typically higher cost/complexity
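A tiny, framework-free sketch of why at-least-once plus idempotent handling is usually enough; the message shape and in-memory dedup set are illustrative stand-ins for a real broker and a persistent dedup store:

```python
# Simulated at-least-once delivery: the broker may redeliver a message.
deliveries = [
    {"id": "evt-1", "amount": 10},
    {"id": "evt-2", "amount": 5},
    {"id": "evt-1", "amount": 10},  # duplicate redelivery
]

processed_ids = set()  # dedup table; in production, a persistent store with a TTL
total = 0

for msg in deliveries:
    if msg["id"] in processed_ids:
        continue  # idempotent: a replayed message changes nothing
    processed_ids.add(msg["id"])
    total += msg["amount"]

assert total == 15  # duplicates did not inflate the result
```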
Worked examples
Example 1: Clickstream dashboard (latency & windows)
Goal: show active users per minute with updates every 5 seconds.
- Latency target: under 10s
- Windows: hopping 1m window, slide 5s
- Event-time: watermark at 2m to tolerate late events; allowed lateness 1m to update late counts
- Semantics: at-least-once + idempotent sink (upserts by window_start, user_country)
- Partitions: key by user_id for even spread; 48 partitions for peak load
- Sinks: cache (for UI) + object storage (for backfill)
Why hopping windows?
Hopping windows give near-continuous updates for a longer aggregate (1 minute) while sliding every 5 seconds for freshness. Users see trends without waiting for a full minute.
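A minimal sketch of this pipeline in PySpark Structured Streaming; the broker address, topic name, and schema are assumptions, and the console sink stands in for the cache and object-storage sinks. Note that Spark folds allowed lateness into the watermark rather than exposing it as a separate setting:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, from_json, approx_count_distinct
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-dashboard").getOrCreate()

# Assumed event schema for the "clicks" topic.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("user_country", StringType()),
    StructField("event_time", TimestampType()),
])

clicks = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

active_users = (clicks
    # Watermark bounds how long state is kept while waiting for late events.
    .withWatermark("event_time", "2 minutes")
    # Hopping window: a 1-minute aggregate refreshed every 5 seconds.
    .groupBy(window("event_time", "1 minute", "5 seconds"), col("user_country"))
    .agg(approx_count_distinct("user_id").alias("active_users")))

# Update mode plus a sink that upserts by (window_start, user_country)
# keeps results idempotent under retries; console is only a stand-in.
query = (active_users.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/chk/clickstream")
         .start())
```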
Example 2: Payment fraud scoring (consistency & idempotency)
Goal: score transactions and trigger alerts with minimal false negatives.
- Latency target: under 2s
- Semantics: at-least-once processing; sink uses idempotent writes keyed by transaction_id
- State: session windows per card_id to compute velocity features
- Watermark: tight (30s) because decisions must be quick; late events still enrich features later
- Failure: retries with exponential backoff, DLQ after N failures
Why not exactly-once?
Operationally, at-least-once with dedup is simpler and fast enough. Alerts tolerate duplicate deliveries as long as the consumer side is idempotent (an alert for an already-seen transaction_id is ignored).
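A minimal sketch of the dedup step from this example; the `rate` source stands in for the parsed payments stream, and including the event-time column in the dedup key lets the watermark expire old dedup state (PySpark):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("fraud-dedup").getOrCreate()

# Stand-in for the parsed payments stream; in practice this comes from Kafka.
transactions = (spark.readStream.format("rate").load()
                .withColumnRenamed("timestamp", "event_time")
                .withColumn("transaction_id", col("value").cast("string"))
                .withColumn("amount", col("value") * 1.0))

deduped = (transactions
    # Tight 30-second watermark: decisions must be quick, and it bounds how
    # long dedup state per transaction_id is kept.
    .withWatermark("event_time", "30 seconds")
    # At-least-once upstream + dropping duplicates here = no double scoring.
    .dropDuplicates(["transaction_id", "event_time"]))

# Scoring, session-window velocity features, and the alert sink would follow.
```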
Example 3: IoT anomaly detection (ordering & backpressure)
Goal: detect temperature spikes per device with 10s latency.
- Keying: device_id to keep order per device
- Window: tumbling 10s, with per-device running stats
- Backpressure plan: cap source read rate; autoscale consumers; shed non-critical enrichments first
- Observability: track consumer lag, watermark delay, anomaly rate; alert when lag grows for 5 minutes
Handling out-of-order events
Use event-time processing and watermarks; buffer small lateness (e.g., 15s). Late events beyond allowed lateness go to a correction path or are logged for audit.
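A minimal sketch of the per-device aggregation with a capped source read rate; the broker address, topic, schema, and the per-trigger cap are assumptions (PySpark):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, from_json, avg, stddev
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("iot-anomaly").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "device-temps")
            # Crude backpressure guard: cap how much is read per micro-batch.
            .option("maxOffsetsPerTrigger", 50000)
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

per_device = (readings
    # Small lateness buffer; events later than this are dropped from the
    # aggregate and should be handled by a separate correction path.
    .withWatermark("event_time", "15 seconds")
    # Tumbling 10-second windows keyed by device_id: per-device stats,
    # computed in parallel across devices.
    .groupBy(window("event_time", "10 seconds"), col("device_id"))
    .agg(avg("temperature").alias("avg_temp"),
         stddev("temperature").alias("stddev_temp")))
```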
Key design choices to get right
- Event-time vs processing-time: prefer event-time for correctness; use processing-time for simple, low-risk counters
- Watermarks & lateness: balance freshness with completeness
- Partitioning: enough partitions for peak; key for affinity and ordering
- State growth: TTLs and compaction prevent unbounded state
- Idempotency: upsert keys, exactly-once sinks, or dedup tables
- Schema evolution: enforce compatibility; reject or route bad payloads to DLQ (see the routing sketch after this list)
- Replay strategy: make reprocessing deterministic; control side effects
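A minimal sketch of the validate-or-route-to-DLQ gate mentioned above; the topic names, broker address, and JSON schema are assumptions (PySpark):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-gate").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "orders")
       .load()
       .withColumn("parsed", from_json(col("value").cast("string"), schema)))

# Payloads that fail to parse (or miss a required field) come back as NULLs.
valid = raw.where(col("parsed.order_id").isNotNull()).select("parsed.*")
invalid = raw.where(col("parsed.order_id").isNull()).select("value")

# Route bad payloads to a dead-letter topic for triage instead of failing the job.
dlq = (invalid.writeStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("topic", "orders-dlq")
       .option("checkpointLocation", "/tmp/chk/orders-dlq")
       .start())
```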
Go-live checklist
- [ ] Defined latency SLO and measured end-to-end
- [ ] Selected semantics and documented idempotency
- [ ] Windowing and watermarks validated with late data
- [ ] Partitions sized for peak with headroom
- [ ] Backpressure plan and autoscaling tested
- [ ] DLQ with sampling alert and triage playbook
- [ ] Schema validation and compatibility tests
- [ ] Observability: lag, watermark, error rate, throughput, costs
- [ ] Security: PII handling, encryption, access controls
Exercises
Complete the tasks below. A worked solution is available in each exercise. Do the work first, then compare.
- ex1 — Surge pricing features (design)
Design a pipeline to compute a real-time demand index per zone for a ride-sharing app. Use event-time, handle late events, and produce both a cache feed and a long-term store. See the full task in the Exercises panel below.
- ex2 — Throughput sizing (math)
Given peak events and per-consumer capacity, compute the partition count and parallelism. See the full task in the Exercises panel below.
A quick self-check for your exercise designs:
- [ ] Wrote down latency SLO and semantics
- [ ] Chose keys and partition count
- [ ] Defined windowing and watermark policy
- [ ] Planned DLQ and retries
- [ ] Picked sinks with idempotency
Common mistakes and self-check
- No event-time or watermarks: results drift when late events arrive. Self-check: compare event-time vs processing-time results over a day.
- Too few partitions: persistent lag at peak. Self-check: peak lag grows even after scaling consumers.
- No idempotency: duplicates inflate metrics. Self-check: simulate replays; totals should not double.
- Unbounded state: memory/RocksDB bloat. Self-check: verify TTLs and compaction; chart state size over time.
- Skipping schema validation: downstream breaks silently. Self-check: introduce a bad payload; ensure it routes to DLQ with an alert.
- Over-tight watermarks: frequent late-data corrections. Self-check: track late ratio; adjust allowed lateness.
Practical projects
- Build a real-time MAU counter with hopping windows; expose metrics via a simple API
- Implement a DLQ triage job that samples messages and summarizes top error reasons hourly
- Create a replay tool to backfill a day of events deterministically without duplicate side effects
Learning path
- Before: event brokers, message formats, and basic streaming primitives
- Now: designing real-time pipelines (this lesson)
- Next: stateful processing patterns, watermark tuning, exactly-once sinks, and serving-store design
Next steps
- Complete the exercises below
- Take the Quick Test to check retention (available to everyone; logged-in users get saved progress)
- Pick one Practical project and implement it with a small dataset
Mini challenge
You must compute per-customer 15-minute spend with updates every 30 seconds, tolerate 2 minutes of lateness, and avoid double-counting on replays. In one paragraph, specify: window type and slide, watermark and allowed lateness, delivery semantics, dedup/idempotency strategy, partition key, and sinks. Keep it concise and actionable.