Topic 8 of 8

Designing Real Time Pipelines

Learn to design real-time pipelines for free, with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Who this is for

This lesson is for aspiring and practicing Data Engineers who want to design reliable, low-latency streaming data pipelines for dashboards, ML features, alerts, and automations.

Prerequisites

  • Comfort with batch data flows (ETL/ELT)
  • Basic understanding of messaging systems (topics, partitions, consumers)
  • Familiarity with SQL and one streaming framework conceptually (e.g., Flink, Spark Structured Streaming, Beam)

Why this matters

Real tasks you will face as a Data Engineer:

  • Power product dashboards with second-level latency
  • Continuously compute aggregates for personalization or pricing
  • Feed real-time features to ML models
  • Detect anomalies and trigger alerts
  • Move data from event streams to serving stores and data lakes safely

Design decisions (windows, semantics, partitioning, backpressure handling) determine reliability, cost, and user experience.

Concept explained simply

A real-time pipeline continuously ingests events, transforms them as they arrive, and outputs results with controlled latency and correctness. You trade off latency, cost, and strictness of correctness.

Mental model

Imagine a factory conveyor:

  • Source: events hop onto the belt (e.g., Kafka topic)
  • Stations: parse, validate, enrich, aggregate
  • Quality gates: schema checks, deduplication, DLQs
  • Pack and ship: results go to sinks (caches, databases, lakehouse)
  • Safety systems: checkpoints, retries, idempotency
  • Speed control: backpressure and autoscaling

Core design steps

  1. Clarify outcome and latency: metrics, alerts, or features? Target end-to-end latency (e.g., 2s, 30s, 5m).
  2. Define correctness: at-most-once, at-least-once, or exactly-once. Plan idempotency and dedup.
  3. Choose event-time strategy: watermarks, allowed lateness, and window type (tumbling, hopping, sliding, session).
  4. Model data contracts: schema format (Avro/Protobuf/JSON), evolution strategy, validation.
  5. Partition and keys: select keys for ordering and parallelism; estimate partitions from peak throughput.
  6. State & storage: choose state stores and checkpointing cadence; plan state TTLs and compaction.
  7. Failure paths: retries, DLQ, poison pill handling, backfills/replays.
  8. Observability: metrics (lag, watermark, error rate), logs, alerts, lineage.
  9. Security & governance: PII, access control, encryption, retention.
  10. Cost & scaling: autoscaling policy, resource quotas, backpressure strategy.
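
Step 5 (partitioning from peak throughput) comes down to simple arithmetic. A minimal sketch, with hypothetical figures for peak load and per-consumer capacity:

```python
import math

def size_partitions(peak_events_per_sec, per_consumer_capacity, headroom=0.5):
    """Partitions needed: peak load plus headroom, divided by what one
    consumer can process, rounded up."""
    return math.ceil(peak_events_per_sec * (1 + headroom) / per_consumer_capacity)

# Hypothetical figures: 60k events/s at peak, 2k events/s per consumer,
# 50% headroom.
print(size_partitions(60_000, 2_000))  # 45
```

The headroom factor is a judgment call; size for measured peaks plus growth, not averages.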

Common window types (quick reference)

  • Tumbling: fixed, non-overlapping (e.g., 5m)
  • Hopping (sliding with fixed step): overlapping windows (size S, slide R)
  • Session: grouped by inactivity gaps, great for user sessions
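
The window types above differ only in how a timestamp maps to window start times. A small sketch of that mapping (timestamps in seconds; window starts aligned to multiples of the slide):

```python
def tumbling_window(ts, size):
    """The single non-overlapping window [start, start + size) containing ts."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, slide):
    """All overlapping windows [s, s + size), with s a multiple of `slide`,
    that contain ts; each point falls into size / slide windows."""
    first = ((ts - size) // slide + 1) * slide
    last = (ts // slide) * slide
    return [(s, s + size) for s in range(first, last + 1, slide)]

print(tumbling_window(125, 60))           # (120, 180)
print(len(hopping_windows(125, 60, 15)))  # 4 windows contain ts=125
```

Session windows have no closed-form mapping; they are built incrementally from inactivity gaps in the stream.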

Delivery semantics (quick reference)

  • At-most-once: minimal latency, possible data loss
  • At-least-once: duplicates possible; combine with idempotent writes
  • Exactly-once: strongest guarantee; typically higher cost/complexity
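
The difference between the last two semantics is easiest to see at the sink. A toy sketch: under at-least-once, a redelivered result inflates an append-only sink, while a keyed upsert absorbs it (`event_id` here is a hypothetical idempotency key):

```python
# Hypothetical windowed results; `event_id` is the idempotency key.
events = [
    {"event_id": "e1", "window_start": 0, "count": 5},
    {"event_id": "e2", "window_start": 60, "count": 3},
    {"event_id": "e1", "window_start": 0, "count": 5},  # redelivered under at-least-once
]

append_sink = []   # naive append-only sink: every delivery lands
upsert_sink = {}   # idempotent sink: keyed writes overwrite

for ev in events:
    append_sink.append(ev["count"])
    upsert_sink[ev["event_id"]] = ev["count"]

print(sum(append_sink))           # 13 -- the duplicate inflates the total
print(sum(upsert_sink.values()))  # 8  -- the upsert absorbs the redelivery
```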

Worked examples

Example 1: Clickstream dashboard (latency & windows)

Goal: show active users per minute with updates every 5 seconds.

  • Latency target: under 10s
  • Windows: hopping 1m window, slide 5s
  • Event-time: watermark at 2m to tolerate late events; allowed lateness 1m to update late counts
  • Semantics: at-least-once + idempotent sink (upserts by window_start, user_country)
  • Partitions: key by user_id for even spread; 48 partitions for peak load
  • Sinks: cache (for UI) + object storage (for backfill)

Why hopping windows?

Hopping windows give near-continuous updates for a longer aggregate (1 minute) while sliding every 5 seconds for freshness. Users see trends without waiting for a full minute.
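
The aggregation in this example can be sketched in a few lines. This is a toy in-memory version using the upsert key from the design above (window_start plus country); the user IDs, timestamps, and counts are made up:

```python
from collections import defaultdict

WINDOW, SLIDE = 60, 5  # 1-minute hopping window, 5-second slide (in seconds)

def windows_for(ts):
    """Window starts (multiples of SLIDE, clamped at 0) whose window contains ts."""
    first = ((ts - WINDOW) // SLIDE + 1) * SLIDE
    return range(max(first, 0), (ts // SLIDE) * SLIDE + 1, SLIDE)

# Distinct users per (window_start, country); the tuple is the upsert key.
active = defaultdict(set)

clicks = [  # hypothetical (event_time_sec, user_id, country)
    (3, "u1", "DE"), (4, "u2", "DE"), (7, "u1", "DE"), (11, "u3", "FR"),
]
for ts, user, country in clicks:
    for start in windows_for(ts):
        active[(start, country)].add(user)

# Idempotent upsert rows for the dashboard sink: replaying the clicks
# recomputes the same counts instead of double-counting.
rows = {key: len(users) for key, users in active.items()}
print(rows[(0, "DE")])  # 2 distinct users in window [0, 60) for DE
```

A real implementation would keep this state in the streaming framework's state store and emit upserts on each slide, but the key structure is the same.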

Example 2: Payment fraud scoring (consistency & idempotency)

Goal: score transactions and trigger alerts with minimal false negatives.

  • Latency target: under 2s
  • Semantics: at-least-once processing; sink uses idempotent writes keyed by transaction_id
  • State: session windows per card_id to compute velocity features
  • Watermark: tight (30s) because decisions must be quick; late events still enrich features later
  • Failure: retries with exponential backoff, DLQ after N failures

Why not exactly-once?

Operationally, at-least-once with dedup is simpler and fast enough. Duplicate alerts are harmless as long as the consumer side is idempotent (an alert repeating an already-seen transaction_id is ignored).
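
The retry and DLQ path from this example can be sketched as follows (the `scorer` handler and its payloads are hypothetical):

```python
import time

def handle_with_retries(message, handler, dlq, max_attempts=3, base_delay=0.01):
    """Retry with exponential backoff; after max_attempts failures the
    message is treated as a poison pill and routed to the DLQ."""
    for attempt in range(max_attempts):
        try:
            return handler(message)
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # backoff: 10ms, 20ms, 40ms
    dlq.append(message)
    return None

def scorer(msg):
    """Hypothetical fraud scorer; rejects payloads missing the idempotency key."""
    if "transaction_id" not in msg:
        raise ValueError("missing transaction_id")
    return {"transaction_id": msg["transaction_id"], "score": 0.1}

dlq = []
print(handle_with_retries({"transaction_id": "t1", "amount": 42}, scorer, dlq))
handle_with_retries({"amount": 99}, scorer, dlq)  # poison pill -> DLQ
print(dlq)  # [{'amount': 99}]
```

In production the backoff would be jittered and the DLQ would be a separate topic, not an in-memory list.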

Example 3: IoT anomaly detection (ordering & backpressure)

Goal: detect temperature spikes per device with 10s latency.

  • Keying: device_id to keep order per device
  • Window: tumbling 10s, with per-device running stats
  • Backpressure plan: cap source read rate; autoscale consumers; shed non-critical enrichments first
  • Observability: track consumer lag, watermark delay, anomaly rate; alert when lag grows for 5 minutes

Handling out-of-order events

Use event-time processing and watermarks; buffer small lateness (e.g., 15s). Late events beyond allowed lateness go to a correction path or are logged for audit.
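
A minimal sketch of that logic, assuming the watermark simply trails the maximum event time seen by the allowed lateness (real frameworks offer richer watermark strategies):

```python
ALLOWED_LATENESS = 15  # seconds of lateness buffered before the correction path

def split_by_watermark(events):
    """Accept events no older than (watermark = max event time seen - lateness);
    route anything older to a correction/audit path."""
    max_seen = 0
    accepted, corrections = [], []
    for ts, reading in events:
        max_seen = max(max_seen, ts)
        watermark = max_seen - ALLOWED_LATENESS
        (accepted if ts >= watermark else corrections).append((ts, reading))
    return accepted, corrections

# Hypothetical out-of-order temperature readings: (event_time_sec, temp_c)
events = [(10, 21.0), (30, 21.4), (20, 21.2), (5, 20.9)]
accepted, corrections = split_by_watermark(events)
print(len(accepted), len(corrections))  # 3 1
```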

Key design choices to get right

  • Event-time vs processing-time: prefer event-time for correctness; use processing-time for simple, low-risk counters
  • Watermarks & lateness: balance freshness with completeness
  • Partitioning: enough partitions for peak; key for affinity and ordering
  • State growth: TTLs and compaction prevent unbounded state
  • Idempotency: upsert keys, exactly-once sinks, or dedup tables
  • Schema evolution: enforce compatibility; reject or route bad payloads to DLQ
  • Replay strategy: make reprocessing deterministic; control side effects
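
The schema-evolution bullet can be illustrated with a toy validator, reusing the `ride_id`/`zone_id`/`event_time` fields from the exercise scenario below; a real pipeline would enforce this with Avro or Protobuf plus a schema registry rather than hand-rolled checks:

```python
REQUIRED = {"ride_id": str, "zone_id": str, "event_time": str}

def validate(payload):
    """Minimal contract check: required fields present with expected types."""
    return all(isinstance(payload.get(field), typ) for field, typ in REQUIRED.items())

good, dlq = [], []
for msg in [
    {"ride_id": "r1", "zone_id": "z9", "event_time": "2026-01-08T10:00:00Z"},
    {"ride_id": "r2", "zone_id": 7},  # wrong type, missing field -> DLQ
]:
    (good if validate(msg) else dlq).append(msg)

print(len(good), len(dlq))  # 1 1
```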

Go-live checklist

  • [ ] Defined latency SLO and measured end-to-end
  • [ ] Selected semantics and documented idempotency
  • [ ] Windowing and watermarks validated with late data
  • [ ] Partitions sized for peak with headroom
  • [ ] Backpressure plan and autoscaling tested
  • [ ] DLQ with sampling alert and triage playbook
  • [ ] Schema validation and compatibility tests
  • [ ] Observability: lag, watermark, error rate, throughput, costs
  • [ ] Security: PII handling, encryption, access controls

Exercises

Complete the tasks below. A worked solution is available in each exercise. Do the work first, then compare.

  1. ex1 — Surge pricing features (design)
    Design a pipeline to compute a real-time demand index per zone for a ride-sharing app. Use event-time, handle late events, and produce both a cache feed and a long-term store. See the full task in the Exercises panel below.
  2. ex2 — Throughput sizing (math)
    Given peak events and per-consumer capacity, compute the partition count and parallelism. See the full task in the Exercises panel below.

Self-check as you work through the exercises:

  • [ ] Wrote down latency SLO and semantics
  • [ ] Chose keys and partition count
  • [ ] Defined windowing and watermark policy
  • [ ] Planned DLQ and retries
  • [ ] Picked sinks with idempotency

Common mistakes and self-check

  • No event-time or watermarks: results drift when late events arrive. Self-check: compare event-time vs processing-time results over a day.
  • Too few partitions: persistent lag at peak. Self-check: peak lag grows even after scaling consumers.
  • No idempotency: duplicates inflate metrics. Self-check: simulate replays; totals should not double.
  • Unbounded state: memory/RocksDB bloat. Self-check: verify TTLs and compaction; chart state size over time.
  • Skipping schema validation: downstream breaks silently. Self-check: introduce a bad payload; ensure it routes to DLQ with an alert.
  • Over-tight watermarks: frequent late-data corrections. Self-check: track late ratio; adjust allowed lateness.

Practical projects

  • Build a real-time MAU counter with hopping windows; expose metrics via a simple API
  • Implement a DLQ triage job that samples messages and summarizes top error reasons hourly
  • Create a replay tool to backfill a day of events deterministically without duplicate side effects

Learning path

  • Before: event brokers, message formats, and basic streaming primitives
  • Now: designing real-time pipelines (this lesson)
  • Next: stateful processing patterns, watermark tuning, exactly-once sinks, and serving-store design

Next steps

  • Complete the exercises below
  • Take the Quick Test to check retention (available to everyone; logged-in users get saved progress)
  • Pick one Practical project and implement it with a small dataset

Mini challenge

You must compute per-customer 15-minute spend with updates every 30 seconds, tolerate 2 minutes of lateness, and avoid double-counting on replays. In one paragraph, specify: window type and slide, watermark and allowed lateness, delivery semantics, dedup/idempotency strategy, partition key, and sinks. Keep it concise and actionable.

Practice Exercises

2 exercises to complete

Instructions

Scenario: You need a real-time demand index per zone for surge pricing. Events arrive on topic rides with fields ride_id, event_time (ISO), zone_id, status in {requested, accepted, started, completed}. Compute a demand_index per zone = requests_in_5m / drivers_available_in_5m (assume drivers_available events stream exists). Update results every 15 seconds, using event-time with late events up to 2 minutes.

  1. Choose delivery semantics and explain idempotency.
  2. Define windows/triggers: type, size, slide, watermark, allowed lateness.
  3. Propose partitioning and parallelism assumptions.
  4. Design failure handling: retries, DLQ, poison pills.
  5. Specify sinks: fast cache for API and durable store for audit/backfill. Define upsert keys.

Expected Output

A concise design spec covering semantics, windowing (5m hop with 15s slide), watermark ~2m, partition key zone_id, idempotent upserts keyed by (zone_id, window_start). Two sinks: cache and object storage. DLQ policy described.
