Why this matters
Data Platform Engineers power real-time products: live dashboards, fraud detection, alerting, personalization, and CDC pipelines into warehouses. Stream processing concepts let you build systems that are timely, correct under disorder, and scalable.
- Create real-time metrics (e.g., 1-minute conversion rate per region).
- Detect patterns (e.g., 3 failed logins in 2 minutes).
- Maintain continuously updated materialized views for APIs.
- Ingest CDC from OLTP to analytical stores with deduplication.
- Reprocess historical data safely when logic changes.
Concept explained simply
Streaming treats data as an unbounded sequence of small events. Instead of running a job once over a fixed file, we keep a job running and update results incrementally as events arrive.
- Event: a small record describing something that happened (with a timestamp).
- Topic/Stream: an append-only log of events, often partitioned for scale.
- Partition: a shard of the stream; events within a partition are ordered.
- Consumer group: set of workers that share the partitions.
- Offset/position: where a consumer is in the log; enables replay.
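These pieces can be made concrete with a toy, in-memory sketch (illustrative only; `Log`, `append`, and `read` are made-up names, not a real client API): events are appended to a partitioned log, and a consumer's position is just an index it can rewind to replay.

```python
class Log:
    """Toy append-only, partitioned log (a stand-in for a topic/stream)."""
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key, event):
        # The same key always maps to the same partition, so events for one
        # key stay ordered relative to each other.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(event)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # Consumers track their own offset, which is what enables replay.
        return self.partitions[partition][offset:]

log = Log()
log.append("user-1", {"type": "click", "ts": 1})
log.append("user-1", {"type": "view", "ts": 2})
p, _ = log.append("user-1", {"type": "click", "ts": 3})
print(log.read(p, 1))  # replay everything from offset 1 onward
```

Real logs (e.g., Kafka) persist partitions durably and track committed offsets per consumer group, but the replay-by-offset idea is the same.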
Time semantics
- Processing time: when the system sees the event.
- Event time: when the event actually occurred (embedded in the event).
- Out-of-order events: real systems deliver late or skewed events; event-time logic handles this.
Windows
- Tumbling: fixed, non-overlapping intervals (e.g., every 5 minutes).
- Hopping/Sliding: fixed size with a hop/slide (e.g., size 10 min, hop 1 min).
- Session: gaps of inactivity define the window per key.
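Window assignment for the fixed-interval types is just arithmetic on event timestamps. A minimal sketch, assuming integer timestamps in seconds (function names are illustrative):

```python
def tumbling_window(t, size):
    # Each event falls into exactly one fixed, non-overlapping interval.
    start = t - (t % size)
    return (start, start + size)

def hopping_windows(t, size, hop):
    # An event belongs to every window whose [start, start + size) contains it,
    # so with size 600 and hop 60 each event lands in 10 windows.
    last_start = (t // hop) * hop
    first_start = last_start - size + hop
    return [(s, s + size) for s in range(first_start, last_start + 1, hop)]

assert tumbling_window(125, 300) == (0, 300)
assert len(hopping_windows(660, 600, 60)) == 10
```

Session windows differ: their boundaries are data-driven (a gap of inactivity per key closes the session), so they cannot be computed from a single timestamp alone.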
Other essentials
- State: memory of past events (counts, sets, joins). Needs checkpointing for fault tolerance.
- Watermarks: the job’s best guess of how far event time has progressed; defines when to finalize results and how to treat late data.
- Delivery semantics: at-most-once (fast, may drop), at-least-once (may duplicate), exactly-once (no loss, no duplicates; usually via idempotent/transactional sinks with checkpoints).
- Backpressure: downstream is slower than upstream; good systems throttle to stay stable.
- Reprocessing: replay from offsets and recompute with new logic for correctness.
Mental model
Imagine an assembly line:
- Conveyor belt (stream) brings parts (events).
- Stations (operators) group parts by item (keyBy) and by time (window).
- Bins (state) store intermediate counts, sets, and joins.
- Clock on the wall (watermark) tells when we can seal a bin.
- Quality check (exactly-once) ensures each boxed result is shipped once, even if a station restarts.
Core building blocks
Events and schemas
Include a stable event_id, event_time, and a versioned schema. This enables deduplication and evolution without breaking consumers.
Partitioning
Partition by a stable key (e.g., user_id) to keep related events ordered and together for keyed operations. Choose partition count based on expected parallelism.
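One sketch of a stable partitioner (illustrative, not a specific broker's algorithm): hash the key with a deterministic hash rather than a process-salted one, so the same key maps to the same partition across machines and restarts.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # A stable hash (unlike Python's salted built-in hash()) assigns the same
    # partition across processes and restarts, keeping a key's events together.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for user-42 land in one partition, preserving their order.
assert partition_for("user-42", 8) == partition_for("user-42", 8)
```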
Offsets, checkpoints, and recovery
Regularly checkpoint operator state and source positions. On failure, restart from the last checkpoint and continue from stored offsets.
Event time, watermarks, and late data
Prefer event-time windows for correctness. Use watermarks to bound lateness and decide when to finalize windows; optionally allow a small "allowed lateness" period to update late results.
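A common watermark strategy is "bounded out-of-orderness": track the maximum event time seen and subtract a fixed lateness bound. A minimal sketch (class name is illustrative):

```python
class BoundedLatenessWatermark:
    """Watermark = max event time observed minus a fixed lateness bound."""
    def __init__(self, max_lateness):
        self.max_lateness = max_lateness
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        # Late events never move the watermark backwards.
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self):
        return self.max_event_time - self.max_lateness

wm = BoundedLatenessWatermark(max_lateness=120)  # tolerate 2 minutes of disorder
for t in [100, 300, 250]:  # out-of-order arrivals
    wm.observe(t)
# A window ending at or before the watermark can be finalized.
assert wm.current() == 180
```

Events arriving with a timestamp below the watermark are "late"; allowed lateness decides whether they update an already-emitted result or go to a side output.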
Windows and aggregations
Pick a window type that matches the question: fixed intervals for reporting, sessions for bursty behavior, or sliding windows for rolling metrics.
Stateful processing and joins
State enables counters, dedup sets, and stream-stream joins within a bounded time. Set TTLs to prevent unbounded growth.
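A TTL can be enforced lazily on access, as in this sketch of keyed state (names are illustrative; real engines typically expire state via background cleanup as well):

```python
import time

class TTLState:
    """Keyed state that expires idle entries so joins/sets don't grow forever."""
    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock           # injectable clock makes testing easy
        self.store = {}              # key -> (value, last_update_time)

    def put(self, key, value):
        self.store[key] = (value, self.clock())

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, ts = entry
        if self.clock() - ts > self.ttl_s:
            del self.store[key]      # lazily expire on access
            return None
        return value
```

Without a TTL, a stream-stream join buffering one side "until a match arrives" will eventually hold every unmatched event ever seen.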
Delivery semantics and idempotency
At-least-once delivery is common; achieve exactly-once results by making sinks idempotent (e.g., upserts keyed by event_id) or by using transactional writes.
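Idempotency is the key property: applying the same write twice must leave the sink in the same state as applying it once. A minimal upsert-sink sketch (names are illustrative):

```python
class UpsertSink:
    """Idempotent sink: a retried write with the same key overwrites instead
    of appending, so redelivery under at-least-once causes no double counts."""
    def __init__(self):
        self.table = {}

    def write(self, event_id, row):
        self.table[event_id] = row   # upsert keyed by a stable event_id

sink = UpsertSink()
sink.write("evt-1", {"amount": 10})
sink.write("evt-1", {"amount": 10})  # redelivered after a retry
assert len(sink.table) == 1          # still exactly one result
```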
Backpressure and scaling
Monitor lag and throughput. Scale consumers up to the number of partitions. Apply rate limits and batching to stabilize hot paths.
Reprocessing safely
Re-run from a historical offset with the same deterministic logic and idempotent sink so results remain consistent.
Worked examples
1) 5-minute unique visitors per page
- Key by page_id, window: tumbling 5 minutes (event time).
- State: a set of user_id per window; emit count at window close.
- Watermark: max observed event time minus 2 minutes; allowed lateness: 1 minute to update late arrivals.
Sample output: {window_start: 12:00, page: "/home", unique_users: 1245}
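The core of this example can be sketched in a few lines (illustrative, assuming integer event times in seconds and a 5-minute tumbling window; a real engine would call the window-close step when the watermark passes the window end):

```python
from collections import defaultdict

WINDOW = 300  # 5 minutes of event time, in seconds

# keyed state: (page_id, window_start) -> set of user_ids
state = defaultdict(set)

def on_event(page_id, user_id, event_time):
    window_start = event_time - (event_time % WINDOW)
    state[(page_id, window_start)].add(user_id)

def close_window(page_id, window_start):
    # Invoked once the watermark passes window_start + WINDOW.
    users = state.pop((page_id, window_start), set())
    return {"window_start": window_start, "page": page_id,
            "unique_users": len(users)}

on_event("/home", "u1", 10)
on_event("/home", "u2", 20)
on_event("/home", "u1", 30)        # same user again: set deduplicates
assert close_window("/home", 0)["unique_users"] == 2
```

Note that a per-window set of user_ids can get large; at scale you might trade exactness for a sketch such as HyperLogLog.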
2) Fraud signal: 3 failed logins in 2 minutes
- Key by user_id. Sliding window: size 2 minutes, hop 10 seconds.
- Maintain a count of failures; if count >= 3, emit an alert.
- Clear or decay state after 5 minutes of inactivity.
Sample output: {user_id: 381, rule: "3_failures_2m", time: 09:21:10}
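One way to sketch this rule (illustrative; this version evaluates the 2-minute condition on every event rather than on a literal 10-second hop, which is an equivalent and simpler formulation for a threshold alert):

```python
from collections import defaultdict, deque

WINDOW_S = 120   # 2 minutes
THRESHOLD = 3

failures = defaultdict(deque)   # user_id -> event times of recent failures

def on_failed_login(user_id, event_time):
    q = failures[user_id]
    q.append(event_time)
    while q and q[0] <= event_time - WINDOW_S:
        q.popleft()             # evict failures that fell out of the window
    if len(q) >= THRESHOLD:
        return {"user_id": user_id, "rule": "3_failures_2m", "time": event_time}
    return None

assert on_failed_login(381, 0) is None
assert on_failed_login(381, 50) is None
assert on_failed_login(381, 110)["rule"] == "3_failures_2m"
```

In production you would also expire idle users' deques with a TTL, matching the "clear or decay state after 5 minutes" point above.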
3) CDC into a warehouse with dedup
- Stream change events with primary keys and op types (insert/update/delete) and a monotonically increasing source LSN.
- Dedup on (table, primary_key, source_lsn) to handle at-least-once delivery.
- Sink as idempotent upserts; checkpoint state and last applied LSN for exactly-once outcomes.
Sample result: warehouse table continuously reflects the OLTP state with no duplicates.
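The dedup-and-upsert logic can be sketched as follows (illustrative names; a real pipeline would persist `last_lsn` in checkpointed state so recovery replays are also no-ops):

```python
class CdcApplier:
    """Apply CDC changes as idempotent upserts, skipping replayed LSNs."""
    def __init__(self):
        self.rows = {}       # primary_key -> current row
        self.last_lsn = {}   # primary_key -> last applied LSN

    def apply(self, pk, lsn, op, row=None):
        if lsn <= self.last_lsn.get(pk, -1):
            return False     # duplicate delivery or replay: already applied
        if op == "delete":
            self.rows.pop(pk, None)
        else:                # insert or update both become an upsert
            self.rows[pk] = row
        self.last_lsn[pk] = lsn
        return True

w = CdcApplier()
w.apply(1, 10, "insert", {"name": "Ada"})
w.apply(1, 10, "insert", {"name": "Ada"})     # redelivered change: no-op
w.apply(1, 11, "update", {"name": "Ada L."})
assert w.rows[1] == {"name": "Ada L."}
```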
Exercises
Do these in a notebook or as a design doc. Solutions are available in the reveal section of each exercise on this page. Everyone can take the exercises; only logged-in users get saved progress.
Exercise 1: 10-minute revenue per store with late events
Design a stream job that computes revenue per store every 10 minutes using event time. Late events may arrive up to 5 minutes late.
- Event schema: {event_id, event_time, store_id, price, quantity}.
- Window: tumbling 10 minutes (event time). Watermark = max observed event time - 5 minutes.
- Aggregate: sum(price * quantity) by store_id per window.
- Handle duplicates via event_id. Sink results idempotently.
- Include a brief note on state TTL and backpressure handling.
Expected output: one row per store per window, e.g., {window_start: 14:00, store_id: 42, revenue: 1234.50}. Late events within 5 minutes update the same window.
Exercise 2: Join orders with payments exactly-once
Design a stream-stream join that emits an order_paid event when a payment matches an order within 30 minutes.
- Orders stream: {order_id, user_id, amount, event_time}.
- Payments stream: {payment_id, order_id, amount, event_time}.
- Event-time interval join: 30 minutes. Watermark lag: 10 minutes.
- Validate amounts; dedup using a stable id (payment_id or a composite key).
- Sink must be idempotent/transactional to avoid duplicates on retries.
- Route unmatched events after TTL to a dead-letter stream for investigation.
Expected output: {order_id, user_id, amount, paid_time, sources: [order_event_id, payment_id]} once per matched order.
- [Checklist] I chose event-time processing and stated my watermark.
- [Checklist] I explained my window choice and size.
- [Checklist] I defined deduplication keys and sink idempotency.
- [Checklist] I specified state TTLs and what late data does.
- [Checklist] I mentioned backpressure/scale behavior.
Common mistakes and self-check
- Using processing-time windows for business metrics that require event-time correctness. Self-check: would results change if events arrive late? If yes, use event time.
- No watermarks: windows never finalize or finalize inconsistently. Self-check: can you explain when a window closes?
- Unbounded state: joins/sets grow forever. Self-check: where is TTL/expiration defined?
- Ignoring duplicates: at-least-once delivery produces double counts. Self-check: what is your dedup key and sink strategy?
- Over-parallelizing beyond partitions: idle consumers waste resources. Self-check: consumers <= partitions?
- No reprocessing plan: fixes are risky. Self-check: how do you replay with the same logic and idempotent sink?
Practical projects
- Real-time product analytics: 1/5/60-minute metrics per product and region with late data handling.
- Sessionization: compute session lengths and conversion per user using session windows.
- CDC to warehouse: upsert user profiles with schema evolution and dedup by source sequence.
Who this is for
- Aspiring and practicing Data Platform Engineers building real-time data systems.
- Backend engineers integrating event-driven features.
- Analytics engineers needing fresh, continuously updated datasets.
Prerequisites
- Comfort with SQL and one programming language (e.g., Python/Java/Scala).
- Basic understanding of distributed systems (partitions, scaling).
- Familiarity with message logs/queues concepts.
Learning path
- Start: Stream concepts (this page).
- Next: Stream storage and delivery guarantees in practice.
- Then: Stateful operators, joins, and windowing patterns.
- Finally: Reprocessing, backfills, and end-to-end testing strategies.
Ready to check your understanding?
Take the Quick Test below. It is available to everyone; only logged-in users get saved progress and history.
Next steps
- Finish the exercises and ensure each checklist box is true.
- Run the quick test to confirm you’ve internalized the concepts.
- Pick one practical project and build a small end-to-end slice.
Mini challenge
You must compute rolling 1-hour revenue per SKU updated every 5 minutes, tolerate 10 minutes of lateness, and guarantee no duplicates in the sink.
- What window type and hop do you choose?
- What is your watermark and allowed lateness policy?
- How do you ensure exactly-once outcomes at the sink?
See one possible approach
Use a sliding window: size 60 minutes, hop 5 minutes on event time. Watermark: event time - 10 minutes; allow 5 minutes of late updates. Deduplicate by a stable event_id; write to a sink with idempotent upserts keyed by (sku, window_start). Checkpoint state and source offsets for recovery.
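Under those choices, the window arithmetic looks like this minimal sketch (illustrative names; integer event times in seconds; the idempotent sink keyed by (sku, window_start) is assumed, not shown):

```python
SIZE, HOP = 3600, 300   # 60-minute windows, advancing every 5 minutes

revenue = {}  # (sku, window_start) -> running sum for that window

def on_sale(sku, amount, event_time):
    # Each sale contributes to every hopping window that contains it
    # (SIZE / HOP = 12 windows per event).
    last_start = (event_time // HOP) * HOP
    for start in range(last_start - SIZE + HOP, last_start + 1, HOP):
        key = (sku, start)
        revenue[key] = revenue.get(key, 0.0) + amount

on_sale("sku-1", 10.0, 0)
on_sale("sku-1", 5.0, 600)
# Both sales fall inside the window starting at t=0.
assert revenue[("sku-1", 0)] == 15.0
```

When the watermark (max observed event time minus 10 minutes) passes a window's end, the window's sum is upserted by (sku, window_start); late events within the allowed 5 minutes simply re-upsert the same key.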