Why this skill matters for Data Engineers
Streaming systems let you process events continuously instead of waiting for daily batches. As a Data Engineer, this unlocks real-time dashboards, fraud detection, ML feature pipelines, alerts, and low-latency data services. You will design pipelines that are resilient to out-of-order events, duplicates, and schema evolution while keeping costs predictable and operations observable.
Who this is for
- Data Engineers moving from batch ETL to real-time processing.
- Backend engineers who need reliable event pipelines.
- Analytics engineers building fresh dashboards and metrics from events.
Prerequisites
- Comfort with Python or Scala basics.
- SQL fundamentals (GROUP BY, JOIN, window functions helpful).
- High-level understanding of distributed systems (partitions, replicas).
Learning path (practical roadmap)
- Grasp event time vs processing time and why clocks differ.
- Learn delivery semantics: at-most-once, at-least-once, exactly-once.
- Understand windows (tumbling, sliding, session) and aggregations.
- Handle late and out-of-order events with watermarks.
- Design idempotent sinks and deduplication strategies.
- Plan for schema evolution with a schema registry conceptually.
- Implement stream joins safely with time bounds.
- Build a basic real-time pipeline end-to-end (ingest → transform → store → serve).
- Measure and alert on consumer lag and processing latency.
- Apply backpressure and scaling strategies.
- Add replay procedures for recovery.
Quick study plan (7–10 hours)
- 1.5h: Concepts (events, windows, semantics, late data, joins).
- 2.5h: Worked examples below (run and tweak them).
- 1h: Build a mini pipeline using the mini project outline.
- 1h: Add schema evolution and dedup keys.
- 1h: Add basic monitoring metrics and alerts.
- 1h: Take the exam; revisit weak areas.
Worked examples
Example 1 — Minimal Kafka-style producer/consumer flow
Goal: Send JSON events with event_time in UTC, consume, and print a parsed field.
# Producer (Python pseudocode)
import json, time, uuid, random, datetime as dt
# from kafka import KafkaProducer  # example library if available

producer = None  # assume a producer created with idempotence enabled
users = ["u1", "u2", "u3"]
for i in range(5):
    evt = {
        "event_id": str(uuid.uuid4()),                               # unique id for downstream dedup
        "user_id": random.choice(users),                             # partitioning key
        "event_time": dt.datetime.now(dt.timezone.utc).isoformat(),  # event time in UTC (ISO 8601)
        "value": random.randint(1, 100),
    }
    payload = json.dumps(evt).encode("utf-8")
    # producer.send("events", key=evt["user_id"].encode(), value=payload)
    print("Produced:", payload)
    time.sleep(0.2)

# Consumer (pseudocode)
# for msg in consumer:
#     e = json.loads(msg.value)
#     print(e["user_id"], e["value"])
Key points: include an event_id for deduplication, use a partitioning key (user_id) so events for the same user stay ordered within a partition, and encode event_time in UTC.
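If a real client is available, an idempotent producer might be configured like the minimal sketch below. It assumes the confluent-kafka package and a broker at localhost:9092; the topic name, key, and value are placeholders.
# Sketch only: assumes confluent-kafka is installed and a broker is reachable at localhost:9092
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "enable.idempotence": True,             # broker drops duplicates from retried sends
    "acks": "all",                          # wait for all in-sync replicas
})

def on_delivery(err, msg):
    # Called once per message with the final delivery outcome
    if err is not None:
        print("Delivery failed:", err)

producer.produce("events", key=b"u1", value=b'{"value": 42}', callback=on_delivery)
producer.flush()  # block until outstanding messages are acknowledged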
Example 2 — Tumbling window aggregation (Spark Structured Streaming style)
# Python-style Spark example (conceptual)
from pyspark.sql import functions as F
# events: user_id, event_time, value
# df = spark.readStream.format("kafka").option("subscribe","events").load()
# parsed = df.select(F.from_json(F.col("value").cast("string"), schema).alias("e")).select("e.*")
agg = (parsed
       .withColumn("ts", F.to_timestamp("event_time"))           # parse event time from the payload
       .groupBy(F.window("ts", "1 minute"), F.col("user_id"))    # 1-minute tumbling window per user
       .agg(F.sum("value").alias("sum_value")))
# query = agg.writeStream.outputMode("update").format("console").start()
Key points: Tumbling windows are fixed-size, non-overlapping. Use event time for accurate aggregation.
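To build intuition without a cluster, the bucketing rule behind a tumbling window can be sketched in plain Python; the 60-second window and the sample events below are purely illustrative.
# Sketch: assign each event to a fixed, non-overlapping 60-second bucket
from collections import defaultdict

WINDOW_SECONDS = 60
events = [(5, 10), (42, 7), (61, 3), (119, 2), (130, 8)]  # (epoch_seconds, value)

sums = defaultdict(int)
for ts, value in events:
    window_start = ts - (ts % WINDOW_SECONDS)  # floor to the window boundary
    sums[window_start] += value

print(dict(sums))  # {0: 17, 60: 5, 120: 8}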
Example 3 — Handling late events with watermarks
# Allow events up to 10 minutes late
with_wm = (parsed
           .withColumn("ts", F.to_timestamp("event_time"))
           .withWatermark("ts", "10 minutes")
           .groupBy(F.window("ts", "5 minutes"), "user_id")
           .agg(F.count("*").alias("cnt")))
Watermarks let the engine drop state for windows older than the watermark, bounding memory while tolerating a configurable amount of lateness.
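The watermark rule itself is simple enough to sketch without an engine; the example below assumes 10 minutes of allowed lateness, as above, with times in epoch seconds.
# Sketch: a watermark trails the maximum observed event time by the allowed lateness
ALLOWED_LATENESS = 10 * 60  # seconds
max_event_time = 0
watermark = 0

def on_event(event_time_s):
    # Advance the watermark; windows ending before it can be finalized and their state dropped
    global max_event_time, watermark
    max_event_time = max(max_event_time, event_time_s)
    watermark = max_event_time - ALLOWED_LATENESS
    return event_time_s >= watermark  # False means the event is too late and is discarded

print(on_event(1200), on_event(1800), on_event(1100))  # True True False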
Example 4 — Stream–stream join with time bounds
# Join click events to user profile updates within 15 minutes
clicks = parsed.withColumn("ts", F.to_timestamp("event_time")).withWatermark("ts", "15 minutes").alias("clicks")
profiles = profiles_stream.withColumn("ts2", F.to_timestamp("event_time")).withWatermark("ts2", "15 minutes").alias("profiles")
joined = clicks.join(
    profiles,
    F.expr("clicks.user_id = profiles.user_id AND clicks.ts BETWEEN profiles.ts2 - interval 15 minutes AND profiles.ts2 + interval 15 minutes"),
    "inner")
Require watermarks on both sides and a bounded time condition to keep state finite.
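A rough illustration of why the bound matters, using a hypothetical click buffer in plain Python rather than engine code:
# Sketch: the time bound keeps join state finite — buffered rows eventually become unmatchable
JOIN_BOUND_S = 15 * 60
click_buffer = [(100, "u1"), (950, "u2"), (2000, "u3")]  # (event_time_s, user_id) awaiting profile rows

def evict(watermark_s):
    # A click older than watermark - bound can no longer match any future profile update,
    # so it is safe to drop; without a bound nothing could ever be evicted.
    global click_buffer
    click_buffer = [(ts, uid) for ts, uid in click_buffer if ts >= watermark_s - JOIN_BOUND_S]

evict(1500)
print(click_buffer)  # [(950, 'u2'), (2000, 'u3')] — the click at t=100 was released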
Example 5 — Achieving exactly-once to a warehouse via idempotent upserts
# Pattern: deduplicate by event_id and upsert
# 1) Ingest with at-least-once delivery.
# 2) In sink, MERGE by event_id (or (user_id, version)).
# 3) Make writes transactional if supported (e.g., a table with ACID semantics).
# Pseudocode transformation step
clean = parsed.dropDuplicates(["event_id"])  # engine-managed dedup if available; pair with a watermark to bound dedup state
# clean.writeStream.foreachBatch(merge_by_event_id).start()  # merge_by_event_id upserts (MERGE) into the target keyed by event_id
Exactly-once is often achieved end-to-end by combining idempotent producers, transactional consumption, and idempotent upserts in sinks.
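To see why idempotent upserts make retries safe, here is a toy in-memory sink keyed by event_id; a real sink would use MERGE or an equivalent transactional upsert.
# Sketch: replaying the same batch twice leaves the "table" unchanged
table = {}  # event_id -> row, standing in for a warehouse table with a merge key

def upsert(rows):
    for row in rows:
        table[row["event_id"]] = row  # insert or overwrite by key: safe to retry

batch = [{"event_id": "e1", "value": 10}, {"event_id": "e2", "value": 7}]
upsert(batch)
upsert(batch)  # simulated redelivery after a failure
print(len(table))  # 2, not 4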
Example 6 — Estimating and monitoring consumer lag
# Lag = latest_offset - committed_offset, aggregated across partitions
# Example numbers per partition:
# p0: latest 1200, committed 1100 -> lag 100
# p1: latest 1500, committed 1490 -> lag 10
# Total lag: 110
# Heuristic: if (lag / events_per_min) > target_latency_minutes, scale consumers
Track both consumer lag and end-to-end latency (event_time to output_time). Alert when thresholds are breached.
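The lag arithmetic above as a small helper, using the example offsets; the throughput figure in the heuristic is an assumption, not real output.
# Sketch: total lag across partitions, using the example offsets above
latest    = {"p0": 1200, "p1": 1500}
committed = {"p0": 1100, "p1": 1490}

lag = {p: latest[p] - committed[p] for p in latest}
total_lag = sum(lag.values())
print(lag, total_lag)  # {'p0': 100, 'p1': 10} 110

events_per_min = 50          # assumed throughput for the heuristic
target_latency_minutes = 5
if total_lag / events_per_min > target_latency_minutes:
    print("Consider scaling out consumers")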
Drills and exercises
- Create a small dataset with 50 events where 10 arrive out-of-order by up to 7 minutes (a data generator sketch follows this list). Aggregate with a 5-minute window and a 10-minute watermark. Verify counts.
- Inject duplicates (same event_id) and implement dedup in the pipeline. Confirm idempotent sink behavior.
- Design a sliding window (2 minutes, slide 30 seconds) and compare results to a 2-minute tumbling window.
- Build a stream–stream join with a 10-minute time bound and explain why unbounded joins are risky.
- Simulate load to create consumer lag, then reduce lag by increasing parallelism. Record before/after metrics.
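A possible generator for the first drill's dataset; the start time, spacing, and delay pattern are arbitrary choices.
# Sketch: 50 events, 10 of which arrive up to 7 minutes late (out of order)
import json, random, uuid, datetime as dt

base = dt.datetime(2024, 1, 1, tzinfo=dt.timezone.utc)  # hypothetical start time
events = []
for i in range(50):
    event_time = base + dt.timedelta(seconds=10 * i)
    delay = dt.timedelta(minutes=random.randint(1, 7)) if i % 5 == 0 else dt.timedelta(0)
    events.append({
        "event_id": str(uuid.uuid4()),
        "user_id": f"u{i % 3}",
        "event_time": event_time.isoformat(),
        "arrival_time": (event_time + delay).isoformat(),  # when it reaches the pipeline
        "value": random.randint(1, 100),
    })
events.sort(key=lambda e: e["arrival_time"])  # feed in arrival order, aggregate by event_time
print(json.dumps(events[0], indent=2))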
Common mistakes and debugging tips
- Mistake: Using processing time instead of event time. Tip: Always parse and use event_time; test with out-of-order inputs.
- Mistake: No watermark, unbounded state. Tip: Set a watermark that matches your lateness tolerance; monitor state size.
- Mistake: Believing exactly-once without verifying the sink. Tip: Ensure idempotent upsert or transactional writes; test with retries.
- Mistake: Poorly chosen partition keys causing hotspots. Tip: Choose a key with high cardinality and an even distribution; watch for skewed partitions.
- Mistake: Joining unbounded streams without time constraints. Tip: Add watermarks to both sides and a bounded time range.
- Mistake: Ignoring schema evolution. Tip: Use versioned schemas; make fields backward/forward compatible; add defaults for new optional fields.
- Mistake: Alerting on consumer lag alone. Tip: Track both lag and end-to-end latency; correlate with throughput and error rates.
Debugging checklist
- Print sample records with event_time and processing_time to confirm ordering assumptions.
- Turn on aggregation state metrics; watch watermark progression.
- Force a duplicate batch and confirm idempotent behavior in the sink.
- Stress test partition skew: randomize keys and check throughput per partition.
- Simulate schema change: add an optional field and verify consumers continue to work.
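For that schema-change check, a consumer can read a newly added optional field defensively; a minimal sketch, where the plan field and its default are hypothetical:
# Sketch: tolerate an optional field added by a newer producer
import json

old_msg = b'{"event_id": "e1", "user_id": "u1", "value": 5}'
new_msg = b'{"event_id": "e2", "user_id": "u2", "value": 9, "plan": "pro"}'  # hypothetical new field

for raw in (old_msg, new_msg):
    e = json.loads(raw)
    plan = e.get("plan", "free")  # default keeps older events valid
    print(e["event_id"], plan)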
Mini project: Real-time metrics with late-event tolerance
Goal: Build a pipeline that ingests user events, computes per-user 5-minute tumbling metrics with 10-minute lateness tolerance, joins with a small user profile stream, and writes idempotently to an analytics table.
- Input: events(user_id, event_id, event_time, value), profiles(user_id, plan, event_time).
- Processing: parse JSON → watermark both streams → aggregate per user → time-bounded join to enrich with plan.
- Output: upsert to a table keyed by (window_start, user_id). Maintain an audit column last_updated.
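One possible shape for the idempotent write step, assuming Spark's foreachBatch and a Delta Lake table; the table name, columns, and the enriched stream are placeholders, not a prescribed implementation.
# Sketch: upsert each micro-batch keyed by (window_start, user_id), assuming Delta Lake
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forName(spark, "analytics.user_metrics")  # placeholder table name
    (target.alias("t")
           .merge(batch_df.withColumn("last_updated", F.current_timestamp()).alias("s"),
                  "t.window_start = s.window_start AND t.user_id = s.user_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

# enriched.writeStream.foreachBatch(upsert_batch).outputMode("update").start()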
Acceptance criteria
- Aggregations remain correct when 10% of events arrive up to 10 minutes late.
- Duplicate events do not change final results (idempotent).
- Join does not grow unbounded state; watermarks advance.
- Lag stays below a set threshold under moderate load.
Stretch goals
- Add a 30-second sliding window metric alongside the tumbling window.
- Derive an alert stream when value spikes above a dynamic threshold (e.g., rolling 95th percentile; see the sketch after this list).
- Expose basic health metrics: input rate, processed rate, lag, watermark age.
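A simple way to approximate the dynamic threshold from the second stretch goal; the window size and warm-up length are arbitrary choices.
# Sketch: flag values above a rolling ~95th percentile of the last 100 observations
from collections import deque
import random

history = deque(maxlen=100)  # rolling window of recent values

def is_spike(value):
    # Compare against the ~95th percentile of recent history, then record the new value
    spike = False
    if len(history) >= 20:  # minimal warm-up before alerting
        threshold = sorted(history)[int(0.95 * len(history)) - 1]
        spike = value > threshold
    history.append(value)
    return spike

for v in (random.randint(1, 100) for _ in range(50)):
    is_spike(v)          # warm up with normal traffic
print(is_spike(500))     # True: a large outlier is flagged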
Subskills
This skill includes the following focus areas:
- Event Time Versus Processing Time
- Exactly Once At Least Once Concepts
- Windowing And Aggregations Basics
- Handling Late Events
- Stream Join Concepts
- Schema Registry Basics
- Monitoring Streaming Lag
- Designing Real Time Pipelines
Next steps
- Deepen stream processing patterns (sessions, CEP, stateful operators, backfills).
- Practice with production-style sinks (upserts, compaction, partitioning, retention).
- Add observability: dashboards for lag, throughput, watermark age, and error rates.
- Take the skill exam below; if you pass, continue to advanced streaming design topics.
Practical project ideas
- Real-time sales dashboard: per-store and per-category metrics with 5-minute tumbling windows and a 15-minute watermark; join with catalog updates.
- Anomaly pings: detect devices with error_rate above threshold over sliding windows; emit alerts to a notification topic.
- Click-to-conversion funnel: sessionize clicks, join to conversion events within 24 hours, and compute conversion rates per campaign.