
Streaming Systems Basics

Learn Streaming Systems Basics for Data Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: January 8, 2026 | Updated: January 8, 2026

Why this skill matters for Data Engineers

Streaming systems let you process events continuously instead of waiting for daily batches. As a Data Engineer, this unlocks real-time dashboards, fraud detection, ML feature pipelines, alerts, and low-latency data services. You will design pipelines that are resilient to out-of-order events, duplicates, and schema evolution while keeping costs predictable and operations observable.

Who this is for

  • Data Engineers moving from batch ETL to real-time processing.
  • Backend engineers who need reliable event pipelines.
  • Analytics engineers building fresh dashboards and metrics from events.

Prerequisites

  • Comfort with Python or Scala basics.
  • SQL fundamentals (GROUP BY, JOIN, window functions helpful).
  • High-level understanding of distributed systems (partitions, replicas).

Learning path (practical roadmap)

Milestone 1: Foundations
  • Grasp event time vs processing time and why clocks differ.
  • Learn delivery semantics: at-most-once, at-least-once, exactly-once.
  • Understand windows (tumbling, sliding, session) and aggregations.
Milestone 2: Robustness
  • Handle late and out-of-order events with watermarks.
  • Design idempotent sinks and deduplication strategies.
  • Plan for schema evolution; understand conceptually how a schema registry enforces compatibility.
Milestone 3: Composition
  • Implement stream joins safely with time bounds.
  • Build a basic real-time pipeline end-to-end (ingest → transform → store → serve).
Milestone 4: Operability
  • Measure and alert on consumer lag and processing latency.
  • Apply backpressure and scaling strategies.
  • Add replay procedures for recovery.
Quick study plan (7–10 hours)
  • 1.5h: Concepts (events, windows, semantics, late data, joins).
  • 2.5h: Worked examples below (run and tweak them).
  • 1h: Build a mini pipeline using the mini project outline.
  • 1h: Add schema evolution and dedup keys.
  • 1h: Add basic monitoring metrics and alerts.
  • 1h: Take the exam; revisit weak areas.

Worked examples

Example 1 — Minimal Kafka-style producer/consumer flow

Goal: Send JSON events with event_time in UTC, consume, and print a parsed field.

# Producer (Python pseudocode)
import json, time, uuid, random, datetime as dt
# from kafka import KafkaProducer  # example library if available
producer = None  # assume producer created with idempotence enabled

users = ["u1","u2","u3"]
for i in range(5):
    evt = {
        "event_id": str(uuid.uuid4()),
        "user_id": random.choice(users),
        "event_time": dt.datetime.utcnow().isoformat() + "Z",
        "value": random.randint(1,100)
    }
    payload = json.dumps(evt).encode("utf-8")
    # producer.send("events", key=evt["user_id"].encode(), value=payload)
    print("Produced:", payload)
    time.sleep(0.2)

# Consumer (pseudocode)
# for msg in consumer.poll(...):
#     e = json.loads(msg.value)
#     print(e["user_id"], e["value"])  

Key points: include an event_id, use a partitioning key (user_id) for ordering, and encode event_time in UTC.

Example 2 — Tumbling window aggregation (Spark Structured Streaming style)
# Python-style Spark example (conceptual)
from pyspark.sql import functions as F

# events: user_id, event_time, value
# df = spark.readStream.format("kafka").option("subscribe","events").load()
# parsed = df.select(F.from_json(F.col("value").cast("string"), schema).alias("e")).select("e.*")

agg = (parsed
       .withColumn("ts", F.to_timestamp("event_time"))
       .groupBy(F.window("ts", "1 minute"), F.col("user_id"))
       .agg(F.sum("value").alias("sum_value")))

# query = agg.writeStream.outputMode("update").format("console").start()

Key points: Tumbling windows are fixed-size, non-overlapping. Use event time for accurate aggregation.
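
Sliding windows overlap: the window length stays fixed, but a shorter slide duration starts a new window more often, so each event lands in several windows. A minimal sketch of the same aggregation with a 2-minute window sliding every 30 seconds, reusing the assumed parsed stream from above:

# Sliding window: 2-minute windows starting every 30 seconds,
# so each event contributes to four overlapping windows.
sliding = (parsed
    .withColumn("ts", F.to_timestamp("event_time"))
    .groupBy(F.window("ts", "2 minutes", "30 seconds"), F.col("user_id"))
    .agg(F.sum("value").alias("sum_value")))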

Example 3 — Handling late events with watermarks
# Allow events up to 10 minutes late
with_wm = (parsed
    .withColumn("ts", F.to_timestamp("event_time"))
    .withWatermark("ts", "10 minutes")
    .groupBy(F.window("ts", "5 minutes"), "user_id")
    .agg(F.count("*").alias("cnt")))

Watermarks let the engine drop state for windows older than the watermark, bounding memory while tolerating a configurable amount of lateness.

Example 4 — Stream–stream join with time bounds
# Join click events to user profile updates within 15 minutes
clicks = (parsed.withColumn("ts", F.to_timestamp("event_time"))
          .withWatermark("ts", "15 minutes").alias("clicks"))
profiles = (profiles_stream.withColumn("ts2", F.to_timestamp("event_time"))
            .withWatermark("ts2", "15 minutes").alias("profiles"))

joined = clicks.join(
    profiles,
    F.expr("""
        clicks.user_id = profiles.user_id AND
        clicks.ts BETWEEN profiles.ts2 - interval 15 minutes
                      AND profiles.ts2 + interval 15 minutes
    """),
    "inner")

Require watermarks on both sides and a bounded time condition to keep state finite.

Example 5 — Achieving exactly-once to a warehouse via idempotent upserts
# Pattern: deduplicate by event_id and upsert
# 1) Ingest with at-least-once delivery.
# 2) In sink, MERGE by event_id (or (user_id, version)).
# 3) Make writes transactional if supported (e.g., a table with ACID semantics).

# Pseudocode transformation step
clean = parsed.dropDuplicates(["event_id"])  # engine-managed dedup if available; add a watermark so dedup state stays bounded
# sink.writeStream.option("mergeKey","event_id").format("delta").start()

Exactly-once is often achieved end-to-end by combining idempotent producers, transactional consumption, and idempotent upserts in sinks.
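
One way to implement the upsert step is a per-micro-batch MERGE via foreachBatch. A minimal sketch assuming Spark with a Delta Lake sink; the table name analytics.events and the checkpoint path are illustrative:

# Idempotent upsert per micro-batch (assumes the Delta Lake library is available).
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forName(spark, "analytics.events")  # illustrative table name
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# query = (clean.writeStream
#          .foreachBatch(upsert_batch)
#          .option("checkpointLocation", "/tmp/checkpoints/events")  # needed for recovery
#          .start())

Because the MERGE condition is keyed on event_id, replaying a micro-batch after a failure rewrites the same rows instead of appending duplicates.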

Example 6 — Estimating and monitoring consumer lag
# Lag = latest_offset - committed_offset, aggregated across partitions
# Example numbers per partition:
# p0: latest 1200, committed 1100 -> lag 100
# p1: latest 1500, committed 1490 -> lag 10
# Total lag: 110

# Heuristic: if (lag / events_per_min) > target_latency_minutes, scale consumers

Track both consumer lag and end-to-end latency (event_time to output_time). Alert when thresholds are breached.
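
A minimal sketch of the lag calculation and scaling heuristic above, in plain Python with the illustrative per-partition numbers (in practice the offsets come from your broker's admin or metrics API):

# Per-partition offsets (illustrative numbers matching the example above).
latest = {"p0": 1200, "p1": 1500}
committed = {"p0": 1100, "p1": 1490}

def total_lag(latest, committed):
    # Lag per partition = latest offset - committed offset, never negative.
    return sum(max(latest[p] - committed[p], 0) for p in latest)

lag = total_lag(latest, committed)   # 110
events_per_min = 50                  # measured processing rate (assumed)
target_latency_minutes = 5

if lag / events_per_min > target_latency_minutes:
    print("Catch-up would exceed the latency target: scale consumers")
else:
    print(f"Total lag {lag} is within the latency target")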

Drills and exercises

  • Create a small dataset with 50 events where 10 arrive out-of-order by up to 7 minutes (a generator sketch follows this list). Aggregate with a 5-minute window and a 10-minute watermark. Verify counts.
  • Inject duplicates (same event_id) and implement dedup in the pipeline. Confirm idempotent sink behavior.
  • Design a sliding window (2 minutes, slide 30 seconds) and compare results to a 2-minute tumbling window.
  • Build a stream–stream join with a 10-minute time bound and explain why unbounded joins are risky.
  • Simulate load to create consumer lag, then reduce lag by increasing parallelism. Record before/after metrics.
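
For the first drill, here is a minimal generator sketch, assuming JSON events shaped like Example 1: it emits 50 events in arrival order and pushes the event_time of 10 of them back by up to 7 minutes, so they arrive after newer events.

# Generate 50 events; make 10 of them late by up to 7 minutes.
import json, random, uuid, datetime as dt

now = dt.datetime.now(dt.timezone.utc)
events = [{
    "event_id": str(uuid.uuid4()),
    "user_id": random.choice(["u1", "u2", "u3"]),
    "event_time": (now + dt.timedelta(seconds=10 * i)).isoformat(),
    "value": random.randint(1, 100),
} for i in range(50)]

# Shift event_time back for 10 events while keeping their arrival position,
# so they show up out of order relative to event time.
for idx in random.sample(range(50), 10):
    ts = dt.datetime.fromisoformat(events[idx]["event_time"])
    events[idx]["event_time"] = (ts - dt.timedelta(seconds=random.randint(60, 420))).isoformat()

for e in events:
    print(json.dumps(e))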

Common mistakes and debugging tips

  • Mistake: Using processing time instead of event time. Tip: Always parse and use event_time; test with out-of-order inputs.
  • Mistake: No watermark, unbounded state. Tip: Set a watermark that matches your lateness tolerance; monitor state size.
  • Mistake: Believing exactly-once without verifying the sink. Tip: Ensure idempotent upsert or transactional writes; test with retries.
  • Mistake: Skewed or low-cardinality partition keys causing hotspots. Tip: Choose a key with high cardinality and an even distribution; monitor per-partition throughput.
  • Mistake: Joining unbounded streams without time constraints. Tip: Add watermarks to both sides and a bounded time range.
  • Mistake: Ignoring schema evolution. Tip: Use versioned schemas; make fields backward/forward compatible; add defaults for new optional fields.
  • Mistake: Lag alarms only. Tip: Track both lag and end-to-end latency; correlate with throughput and error rates.
Debugging checklist
  • Print sample records with event_time and processing_time to confirm ordering assumptions.
  • Turn on aggregation state metrics; watch watermark progression.
  • Force a duplicate batch and confirm idempotent behavior in the sink.
  • Stress test partition skew: randomize keys and check throughput per partition.
  • Simulate schema change: add an optional field and verify consumers continue to work.
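
For the schema-evolution tip and the last checklist item, a minimal consumer-side sketch in plain Python (no registry client): version 2 adds an optional plan_tier field, and the consumer supplies a default so both old and new payloads keep parsing. Field names are illustrative.

# Backward/forward-compatible read: default a new optional field.
import json

v1_payload = '{"event_id": "a1", "user_id": "u1", "value": 10}'
v2_payload = '{"event_id": "a2", "user_id": "u2", "value": 20, "plan_tier": "pro"}'

def parse_event(raw):
    e = json.loads(raw)
    e.setdefault("plan_tier", "basic")  # default keeps old producers compatible
    return e

for raw in (v1_payload, v2_payload):
    print(parse_event(raw))

A schema registry formalizes the same idea: new optional fields carry defaults, removals and type changes are checked against a compatibility mode, and consumers ignore fields they do not recognize.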

Mini project: Real-time metrics with late-event tolerance

Goal: Build a pipeline that ingests user events, computes per-user 5-minute tumbling metrics with 10-minute lateness tolerance, joins with a small user profile stream, and writes idempotently to an analytics table.

  • Input: events(user_id, event_id, event_time, value), profiles(user_id, plan, event_time).
  • Processing: parse JSON → watermark both streams → aggregate per user → time-bounded join to enrich with plan (see the skeleton after this outline).
  • Output: upsert to a table keyed by (window_start, user_id). Maintain an audit column last_updated.
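
A compact skeleton of the aggregation and sink, reusing the patterns from the worked examples (the plan-enrichment join follows Example 4; stream names and the upsert helper are assumptions):

# Per-user 5-minute metrics with 10-minute lateness tolerance,
# upserted by (window_start, user_id).
events = (parsed                      # user_id, event_id, event_time, value
    .withColumn("ts", F.to_timestamp("event_time"))
    .withWatermark("ts", "10 minutes"))

metrics = (events
    .groupBy(F.window("ts", "5 minutes").alias("w"), "user_id")
    .agg(F.sum("value").alias("sum_value"), F.count("*").alias("event_cnt"))
    .select(F.col("w.start").alias("window_start"), "user_id",
            "sum_value", "event_cnt",
            F.current_timestamp().alias("last_updated")))

# Upsert with foreachBatch + MERGE keyed on (window_start, user_id), as in Example 5;
# replayed batches then leave final results unchanged.
# query = (metrics.writeStream
#          .outputMode("update")
#          .foreachBatch(upsert_metrics)  # hypothetical helper, see Example 5
#          .option("checkpointLocation", "/tmp/checkpoints/metrics")
#          .start())
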
Acceptance criteria
  • Aggregations correct despite 10% late events up to 10 minutes.
  • Duplicate events do not change final results (idempotent).
  • Join does not grow unbounded state; watermarks advance.
  • Lag stays below a set threshold under moderate load.
Stretch goals
  • Add a 30-second sliding window metric alongside the tumbling window.
  • Derive an alert stream when value spikes above a dynamic threshold (e.g., rolling 95th percentile).
  • Expose basic health metrics: input rate, processed rate, lag, watermark age.

Subskills

This skill includes the following focus areas:

  • Event Time Versus Processing Time
  • Exactly Once At Least Once Concepts
  • Windowing And Aggregations Basics
  • Handling Late Events
  • Stream Join Concepts
  • Schema Registry Basics
  • Monitoring Streaming Lag
  • Designing Real Time Pipelines

Next steps

  • Deepen stream processing patterns (sessions, CEP, stateful operators, backfills).
  • Practice with production-style sinks (upserts, compaction, partitioning, retention).
  • Add observability: dashboards for lag, throughput, watermark age, and error rates.
  • Take the skill exam below; if you pass, continue to advanced streaming design topics.

Practical project ideas

  • Real-time sales dashboard: per-store and per-category metrics with 5-minute tumbling windows and a 15-minute watermark; join with catalog updates.
  • Anomaly pings: detect devices with error_rate above threshold over sliding windows; emit alerts to a notification topic.
  • Click-to-conversion funnel: sessionize clicks, join to conversion events within 24 hours, and compute conversion rates per campaign.
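
For the sessionization step in the funnel idea, most engines offer a session window defined by an inactivity gap: a session closes once no event arrives for the gap duration. A Spark-style sketch (session_window requires Spark 3.2+), assuming a clicks stream with user_id and an event-time column ts as in Example 4:

# Sessionize clicks with a 30-minute inactivity gap.
sessions = (clicks
    .withWatermark("ts", "1 hour")
    .groupBy(F.session_window("ts", "30 minutes"), "user_id")
    .agg(F.count("*").alias("clicks_in_session")))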

Streaming Systems Basics — Skill Exam

This is a self-check exam covering streaming fundamentals: time semantics, windows, delivery guarantees, joins, schema evolution, and operations. Everyone can take it for free. If you are logged in, your progress and best score will be saved automatically. Rules: closed-book is recommended but not required. Aim for 70% to pass. You can retake anytime.

15 questions | 70% to pass
