Why this matters
Stream joins let you connect events that arrive at different times and from different sources. As a Data Engineer, you will:
- Enrich click events with user or campaign data in real time.
- Correlate orders and payments to produce live revenue metrics.
- Detect fraud by joining card swipes with risk signals or blacklists.
- Build monitoring pipelines by joining logs with deployment or config streams.
Correct join design drives accuracy, latency, and cost. Misdesigned joins can drop events or blow up state.
Concept explained simply
A stream join matches events by key and time. Imagine two conveyor belts (streams) with items labeled by the same key. You only match items when they appear within a time window you choose.
Mental model
- Key by: partition both streams by the same join key.
- Window: define how long to keep events to find a match.
- Watermark: a progress signal that says how far event-time has likely advanced.
- State: temporary storage holding events until the window closes or matches occur.
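The "key by" step can be sketched in a few lines. This is a minimal illustration, not any engine's actual partitioner: the function name partition_for and the choice of CRC32 are assumptions. The point is only that both streams must use the same deterministic hash and the same partition count.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a join key to a partition using a stable hash (CRC32).

    Python's built-in hash() is randomized per process for strings,
    so it is not suitable for routing across workers.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical events from two streams that share the join key ad_id.
click = {"ad_id": "ad-42", "event_time": 100}
impression = {"ad_id": "ad-42", "event_time": 95}

# Same function and same partition count on both sides, so events
# with the same key are routed to the same partition.
same_partition = partition_for(click["ad_id"], 8) == partition_for(impression["ad_id"], 8)
```

If the two streams used different partition counts or different hash functions, matching events could land on different workers and never meet in state.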
Quick definitions
- Event time: when the event happened.
- Processing time: when the system processed it.
- Watermark: an assertion that no more events with an event time earlier than T are expected; events that arrive behind it are late.
- Allowed lateness: grace period after window close to still accept late events.
Time fundamentals you must set
- Choose event time wherever you can. It produces correct, reproducible results.
- Define watermarks to tolerate normal lateness (e.g., 2–5 minutes for mobile clients).
- Pick allowed lateness small but realistic; bigger lateness means more state cost.
How watermarks affect joins
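Windows close and state is cleaned only after the watermark passes the window end plus allowed lateness. If your watermark advances too aggressively, events that arrive behind it are treated as late and are dropped once allowed lateness is exhausted; if it is too conservative, state is held longer, so memory use and output latency grow.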
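The interaction between watermark, window end, and allowed lateness can be sketched as follows. This is a simplified model under stated assumptions: timestamps are integer seconds, the watermark is "largest event time seen minus a fixed bound", and window ends are exclusive. The names WatermarkTracker and classify are hypothetical and do not correspond to a specific engine's API.

```python
from dataclasses import dataclass


@dataclass
class WatermarkTracker:
    """Bounded-out-of-orderness watermark: max event time seen minus a fixed bound."""
    max_out_of_orderness: int  # seconds
    max_event_time: int = 0

    def observe(self, event_time: int) -> int:
        # The watermark never moves backwards, even if an older event arrives.
        self.max_event_time = max(self.max_event_time, event_time)
        return self.watermark

    @property
    def watermark(self) -> int:
        return self.max_event_time - self.max_out_of_orderness


def classify(window_end: int, watermark: int, allowed_lateness: int) -> str:
    """Classify an arriving event that belongs to a window ending at window_end."""
    if watermark < window_end:
        return "on_time"          # window still open
    if watermark < window_end + allowed_lateness:
        return "late_accepted"    # window closed, but within the grace period
    return "dropped"              # state for this window has been cleaned up


tracker = WatermarkTracker(max_out_of_orderness=180)  # 3-minute bound
window_end, allowed_lateness = 900, 120

tracker.observe(1000)   # watermark = 820
status_a = classify(window_end, tracker.watermark, allowed_lateness)  # "on_time"
tracker.observe(1100)   # watermark = 920
status_b = classify(window_end, tracker.watermark, allowed_lateness)  # "late_accepted"
tracker.observe(1250)   # watermark = 1070
status_c = classify(window_end, tracker.watermark, allowed_lateness)  # "dropped"
```

A smaller max_out_of_orderness moves events from "on_time" toward "dropped"; a larger one keeps windows open longer and therefore keeps more state.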
Join types and windows
- Stream–Stream join: match events from two unbounded streams by key and time.
- Stream–Table (lookup) join: enrich a stream with the latest value from a changelog-backed table.
- Table–Table join: join two changing tables (materialized views of streams).
Window styles for Stream–Stream
- Tumbling window: fixed, non-overlapping buckets (e.g., 5 minutes).
- Sliding window: overlapping windows that slide by a step.
- Session window: dynamic windows separated by periods of inactivity; less common for joins but useful for per-user correlations.
- Interval join (time-bounded): directly bound one stream relative to the other (e.g., A.ts within [B.ts - 5m, B.ts + 1m]).
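The window styles above differ mainly in how a timestamp is mapped to one or more windows. The sketch below assumes integer-second timestamps and half-open [start, end) windows; the function names are illustrative only. Session windows are omitted because they depend on gaps between events rather than on a single timestamp.

```python
def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """Return the single [start, end) bucket that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Return every [start, end) window of length size, stepping by slide, that contains ts."""
    windows = []
    start = ts - ts % slide          # latest window start that is <= ts
    while start > ts - size:         # window still covers ts
        windows.append((start, start + size))
        start -= slide
    return windows


def in_interval(a_ts: int, b_ts: int, lower: int, upper: int) -> bool:
    """Interval-join predicate: a_ts lies within [b_ts - lower, b_ts + upper]."""
    return b_ts - lower <= a_ts <= b_ts + upper


one_bucket = tumbling_window(425, size=300)            # (300, 600)
overlapping = sliding_windows(425, size=300, slide=100)
# [(400, 700), (300, 600), (200, 500)]
matches = in_interval(a_ts=700, b_ts=1000, lower=300, upper=60)  # True
```

Note that a tumbling window assigns each event to exactly one bucket, a sliding window assigns it to size / slide buckets, and an interval join uses no buckets at all: it compares the two timestamps directly.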
Join semantics
- Inner: output only when both sides match in the window.
- Left/Right outer: keep unmatched events from one side with nulls for the other side after the window closes.
- Full outer: keep unmatched events from both sides with nulls.
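To make the semantics concrete, here is a minimal sketch that joins the events of one window that has already closed. It assumes each side is a list of (key, value) pairs; the function join_window is hypothetical and ignores time entirely, since window assignment and closing happen before this step.

```python
def join_window(left, right, how="inner"):
    """Join two lists of (key, value) events belonging to the same closed window.

    how: "inner", "left", "right", or "full".
    Unmatched rows are emitted with None on the missing side.
    """
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)

    out = []
    matched_right_keys = set()
    for key, lval in left:
        matches = right_by_key.get(key, [])
        if matches:
            matched_right_keys.add(key)
            out.extend((key, lval, rval) for rval in matches)
        elif how in ("left", "full"):
            out.append((key, lval, None))

    if how in ("right", "full"):
        for key, values in right_by_key.items():
            if key not in matched_right_keys:
                out.extend((key, None, rval) for rval in values)
    return out


left = [("k1", "a"), ("k2", "b")]
right = [("k1", "x"), ("k3", "y")]

inner = join_window(left, right, "inner")  # [("k1", "a", "x")]
left_outer = join_window(left, right, "left")
# [("k1", "a", "x"), ("k2", "b", None)]
full_outer = join_window(left, right, "full")
# [("k1", "a", "x"), ("k2", "b", None), ("k3", None, "y")]
```

In a real streaming job the None-filled rows can only be emitted after the window closes, because until then a match may still arrive.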
Stream–Table vs Stream–Stream
- Stream–Table: low latency enrichment; usually uses latest table value at join time (processing time). Good for dimensions.
- Stream–Stream: correlation across time; both sides are unbounded and need windowing and state.
Designing a join (step-by-step)
- Define the business question. What counts as a match? How late can events arrive?
- Pick the key. Ensure both streams emit the same key field and are partitioned consistently.
- Choose time semantics. Prefer event time with a realistic watermark and allowed lateness.
- Select the join type. Inner for matched-only metrics; outer if you must output unmatched events.
- Pick window/interval size. Big enough to catch expected delays; small enough to bound state.
- Plan state cleanup. Configure TTL and window closing behavior.
- Decide on late data policy. Drop, route to a dead-letter topic, or trigger corrections.
Worked examples
Example 1: Clicks with Impressions (Stream–Stream, interval)
Goal: For each click, find the most recent impression of the same ad within the last 5 minutes.
- Key: ad_id
- Join: Click c INNER JOIN Impression i ON c.ad_id = i.ad_id AND i.event_time BETWEEN c.event_time - 5m AND c.event_time
- Watermark: 3 minutes; Allowed lateness: 2 minutes
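- Result: The join pairs each click with every impression of the same ad in the preceding 5 minutes; to meet the goal, keep only the latest impression per click (for example, by ranking on i.event_time). Clicks with no prior impression within 5 minutes are dropped.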
Why interval join over tumbling
Clicks and impressions rarely align to the same fixed window boundaries. An interval join aligns to event timestamps, not buckets, so each click is evaluated against its own 5-minute lookback: an impression just before a bucket boundary still matches a click just after it.
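Example 1 can be sketched over a small buffered set of events. This is a batch-style illustration under assumptions (integer-second timestamps, all events already in memory, a hypothetical function name); a streaming engine would evaluate the same predicate incrementally against keyed state.

```python
from collections import defaultdict

FIVE_MIN = 5 * 60


def join_clicks_to_impressions(clicks, impressions, lookback=FIVE_MIN):
    """Pair each click with the latest impression of the same ad_id
    whose event_time is in [click_time - lookback, click_time].
    Clicks without a qualifying impression are dropped (inner join)."""
    by_ad = defaultdict(list)
    for imp in impressions:
        by_ad[imp["ad_id"]].append(imp)

    results = []
    for click in clicks:
        ts = click["event_time"]
        candidates = [
            imp for imp in by_ad.get(click["ad_id"], [])
            if ts - lookback <= imp["event_time"] <= ts
        ]
        if candidates:
            # The interval predicate may match several impressions;
            # keep only the most recent one.
            latest = max(candidates, key=lambda imp: imp["event_time"])
            results.append((click, latest))
    return results


impressions = [
    {"ad_id": "ad1", "event_time": 100},
    {"ad_id": "ad1", "event_time": 300},
    {"ad_id": "ad1", "event_time": 700},
]
clicks = [
    {"ad_id": "ad1", "event_time": 500},   # matches the impression at 300
    {"ad_id": "ad1", "event_time": 1100},  # no impression in [800, 1100]
    {"ad_id": "ad2", "event_time": 500},   # no impressions for this ad
]
pairs = join_clicks_to_impressions(clicks, impressions)
```

Only the first click produces output. The impression at 700 is ignored for it because it happened after the click.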
Example 2: Payments with Orders (Stream–Stream, left outer)
Goal: Identify payments that do not have a matching order within 10 minutes to catch anomalies.
- Key: order_id
- Join: Payment p LEFT OUTER JOIN Order o ON p.order_id = o.order_id AND o.event_time BETWEEN p.event_time - 10m AND p.event_time + 10m
- Watermark: 5 minutes; Allowed lateness: 2 minutes
- Result: Payments without an order within 10 minutes appear with null order fields once the watermark passes the payment's event time plus 10 minutes and the allowed lateness.
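The left outer behavior in Example 2 depends on when unmatched payments are released. The sketch below is a simplified, single-process model with several assumptions: at most one payment and one order per order_id, integer-second timestamps, and a hypothetical class name PaymentOrderJoin. It shows the idea of holding a payment in state until either a match arrives or the watermark proves none can.

```python
class PaymentOrderJoin:
    """Left outer interval join: payment LEFT JOIN order within +/- bound seconds."""

    def __init__(self, bound=600, allowed_lateness=120):
        self.bound = bound
        self.allowed_lateness = allowed_lateness
        self.orders = {}    # order_id -> order event_time
        self.pending = {}   # order_id -> payment still waiting for a match

    def on_order(self, order_id, event_time):
        self.orders[order_id] = event_time
        payment = self.pending.get(order_id)
        if payment and abs(payment["event_time"] - event_time) <= self.bound:
            del self.pending[order_id]
            return [(payment, {"order_id": order_id, "event_time": event_time})]
        return []

    def on_payment(self, payment):
        order_ts = self.orders.get(payment["order_id"])
        if order_ts is not None and abs(payment["event_time"] - order_ts) <= self.bound:
            return [(payment, {"order_id": payment["order_id"], "event_time": order_ts})]
        self.pending[payment["order_id"]] = payment
        return []

    def on_watermark(self, watermark):
        """Emit payments that can no longer match, and expire old orders."""
        horizon = watermark - self.bound - self.allowed_lateness
        out = []
        for order_id, payment in list(self.pending.items()):
            if payment["event_time"] <= horizon:
                out.append((payment, None))   # null-filled order side
                del self.pending[order_id]
        self.orders = {oid: ts for oid, ts in self.orders.items() if ts >= horizon}
        return out


join = PaymentOrderJoin()
join.on_order("o1", 1000)
matched = join.on_payment({"order_id": "o1", "amount": 20, "event_time": 1400})
join.on_payment({"order_id": "o2", "amount": 5, "event_time": 1500})
too_early = join.on_watermark(2000)   # [] because 1500 + 600 + 120 > 2000
unmatched = join.on_watermark(2220)   # payment o2 emitted with None
```

The design choice here is that the anomaly signal is delayed by the interval plus allowed lateness. Shrinking either one gives faster alerts at the cost of more false positives from late orders.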
Example 3: Stream–Table enrichment (User profile)
Goal: Attach the latest user tier to each event instantly.
- Key: user_id
- Join: Events e JOIN UsersTable u ON e.user_id = u.user_id
- Semantics: Typically uses the latest table value at join time; no window needed.
- Result: Each event carries current user_tier; historical backfill requires time-travel techniques, which are separate from a basic stream–table join.
Edge case: rapidly changing dimensions
If the dimension updates frequently, decide whether to accept minor version skew or buffer events to align with dimension update timestamps.
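A stream-table join can be sketched as a keyed lookup against the latest value. The class UsersTable and the function enrich below are hypothetical stand-ins for a changelog-backed table and an enrichment operator. One assumption to note: this sketch keeps events for unknown users and sets user_tier to None, which is a left-join choice; an inner join would drop them instead.

```python
class UsersTable:
    """In-memory stand-in for a table materialized from a changelog stream."""

    def __init__(self):
        self._rows = {}

    def upsert(self, user_id, user_tier):
        # Each changelog record overwrites the previous value for the key.
        self._rows[user_id] = user_tier

    def lookup(self, user_id):
        return self._rows.get(user_id)


def enrich(event, table):
    """Attach the tier that is current at the moment the event is processed."""
    return {**event, "user_tier": table.lookup(event["user_id"])}


table = UsersTable()
table.upsert("u1", "silver")
first = enrich({"user_id": "u1", "action": "view"}, table)

table.upsert("u1", "gold")   # the dimension changes between events
second = enrich({"user_id": "u1", "action": "buy"}, table)

unknown = enrich({"user_id": "u9", "action": "view"}, table)
```

The first event keeps "silver" even though the user is now "gold". That is the version skew described above: the result reflects the table at processing time, not at the event's own timestamp.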
State and performance notes
- Partitioning: both sides must be partitioned by the same key to co-locate matches.
- State size drivers: window size, allowed lateness, cardinality of keys, and event rates.
- Backpressure: large windows and skewed keys increase memory usage and latency.
- Skew handling: hot keys may need key salting or custom load balancing.
- TTL/cleanup: configure state TTL to avoid unbounded growth.
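The state size drivers above can be combined into a back-of-envelope estimate. This is a rough sketch with assumed inputs, not a sizing formula from any engine: it assumes steady event rates, that every event is retained for the window plus allowed lateness, and it ignores watermark lag, index overhead, and replication.

```python
def estimate_buffered_events(events_per_sec, window_sec, allowed_lateness_sec):
    """Events held in state for one side of the join at steady rate."""
    return events_per_sec * (window_sec + allowed_lateness_sec)


def estimate_state_bytes(left_rate, right_rate, window_sec, allowed_lateness_sec, avg_event_bytes):
    """Total buffered bytes across both sides of a stream-stream join."""
    total_events = (
        estimate_buffered_events(left_rate, window_sec, allowed_lateness_sec)
        + estimate_buffered_events(right_rate, window_sec, allowed_lateness_sec)
    )
    return total_events * avg_event_bytes


# Hypothetical workload: 2,000 and 500 events per second, a 10-minute window,
# 2 minutes of allowed lateness, and 200 bytes per stored event.
state_bytes = estimate_state_bytes(2000, 500, 600, 120, 200)
state_mb = state_bytes / 1_000_000   # 360.0
```

Because retention time multiplies the event rate, doubling the window or the allowed lateness grows state almost proportionally. Key skew does not change the total, but it concentrates that state on a few partitions.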
Self-check for performance
- Is your watermark lag close to observed event delays?
- Are you seeing many late drops? Consider slightly larger allowed lateness.
- Is state growing steadily? Revisit window/TTL and check skew.
Common mistakes and how to self-check
- Using processing time for out-of-order data. Fix: use event time with watermarks.
- Too-large windows causing memory blowups. Fix: tighten window or key-space; consider interval join.
- Mismatched partitioning. Fix: ensure both streams are keyed by the same field before the join.
- No plan for late data. Fix: set allowed lateness and define a side output or correction path.
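- Outer joins without a clear window close. Fix: emit unmatched rows with nulls only after the watermark passes the window end plus allowed lateness, and decide how to handle a match that arrives after that point.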
Exercises
Do the task below. Then check your work.
Exercise 1 — Orders and Payments Join
Design a join to correlate orders and payments. Constraints:
- Events: orders(order_id, user_id, event_time) and payments(order_id, amount, event_time).
- Payments can arrive up to 8 minutes after orders; orders can arrive up to 2 minutes late.
- You need a single output per matched pair; unmatched payments must still be visible.
Provide: join type, window or interval definition, time semantics (watermark and allowed lateness), and a short pseudo-SQL or step plan.
Sample events to reason about
{"orders": [{"order_id": "o1", "event_time": "12:00:10"}],
"payments": [{"order_id": "o1", "event_time": "12:07:55"}]}
Checklist before you validate:
- Both streams keyed by order_id
- Event-time join with watermark
- Unmatched payments emitted (outer join)
- Reasonable state bounds (window + allowed lateness)
Mini challenge
You have stream A: shipment_status(order_id, status, event_time) and stream B: delivery_scan(order_id, scan_type, event_time). You need to match the first delivery_scan after the shipment reaches status="OutForDelivery" within 6 hours, and alert if none occurs. Sketch the join type, interval, and late data handling.
Hint
- Interval join anchored on the status event time.
- Left outer join to surface missing scans.
- Consider deduplicating to the first scan only.
Who this is for
- Data Engineers building real-time enrichment and correlation pipelines.
- Analytics engineers moving batch joins into streaming.
- Backend engineers integrating event streams with dimension data.
Prerequisites
- Understanding of event vs processing time and watermarks.
- Basics of keyed streams and partitioning.
- Comfort with windowing concepts (tumbling, sliding, session).
Learning path
- Before: Windows and Watermarks Basics
- This lesson: Stream Join Concepts
- Next: Exactly-once and State Management in Streaming
Practical projects
- Real-time attribution: join clicks to conversions with a 30-minute window.
- Order fulfillment tracker: join orders, payments, and shipment updates.
- Ops monitoring: join error logs with deployment events to highlight impacted services.
Next steps
- Take the quick test below.
- Implement a small proof-of-concept join with synthetic streams to validate latency and state size.