
Handling Late Events

Learn Handling Late Events for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Who this is for

  • Data Engineers building real-time or near-real-time pipelines.
  • Analytics Engineers and ML Engineers consuming streaming aggregates.
  • Developers integrating event-driven features that must tolerate network delays, retries, and out-of-order data.

Prerequisites

  • Basic understanding of event streams, topics/partitions, and consumer groups.
  • Familiarity with windowing (tumbling/sliding/session) and aggregation.
  • High-level knowledge of a streaming engine (e.g., Beam, Flink, Spark Structured Streaming, Kafka Streams) is helpful but not required.

Why this matters

In production, events rarely arrive perfectly on time or in order. Phones go offline, retries happen, and networks spike. As a Data Engineer, you will:

  • Compute metrics (e.g., hourly sales, unique users) even when some events arrive minutes or hours late.
  • Prevent double counting by deduplicating late replays.
  • Balance freshness with correctness by choosing acceptable grace periods.
  • Decide whether to update past results, emit corrections, or route stragglers to a separate sink.

Concept explained simply

Two clocks exist in streaming:

  • Event time: when the event actually happened (from the producer/payload).
  • Processing time: when the event is seen by your job.

A watermark is the system’s best guess that “we have likely seen all events up to time T.” Allowed lateness is a grace period after a window’s end during which late events are still accepted and applied. Triggers control when partial or final results are emitted. Accumulation modes control how outputs evolve (append only, update, or retract + update).
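
A minimal sketch of these ideas in plain Python (no particular engine; the window size, delay, and function names are illustrative): the watermark trails the largest event time seen so far, and a window is treated as complete only once the watermark passes its end.

# Sketch (plain Python, illustrative)
from datetime import datetime, timedelta

MAX_OOO_DELAY = timedelta(minutes=5)   # assumed max out-of-order delay
WINDOW_SIZE = timedelta(hours=1)
watermark = datetime.min               # "we have likely seen everything up to this time"

def window_start(event_time):
    # Tumbling 1h window the event belongs to -- based on when it happened, not when it arrived
    return event_time.replace(minute=0, second=0, microsecond=0)

def on_event(event_time):
    global watermark
    # Watermark = largest observed event time minus the out-of-order allowance
    watermark = max(watermark, event_time - MAX_OOO_DELAY)
    return window_start(event_time)

def window_probably_complete(win_start):
    # A window is (probably) complete once the watermark passes its end
    return watermark >= win_start + WINDOW_SIZE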

Mental model

Imagine a conveyor belt (event time) with buckets for each minute. You scoop beans (events) into their correct buckets. The watermark is a moving line across the belt, telling you that earlier buckets are probably complete. Allowed lateness is a short grace period to still toss stray beans into an earlier bucket. If a bean shows up way too late, you decide: throw it away, put it in a "late bin" for inspection, or recalc an adjusted report.

Deep dive: Watermarks vs Allowed Lateness

  • Watermark: an estimate of completeness, often the maximum observed event_time minus a max out-of-order delay.
  • Allowed lateness: how long after the window end you still accept updates for that window.
  • Late beyond allowed lateness: usually dropped without touching the window’s state; optionally side-output for analysis.

Choosing an Accumulation Mode

  • Discarding: each firing emits only what arrived since the previous firing. Simplest, but downstream must combine panes itself and late corrections are easy to miss.
  • Accumulating: each firing includes prior counts; late events add to totals. Downstream must handle updated values.
  • Accumulating-and-retracting: emits retractions for previously emitted values, then corrected totals. Best for precise downstream sinks that support upserts or retractions.

Where do late events go?

  • Update the window (if within allowed lateness).
  • Route to a side output or a "late_events" topic/table for monitoring or offline reprocessing.
  • Drop entirely if your SLA prioritizes freshness and downstream cannot handle corrections. (The routing decision is sketched below.)
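
A sketch of that routing decision, assuming the job tracks the current watermark and a fixed allowed lateness (names are illustrative, not a specific engine’s API):

# Sketch: where does an event for a given window go?
from datetime import timedelta

ALLOWED_LATENESS = timedelta(minutes=10)

def route(window_end, watermark):
    if watermark < window_end:
        return "on time: update the still-open window"
    elif watermark < window_end + ALLOWED_LATENESS:
        return "late but accepted: update the window and emit a correction"
    else:
        return "too late: side-output (or drop); do not touch window state"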

Worked examples

Example 1: Hourly revenue with moderate lateness

  1. Window: tumbling 1h on event time.
  2. Watermark: event_time minus 5 minutes.
  3. Allowed lateness: 10 minutes.
  4. Accumulation: accumulating-and-retracting to keep downstream DB correct.
Timeline for the 08:00–09:00 window:
At 09:02, the watermark is ≈ 08:57, so the window is still open (it has not yet fired its on-time result).
Once the watermark passes 09:10 (window end + 10 minutes of allowed lateness), the window is finalized; further events for it are side-output.

Result: Late orders arriving at 09:07 still update the 08:00–09:00 total; orders for that window arriving at 09:25 go to the late_events sink.

Example 2: Mobile sessions with long offline periods

  1. Window: session windows with 30-minute gap.
  2. Watermark: event_time minus 15 minutes (mobile networks are spiky).
  3. Allowed lateness: 2 hours (users can come online later).
  4. Trigger: early results every 5 minutes + final firing when watermark passes end + grace period.

Result: Dashboards update frequently with provisional counts, then converge as late events arrive within 2 hours.
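
Session windows are defined by gaps rather than fixed boundaries. A pure-Python sketch (names illustrative) that groups one user’s event times into sessions separated by at least 30 minutes of inactivity:

# Sketch: merge one user's event times into 30-minute-gap sessions
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)

def to_sessions(event_times):
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] < SESSION_GAP:
            sessions[-1].append(t)   # within the gap: extend the current session
        else:
            sessions.append([t])     # otherwise: start a new session
    return sessions

Note that in a streaming job a single late event can bridge two previously emitted sessions, which is one reason session windows with long allowed lateness usually need an accumulation mode that can correct earlier output.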

Example 3: IoT sensors with occasional clock drift

  1. Window: sliding 10-minute windows every 2 minutes.
  2. Watermark: event_time minus 2 minutes (devices mostly in-order).
  3. Allowed lateness: 3 minutes.
  4. Dedup: device_id + event_ts key; keep last-seen event_ts per device for 24h.

Result: Small drifts are absorbed; extreme drifts go to side output for device clock diagnostics.
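
The dedup step from Example 3 can be sketched as a small keyed cache with a TTL (pure Python; the names and the linear eviction scan are illustrative only):

# Sketch: drop duplicates of (device_id, event_ts) seen within the last 24h
from datetime import datetime, timedelta

TTL = timedelta(hours=24)
seen = {}   # (device_id, event_ts) -> processing time when first seen

def is_duplicate(device_id, event_ts, now=None):
    now = now or datetime.now()
    # Evict expired keys so state stays bounded (a real job would use timers or an ordered index)
    for key, first_seen in list(seen.items()):
        if now - first_seen > TTL:
            del seen[key]
    key = (device_id, event_ts)
    if key in seen:
        return True
    seen[key] = now
    return False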

How to choose watermark and allowed lateness

  • Measure real lateness: examine the 95th/99th percentile delay between event_time and processing_time.
  • Pick a watermark delay slightly beyond typical out-of-order arrival (e.g., 99th percentile).
  • Choose allowed lateness based on business tolerance for corrections vs freshness.
  • Validate state cost: larger lateness increases state size and memory.
  • Define a policy for extreme late events (side output or offline reprocessing).
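
The “measure real lateness” step above is just a percentile over observed delays. A sketch assuming you have logged (event_time, processing_time) pairs somewhere queryable:

# Sketch: derive a candidate watermark delay from observed lateness
def lateness_percentile(pairs, pct=0.99):
    """pairs: iterable of (event_time, processing_time) datetime tuples."""
    delays = sorted((p - e).total_seconds() for e, p in pairs)
    if not delays:
        return 0.0
    idx = min(int(pct * len(delays)), len(delays) - 1)
    return delays[idx]   # e.g. use the P99 delay (plus a small margin) as the watermark delay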

Exercises

Complete these two exercises. They mirror the tasks in the Exercises panel below. Write answers locally, then compare with the provided solutions.

Exercise 1: Configure windows, watermark, and allowed lateness

Goal: Create hourly revenue with 10-minute allowed lateness and route extra-late events to a side output. Use pseudocode similar to Apache Beam or Flink.

// Pseudocode (Beam-like)
stream
  .assignTimestampsFrom(event.event_time)
  .withWatermark(maxOutOfOrderness = 5m)
  .window(Tumbling.of(1h))
  .allowedLateness(10m)
  .trigger(early = processingTime(1m), finalOnWatermark = true)
  .accumulationMode(ACCUMULATING_AND_RETRACTING)
  .aggregate(sum(amount) by hour)
  .onLateBeyondAllowed(sendTo = late_events_sink)
  .sink(revenue_hourly_sink)

  • Checklist:
    • Event-time timestamps assigned
    • Watermark configured (5m)
    • Allowed lateness (10m)
    • Trigger with early + final
    • Accumulation supports corrections
    • Side output for too-late events
Expected behavior

Late events up to 10 minutes after window end update the window; later events are routed to late_events_sink and do not alter the aggregate.

Hints
  • Think: event time first, then watermark.
  • Choose an accumulation mode that can correct previously emitted values.
Solution
See the Exercises panel solution for ex1. Your configuration should closely match the pseudocode block above.

Exercise 2: Design a lateness policy for an IoT line

Scenario: Factory devices publish temperature every 5s. 99th percentile lateness is 2m, but sometimes devices reconnect after 20m. Produce 1-minute averages with minimal correction delay.

  • Propose: watermark delay, allowed lateness, triggers, dedup key, and late-event handling policy.
  • Checklist:
    • Watermark accounts for typical lateness (≈2m)
    • Allowed lateness balances state vs correctness
    • Triggers emit early and final
    • Dedup key is stable and bounded (device_id + event_ts)
    • Side output or backfill plan for > allowed lateness
Expected content of your plan

Clear parameter values (numbers) and a short justification for each choice. Include what happens to events later than allowed lateness.

Hints
  • Consider memory: 20m grace may be expensive; can you choose 5–10m and send stragglers to a dead-letter topic?
  • Use early triggers to keep dashboards fresh while waiting for the watermark and grace period.
Solution
Example plan:
- Watermark: event_time minus 2m (matches P99 out-of-order).
- Allowed lateness: 8m (captures many reconnects without large state).
- Triggers: early every 30s; final on watermark + grace end.
- Dedup: key = device_id + event_ts; keep seen keys for 30m TTL.
- Late beyond 8m: route to side output for offline reconciliation; do not update windows.
Rationale: Frequent early results keep UIs fresh; 8m grace captures most reconnects while capping state growth.

Common mistakes and self-check

  • Using processing time windows for event-time problems. Self-check: Are you aggregating by when things happened or when they were seen?
  • Watermark too aggressive, dropping valid late events. Self-check: Compare drop rates to lateness distribution.
  • Unlimited allowed lateness causing unbounded state. Self-check: Inspect state size and checkpoint duration.
  • No dedup, leading to double counting during retries. Self-check: Do you have a stable id per event and a TTL for seen keys?
  • Downstream sinks not idempotent. Self-check: Are you upserting by key, or does the sink support retractions? (A keyed-upsert sketch follows this list.)
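
For the last point, one common fix is to upsert keyed by (window_start, key) so that re-emitted windows overwrite rather than duplicate. A minimal SQLite sketch (table and column names are illustrative):

# Sketch: idempotent sink via keyed upsert (SQLite)
import sqlite3

conn = sqlite3.connect("revenue.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS revenue_hourly (
        window_start TEXT NOT NULL,
        currency     TEXT NOT NULL,
        total        REAL NOT NULL,
        PRIMARY KEY (window_start, currency)
    )
""")

def upsert(window_start, currency, total):
    # Re-emitting the same window (e.g. after a late correction) overwrites instead of duplicating
    conn.execute(
        """INSERT INTO revenue_hourly (window_start, currency, total)
           VALUES (?, ?, ?)
           ON CONFLICT (window_start, currency) DO UPDATE SET total = excluded.total""",
        (window_start, currency, total),
    )
    conn.commit()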

Practical projects

  • Late-event tolerant revenue dashboard: Build a small pipeline that computes hourly revenue with 5m watermark and 10m allowed lateness; show provisional vs final totals.
  • IoT monitoring with side outputs: Produce 1m averages and capture extreme late events into a separate topic/table, then create a simple report of how often they occur.
  • Dedup service: Implement a tiny stateful component that drops duplicates based on event_id within a 24h TTL and measure throughput and memory.

Learning path

  • Before this: Event-time vs processing-time; basic windowing concepts.
  • Now: Handling Late Events (this lesson) — watermarks, allowed lateness, triggers, accumulation modes, dedup.
  • Next: Exactly-once sinks, idempotent upserts, state tuning and backfills.

Mini challenge

Design a policy for ad-click attribution where conversions can arrive up to 48 hours after clicks. Choose windowing, watermark, allowed lateness, triggers, accumulation mode, dedup strategy, and what to do with > 48h conversions. Write 5–7 bullet points justifying each choice.

Quick test

Take the quick test to check your understanding.

Practice Exercises

2 exercises to complete

Instructions

Build hourly revenue with 10-minute allowed lateness. Use event time, a 5-minute watermark, early 1-minute trigger, accumulating-and-retracting updates, and route extra-late events to a side output. Write pseudocode (Beam/Flink-like) and a one-sentence explanation of how late events are handled.

// Fill in pseudocode
stream
  .assignTimestampsFrom(...)
  .withWatermark(maxOutOfOrderness = ...)
  .window(Tumbling.of(...))
  .allowedLateness(...)
  .trigger(early = ..., finalOnWatermark = true)
  .accumulationMode(...)
  .aggregate(sum(...))
  .onLateBeyondAllowed(sendTo = ...)
  .sink(...)

Expected Output
Pseudocode showing: event-time assignment; 5m watermark; 1h tumbling window; 10m allowed lateness; early trigger; accumulating-and-retracting; side output for too-late events; final sink for results.

Handling Late Events — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

