
Handling Late Arriving Data

Learn Handling Late Arriving Data for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

In ETL and streaming pipelines, data often arrives late due to mobile offline usage, partner batch delays, clock drift, or retries. If you ignore lateness, you get wrong aggregates, broken foreign keys, and inconsistent facts. Handling late data correctly is how you keep dashboards, financials, and ML features trustworthy.

  • Daily revenue accurate even when a partner uploads yesterday’s orders today
  • Customer facts connected to the right Slowly Changing Dimension (SCD) version
  • Reliable event-time metrics for growth, fraud, and retention

Late arriving data — concept explained simply

Late arriving data means a record’s event_time is older than when you process it. Two clocks matter:

  • Event time: when the event actually happened
  • Processing time: when your system sees/processes it

Mental model: two clocks, three buckets

  • On-time: event_time is within your current window
  • Slightly late: within your allowed lateness (you can still update aggregates)
  • Very late: older than your watermark; route to backfill or quarantine

A watermark is your “we believe we have seen everything up to here (in event time)” boundary. You choose it (e.g., event_max - 30 minutes) based on business SLAs.
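
As a small illustration of the three buckets, here is a minimal sketch (engine-agnostic SQL); clicks_raw, event_time, and the :window_start parameter are placeholder names, and the 30-minute allowed lateness is just an example value:

-- Classify each record relative to the current window start and a
-- watermark of MAX(event_time) - 30 minutes.
WITH wm AS (
  SELECT MAX(event_time) - INTERVAL '30 minutes' AS watermark
  FROM clicks_raw
)
SELECT c.*,
       CASE
         WHEN c.event_time >= :window_start THEN 'on-time'
         WHEN c.event_time >= wm.watermark  THEN 'slightly late'
         ELSE 'very late'
       END AS lateness_bucket
FROM clicks_raw c
CROSS JOIN wm;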

Design patterns that work

Event-time processing + watermarks

Drive windows and joins by event_time. Maintain a watermark (max event_time seen minus allowed lateness). Records older than the watermark are very late and should be handled separately.

As-of (temporal) joins to SCD2 dimensions

Join facts to the dimension version valid at event_time: effective_from <= event_time < effective_to. This ensures a late order still links to the historical customer segment valid back then.

Inferred/Unknown dimension rows

When a fact references a dimension member that does not exist yet, insert an inferred placeholder row (e.g., Unknown or Not Provided). Later, when the real dimension record arrives, update the placeholder and re-link facts if needed.
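
A minimal sketch of the pattern, assuming the dim_customer_scd2 and fact_order_staging tables from the examples below, a hypothetical is_inferred flag on the dimension, and a hypothetical customer_source landing table (PostgreSQL-style UPDATE ... FROM; use MERGE on other engines):

-- Step 1: create a placeholder (inferred) dimension row for any customer
-- referenced by a late fact but not yet present in the dimension.
-- The wide validity range lets as-of joins succeed until the real row arrives.
INSERT INTO dim_customer_scd2 (customer_id, segment, effective_from, effective_to, is_inferred)
SELECT DISTINCT f.customer_id,
       'Unknown',
       TIMESTAMP '1900-01-01 00:00:00',
       TIMESTAMP '9999-12-31 00:00:00',
       TRUE
FROM fact_order_staging f
WHERE NOT EXISTS (
  SELECT 1 FROM dim_customer_scd2 d WHERE d.customer_id = f.customer_id
);

-- Step 2: when the real customer record lands, overwrite the inferred
-- attributes in place; the surrogate key is preserved, so facts stay linked.
UPDATE dim_customer_scd2 d
SET segment = s.segment,
    is_inferred = FALSE
FROM customer_source s
WHERE d.customer_id = s.customer_id
  AND d.is_inferred = TRUE;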

Idempotent upserts and deterministic deduplication

Define a stable key (e.g., business_id + event_id or sequence) and keep the latest version using a sequence/timestamp. Use MERGE semantics so replays do not double count.

Reprocessing windows

Keep aggregates for several recent windows mutable (e.g., last 3 days). Late arrivals within this window update aggregates; older ones feed a backfill job.
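
A minimal sketch of the idea, assuming a daily_revenue aggregate keyed by order_date and an already-deduplicated orders_clean table (both names are illustrative): only the last three days are rebuilt, so slightly late arrivals simply overwrite the affected partitions.

-- Recompute the mutable window (last 3 days) from deduplicated event-time data.
-- Older dates stay untouched; very late records go to the backfill job.
DELETE FROM daily_revenue
WHERE order_date >= CURRENT_DATE - INTERVAL '3 days';

INSERT INTO daily_revenue (order_date, revenue)
SELECT CAST(order_ts AS DATE) AS order_date,
       SUM(amount)            AS revenue
FROM orders_clean
WHERE order_ts >= CURRENT_DATE - INTERVAL '3 days'
GROUP BY CAST(order_ts AS DATE);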

Dead-letter/quarantine for unjoinable or corrupt records

Route records that fail schema validation, carry impossible timestamps, or are missing required keys. Tag each with a reason code so it can be fixed and replayed.
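
A sketch of the routing step, assuming a clicks_quarantine table with a reason_code column and a payload column on the raw feed (all illustrative names):

-- Route corrupt or unjoinable rows to quarantine with a reason code.
-- "Impossible timestamp" here means an event_time that is missing or in the future.
INSERT INTO clicks_quarantine (user_id, event_id, event_time, payload, reason_code)
SELECT user_id, event_id, event_time, payload,
       CASE
         WHEN event_id IS NULL               THEN 'MISSING_KEY'
         WHEN event_time IS NULL             THEN 'MISSING_EVENT_TIME'
         WHEN event_time > CURRENT_TIMESTAMP THEN 'FUTURE_TIMESTAMP'
         ELSE 'UNKNOWN'
       END AS reason_code
FROM clicks_raw
WHERE event_id IS NULL
   OR event_time IS NULL
   OR event_time > CURRENT_TIMESTAMP;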

Backfills without double counting

Backfill by recomputing from raw event-time slices, or by delta-correcting aggregates with idempotent merges. Always deduplicate first.
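
A sketch of the recompute-from-raw approach for one event-time slice, reusing the clicks_clean/clicks_late split from Example 2 and a hypothetical daily_clicks aggregate; running it twice produces the same totals.

-- Clear the affected slice, then rebuild it from deduplicated clean + late rows.
DELETE FROM daily_clicks
WHERE event_date BETWEEN :backfill_start AND :backfill_end;

WITH all_events AS (
  SELECT * FROM clicks_clean
  UNION ALL
  SELECT * FROM clicks_late
), dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id, event_id ORDER BY processing_time DESC) AS rn
  FROM all_events
)
INSERT INTO daily_clicks (event_date, clicks)
SELECT CAST(event_time AS DATE) AS event_date,
       COUNT(*)                 AS clicks
FROM dedup
WHERE rn = 1
  AND CAST(event_time AS DATE) BETWEEN :backfill_start AND :backfill_end
GROUP BY CAST(event_time AS DATE);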

Audit and observability

Track lateness histograms, percent late per source, and number quarantined. Alert if lateness breaks your SLA.
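
A sketch of the per-source metrics, assuming clicks_raw carries a source_id and a load_ts processing timestamp (PostgreSQL-style EXTRACT(EPOCH ...); the 30-minute threshold is an example):

-- Daily lateness metrics per source: average lateness in minutes and the
-- share of records exceeding the allowed lateness.
SELECT source_id,
       CAST(load_ts AS DATE)                                  AS load_date,
       COUNT(*)                                               AS records,
       AVG(EXTRACT(EPOCH FROM (load_ts - event_time)) / 60.0) AS avg_lateness_min,
       100.0 * SUM(CASE WHEN load_ts - event_time > INTERVAL '30 minutes'
                        THEN 1 ELSE 0 END) / COUNT(*)         AS pct_very_late
FROM clicks_raw
GROUP BY source_id, CAST(load_ts AS DATE)
ORDER BY load_date DESC, pct_very_late DESC;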

Worked examples

Example 1 — Late fact with SCD2 as-of join

Data:

-- dim_customer_scd2
customer_sk | customer_id | segment | effective_from | effective_to
----------- | ----------- | ------- | -------------- | ------------
1           | 100         | Standard| 2023-01-01     | 2023-07-01
2           | 100         | Gold    | 2023-07-01     | 9999-12-31

-- fact_order_staging (late arrivals today)
order_id | customer_id | order_ts            | amount
-------- | ----------- | ------------------- | ------
2001     | 100         | 2023-05-10 10:00:00 | 50.00
2002     | 100         | 2023-07-15 09:00:00 | 90.00

Goal: Assign the correct customer_sk by order_ts:

SELECT f.order_id,
       d.customer_sk,
       f.amount
FROM fact_order_staging f
JOIN dim_customer_scd2 d
  ON d.customer_id = f.customer_id
 AND f.order_ts >= d.effective_from
 AND f.order_ts  < d.effective_to;

Result: 2001 → customer_sk=1 (Standard), 2002 → customer_sk=2 (Gold).

Example 2 — Watermark split: on-time vs very-late

Batch window T processes the events loaded in the current batch. Allowed lateness: 30 minutes.

-- CTEs are scoped to a single statement, so each INSERT repeats the same
-- batch/watermark/dedup block (or materialize it into a temp table first).
WITH batch AS (
  SELECT * FROM clicks_raw WHERE load_batch_id = :batch_id
), stats AS (
  SELECT MAX(event_time) AS max_event_time FROM batch
), wm AS (
  SELECT max_event_time - INTERVAL '30 minutes' AS watermark FROM stats
), dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id, event_id ORDER BY processing_time DESC) AS rn
  FROM batch
)
-- on-time and slightly late (in production, list explicit target columns; d.* also carries the rn helper)
INSERT INTO clicks_clean
SELECT d.* FROM dedup d CROSS JOIN wm
WHERE d.rn = 1 AND d.event_time >= wm.watermark;

WITH batch AS (
  SELECT * FROM clicks_raw WHERE load_batch_id = :batch_id
), stats AS (
  SELECT MAX(event_time) AS max_event_time FROM batch
), wm AS (
  SELECT max_event_time - INTERVAL '30 minutes' AS watermark FROM stats
), dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id, event_id ORDER BY processing_time DESC) AS rn
  FROM batch
)
-- very late
INSERT INTO clicks_late
SELECT d.* FROM dedup d CROSS JOIN wm
WHERE d.rn = 1 AND d.event_time < wm.watermark;

Now your aggregates update from clicks_clean; very late records are handled by a scheduled backfill.

Example 3 — Idempotent upsert using sequence

Sensor readings may be replayed, each replay carrying a higher seq_num. Keep the latest version per (device_id, reading_id):

MERGE INTO sensor_readings t
USING (
  SELECT * FROM sensor_readings_staging s
  QUALIFY ROW_NUMBER() OVER (PARTITION BY device_id, reading_id ORDER BY seq_num DESC) = 1
) s
ON t.device_id = s.device_id AND t.reading_id = s.reading_id
WHEN MATCHED AND s.seq_num >= t.seq_num THEN
  UPDATE SET value = s.value, event_time = s.event_time, seq_num = s.seq_num
WHEN NOT MATCHED THEN
  INSERT (device_id, reading_id, value, event_time, seq_num)
  VALUES (s.device_id, s.reading_id, s.value, s.event_time, s.seq_num);

This prevents double counting and always keeps the newest version.

Step-by-step implementation blueprint

  1. Define time fields: ensure every record has event_time and processing_time (or load_ts).
  2. Choose primary keys: e.g., (source_id, event_id) or (business_id, sequence).
  3. Select allowed lateness: align to SLAs (e.g., 30 minutes, 24 hours).
  4. Compute watermark: event_max - allowed_lateness per batch/stream.
  5. Deduplicate deterministically: use ROW_NUMBER by key ordered by sequence or processing_time.
  6. Implement as-of joins for SCD2: join on effective_from <= event_time < effective_to.
  7. Split flows: on-time path updates aggregates; very-late path quarantines or backfills.
  8. Make writes idempotent: use MERGE with version/sequence guards.
  9. Observe and alert: track percent late, quarantine size, and backfill lag.

Exercises (progress note)

The exercises and quick test are available to everyone. Only logged-in users will see saved progress in their account.

  • Exercise 1: As-of join for late facts (see below)
  • Exercise 2: Watermark split and dedup (see below)

Readiness checklist

  • I can explain event_time vs processing_time clearly
  • I can set and compute a watermark from recent data
  • I can do an as-of join to an SCD2 dimension
  • I can deduplicate deterministically by key and sequence
  • My writes are idempotent (safe to replay)
  • I have a quarantine path and a backfill plan
  • I monitor lateness and act on alerts

Common mistakes and how to self-check

Joining late facts to the current dimension

Mistake: joining on customer_id only. Fix: temporal join using effective_from/effective_to. Self-check: sample a late order and confirm it maps to the historical segment.

No deduplication before aggregates

Mistake: counting replays twice. Fix: ROW_NUMBER by key; keep rn=1. Self-check: re-ingest a batch; totals should not change.

Watermark too aggressive

Mistake: declaring data late too early. Fix: analyze lateness distribution; set allowed_lateness to cover 95–99% of cases. Self-check: percent very-late should be small and stable.
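
One way to ground that analysis, as a sketch (PostgreSQL-style percentile_cont, assuming load_ts and event_time columns on clicks_raw): measure the lateness distribution and set allowed_lateness just above the percentile your SLA needs to cover.

-- Lateness distribution in minutes: p50/p95/p99 of (load_ts - event_time).
SELECT percentile_cont(0.50) WITHIN GROUP (ORDER BY lateness_min) AS p50_min,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY lateness_min) AS p95_min,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY lateness_min) AS p99_min
FROM (
  SELECT EXTRACT(EPOCH FROM (load_ts - event_time)) / 60.0 AS lateness_min
  FROM clicks_raw
) t;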

Backfills that double count

Mistake: additive backfills without clearing prior state. Fix: recompute from raw or use idempotent MERGE with version guards. Self-check: totals unchanged after backfill replay.

Practical projects

  • Build a mini pipeline for mobile events: simulate offline bursts, implement watermark split, dedup, and daily aggregates.
  • IoT stream: upsert latest sensor states by seq_num; produce hourly averages with 1-hour allowed lateness.
  • Partner CSV drops: SCD2 customer dimension + late orders fact; produce daily revenue as-of correct segment.

Who this is for

  • ETL Developers and Data Engineers working with batch or streaming pipelines
  • Analytics Engineers dealing with historical correctness
  • ML feature engineers needing stable, event-time features

Prerequisites

  • Comfort with SQL joins, window functions, and MERGE/UPSERT
  • Understanding of SCD2 dimensions and surrogate keys
  • Basic knowledge of batch windows or streaming concepts

Learning path

  • Review SCD types and temporal modeling
  • Learn deduplication and idempotent patterns
  • Implement watermarks and quarantine flows
  • Practice backfills on raw, immutable data

Next steps

  • Complete the exercises and run the quick test
  • Add lateness monitoring to a current pipeline
  • Schedule a safe backfill run and verify no double counting

Mini challenge

Your partner delivers a file 48 hours late containing last week’s orders, with some duplicates and missing customers. Draft a plan that covers: dedup keys, watermark threshold, inferred dimension rows, as-of join, and a backfill that won’t double count. Keep it to 8–10 bullet points.

Quick Test

Ready? Take the Quick Test below to check your understanding.

Practice Exercises

2 exercises to complete

Exercise 1: As-of join for late facts

Instructions

You receive late orders that must link to the correct historical customer segment. Use an as-of join.

Tables:

-- dim_customer_scd2
customer_sk | customer_id | segment | effective_from       | effective_to
----------- | ----------- | ------- | -------------------- | ----------------
1           | 100         | Standard| 2023-01-01 00:00:00  | 2023-07-01 00:00:00
2           | 100         | Gold    | 2023-07-01 00:00:00  | 9999-12-31 00:00:00

-- fact_order_staging (late arrivals)
order_id | customer_id | order_ts              | amount
-------- | ----------- | --------------------- | ------
2001     | 100         | 2023-05-10 10:00:00   | 50.00
2002     | 100         | 2023-07-15 09:00:00   | 90.00

Task: Write a SQL SELECT that returns order_id, customer_sk, amount by joining fact_order_staging to dim_customer_scd2 as-of order_ts.

Expected Output
Rows: (2001, 1, 50.00) and (2002, 2, 90.00) — correct surrogate keys per event_time.

Handling Late Arriving Data — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

