
Handling Late Arriving Data

Learn Handling Late Arriving Data for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

In ETL and streaming pipelines, data often arrives late due to mobile offline usage, partner batch delays, clock drift, or retries. If you ignore lateness, you get wrong aggregates, broken foreign keys, and inconsistent facts. Handling late data correctly is how you keep dashboards, financials, and ML features trustworthy.

  • Daily revenue accurate even when a partner uploads yesterday’s orders today
  • Customer facts connected to the right Slowly Changing Dimension (SCD) version
  • Reliable event-time metrics for growth, fraud, and retention

Late arriving data — concept explained simply

Late arriving data means a record’s event_time is older than when you process it. Two clocks matter:

  • Event time: when the event actually happened
  • Processing time: when your system sees/processes it

Mental model: two clocks, three buckets

  • On-time: event_time is within your current window
  • Slightly late: within your allowed lateness (you can still update aggregates)
  • Very late: older than your watermark; route to backfill or quarantine

A watermark is your “we believe we have seen everything up to here (in event time)” boundary. You choose it (e.g., event_max - 30 minutes) based on business SLAs.
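
As a small illustration of the three buckets, here is a minimal sketch (engine-agnostic SQL); clicks_raw, event_time, and the :window_start parameter are placeholder names, and the 30-minute allowed lateness is just an example value:

-- Classify each record relative to the current window start and a
-- watermark of MAX(event_time) - 30 minutes.
WITH wm AS (
  SELECT MAX(event_time) - INTERVAL '30 minutes' AS watermark
  FROM clicks_raw
)
SELECT c.*,
       CASE
         WHEN c.event_time >= :window_start THEN 'on-time'
         WHEN c.event_time >= wm.watermark  THEN 'slightly late'
         ELSE 'very late'
       END AS lateness_bucket
FROM clicks_raw c
CROSS JOIN wm;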

Design patterns that work

Event-time processing + watermarks

Drive windows and joins by event_time. Maintain a watermark (max event_time seen minus allowed lateness). Records older than the watermark are very late and should be handled separately.

As-of (temporal) joins to SCD2 dimensions

Join facts to the dimension version valid at event_time: effective_from <= event_time < effective_to. This ensures a late order still links to the historical customer segment valid back then.

Inferred/Unknown dimension rows

When a fact references a dimension member that does not exist yet, insert an inferred placeholder row (e.g., Unknown or Not Provided). Later, when the real dimension record arrives, update the placeholder and re-link facts if needed.
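
A minimal sketch of the pattern, assuming the dim_customer_scd2 and fact_order_staging tables from the examples below, a hypothetical is_inferred flag on the dimension, and a hypothetical customer_source landing table (PostgreSQL-style UPDATE ... FROM; use MERGE on other engines):

-- Step 1: create a placeholder (inferred) dimension row for any customer
-- referenced by a late fact but not yet present in the dimension.
-- The wide validity range lets as-of joins succeed until the real row arrives.
INSERT INTO dim_customer_scd2 (customer_id, segment, effective_from, effective_to, is_inferred)
SELECT DISTINCT f.customer_id,
       'Unknown',
       TIMESTAMP '1900-01-01 00:00:00',
       TIMESTAMP '9999-12-31 00:00:00',
       TRUE
FROM fact_order_staging f
WHERE NOT EXISTS (
  SELECT 1 FROM dim_customer_scd2 d WHERE d.customer_id = f.customer_id
);

-- Step 2: when the real customer record lands, overwrite the inferred
-- attributes in place; the surrogate key is preserved, so facts stay linked.
UPDATE dim_customer_scd2 d
SET segment = s.segment,
    is_inferred = FALSE
FROM customer_source s
WHERE d.customer_id = s.customer_id
  AND d.is_inferred = TRUE;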

Idempotent upserts and deterministic deduplication

Define a stable key (e.g., business_id + event_id or sequence) and keep the latest version using a sequence/timestamp. Use MERGE semantics so replays do not double count.

Reprocessing windows

Keep aggregates for several recent windows mutable (e.g., last 3 days). Late arrivals within this window update aggregates; older ones feed a backfill job.
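
A minimal sketch of the idea, assuming a daily_revenue aggregate keyed by order_date and an already-deduplicated orders_clean table (both names are illustrative): only the last three days are rebuilt, so slightly late arrivals simply overwrite the affected partitions.

-- Recompute the mutable window (last 3 days) from deduplicated event-time data.
-- Older dates stay untouched; very late records go to the backfill job.
DELETE FROM daily_revenue
WHERE order_date >= CURRENT_DATE - INTERVAL '3 days';

INSERT INTO daily_revenue (order_date, revenue)
SELECT CAST(order_ts AS DATE) AS order_date,
       SUM(amount)            AS revenue
FROM orders_clean
WHERE order_ts >= CURRENT_DATE - INTERVAL '3 days'
GROUP BY CAST(order_ts AS DATE);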

Dead-letter/quarantine for unjoinable or corrupt records

Route records that fail schema validation, carry impossible timestamps, or are missing required keys. Tag each with a reason code so it can be fixed and replayed.
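
A sketch of the routing step, assuming a clicks_quarantine table with a reason_code column and a payload column on the raw feed (all illustrative names):

-- Route corrupt or unjoinable rows to quarantine with a reason code.
-- "Impossible timestamp" here means an event_time that is missing or in the future.
INSERT INTO clicks_quarantine (user_id, event_id, event_time, payload, reason_code)
SELECT user_id, event_id, event_time, payload,
       CASE
         WHEN event_id IS NULL               THEN 'MISSING_KEY'
         WHEN event_time IS NULL             THEN 'MISSING_EVENT_TIME'
         WHEN event_time > CURRENT_TIMESTAMP THEN 'FUTURE_TIMESTAMP'
         ELSE 'UNKNOWN'
       END AS reason_code
FROM clicks_raw
WHERE event_id IS NULL
   OR event_time IS NULL
   OR event_time > CURRENT_TIMESTAMP;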

Backfills without double counting

Backfill by recomputing from raw event-time slices, or by delta-correcting aggregates with idempotent merges. Always deduplicate first.
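
A sketch of the recompute-from-raw approach for one event-time slice, reusing the clicks_clean/clicks_late split from Example 2 and a hypothetical daily_clicks aggregate; running it twice produces the same totals.

-- Clear the affected slice, then rebuild it from deduplicated clean + late rows.
DELETE FROM daily_clicks
WHERE event_date BETWEEN :backfill_start AND :backfill_end;

WITH all_events AS (
  SELECT * FROM clicks_clean
  UNION ALL
  SELECT * FROM clicks_late
), dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id, event_id ORDER BY processing_time DESC) AS rn
  FROM all_events
)
INSERT INTO daily_clicks (event_date, clicks)
SELECT CAST(event_time AS DATE) AS event_date,
       COUNT(*)                 AS clicks
FROM dedup
WHERE rn = 1
  AND CAST(event_time AS DATE) BETWEEN :backfill_start AND :backfill_end
GROUP BY CAST(event_time AS DATE);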

Audit and observability

Track lateness histograms, percent late per source, and number quarantined. Alert if lateness breaks your SLA.
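
A sketch of the per-source metrics, assuming clicks_raw carries a source_id and a load_ts processing timestamp (PostgreSQL-style EXTRACT(EPOCH ...); the 30-minute threshold is an example):

-- Daily lateness metrics per source: average lateness in minutes and the
-- share of records exceeding the allowed lateness.
SELECT source_id,
       CAST(load_ts AS DATE)                                  AS load_date,
       COUNT(*)                                               AS records,
       AVG(EXTRACT(EPOCH FROM (load_ts - event_time)) / 60.0) AS avg_lateness_min,
       100.0 * SUM(CASE WHEN load_ts - event_time > INTERVAL '30 minutes'
                        THEN 1 ELSE 0 END) / COUNT(*)         AS pct_very_late
FROM clicks_raw
GROUP BY source_id, CAST(load_ts AS DATE)
ORDER BY load_date DESC, pct_very_late DESC;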

Worked examples

Example 1 — Late fact with SCD2 as-of join

Data:

-- dim_customer_scd2
customer_sk | customer_id | segment | effective_from | effective_to
----------- | ----------- | ------- | -------------- | ------------
1           | 100         | Standard| 2023-01-01     | 2023-07-01
2           | 100         | Gold    | 2023-07-01     | 9999-12-31

-- fact_order_staging (late arrivals today)
order_id | customer_id | order_ts            | amount
-------- | ----------- | ------------------- | ------
2001     | 100         | 2023-05-10 10:00:00 | 50.00
2002     | 100         | 2023-07-15 09:00:00 | 90.00

Goal: Assign the correct customer_sk by order_ts:

SELECT f.order_id,
       d.customer_sk,
       f.amount
FROM fact_order_staging f
JOIN dim_customer_scd2 d
  ON d.customer_id = f.customer_id
 AND f.order_ts >= d.effective_from
 AND f.order_ts  < d.effective_to;

Result: 2001 → customer_sk=1 (Standard), 2002 → customer_sk=2 (Gold).

Example 2 — Watermark split: on-time vs very-late

Batch window T processes the events loaded in the current batch. Allowed lateness: 30 minutes.

-- CTEs are scoped to a single statement, so each INSERT repeats the same
-- batch/watermark/dedup block (or materialize it into a temp table first).
WITH batch AS (
  SELECT * FROM clicks_raw WHERE load_batch_id = :batch_id
), stats AS (
  SELECT MAX(event_time) AS max_event_time FROM batch
), wm AS (
  SELECT max_event_time - INTERVAL '30 minutes' AS watermark FROM stats
), dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id, event_id ORDER BY processing_time DESC) AS rn
  FROM batch
)
-- on-time and slightly late (in production, list explicit target columns; d.* also carries the rn helper)
INSERT INTO clicks_clean
SELECT d.* FROM dedup d CROSS JOIN wm
WHERE d.rn = 1 AND d.event_time >= wm.watermark;

WITH batch AS (
  SELECT * FROM clicks_raw WHERE load_batch_id = :batch_id
), stats AS (
  SELECT MAX(event_time) AS max_event_time FROM batch
), wm AS (
  SELECT max_event_time - INTERVAL '30 minutes' AS watermark FROM stats
), dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id, event_id ORDER BY processing_time DESC) AS rn
  FROM batch
)
-- very late
INSERT INTO clicks_late
SELECT d.* FROM dedup d CROSS JOIN wm
WHERE d.rn = 1 AND d.event_time < wm.watermark;

Now your aggregates update from clicks_clean; very late records are handled by a scheduled backfill.

Example 3 — Idempotent upsert using sequence

Sensor readings may be replayed, each replay carrying a higher seq_num. Keep the latest version per (device_id, reading_id):

MERGE INTO sensor_readings t
USING (
  SELECT * FROM sensor_readings_staging s
  QUALIFY ROW_NUMBER() OVER (PARTITION BY device_id, reading_id ORDER BY seq_num DESC) = 1
) s
ON t.device_id = s.device_id AND t.reading_id = s.reading_id
WHEN MATCHED AND s.seq_num >= t.seq_num THEN
  UPDATE SET value = s.value, event_time = s.event_time, seq_num = s.seq_num
WHEN NOT MATCHED THEN
  INSERT (device_id, reading_id, value, event_time, seq_num)
  VALUES (s.device_id, s.reading_id, s.value, s.event_time, s.seq_num);

This prevents double counting and always keeps the newest version.

Step-by-step implementation blueprint

  1. Define time fields: ensure every record has event_time and processing_time (or load_ts).
  2. Choose primary keys: e.g., (source_id, event_id) or (business_id, sequence).
  3. Select allowed lateness: align to SLAs (e.g., 30 minutes, 24 hours).
  4. Compute watermark: event_max - allowed_lateness per batch/stream.
  5. Deduplicate deterministically: use ROW_NUMBER by key ordered by sequence or processing_time.
  6. Implement as-of joins for SCD2: join on effective_from <= event_time < effective_to.
  7. Split flows: on-time path updates aggregates; very-late path quarantines or backfills.
  8. Make writes idempotent: use MERGE with version/sequence guards.
  9. Observe and alert: track percent late, quarantine size, and backfill lag.

Exercises (progress note)

The exercises and quick test are available to everyone. Only logged-in users will see saved progress in their account.

  • Exercise 1: As-of join for late facts (see below)
  • Exercise 2: Watermark split and dedup (see below)

Readiness checklist

  • I can explain event_time vs processing_time clearly
  • I can set and compute a watermark from recent data
  • I can do an as-of join to an SCD2 dimension
  • I can deduplicate deterministically by key and sequence
  • My writes are idempotent (safe to replay)
  • I have a quarantine path and a backfill plan
  • I monitor lateness and act on alerts

Common mistakes and how to self-check

Joining late facts to the current dimension

Mistake: joining on customer_id only. Fix: temporal join using effective_from/effective_to. Self-check: sample a late order and confirm it maps to the historical segment.

No deduplication before aggregates

Mistake: counting replays twice. Fix: ROW_NUMBER by key; keep rn=1. Self-check: re-ingest a batch; totals should not change.

Watermark too aggressive

Mistake: declaring data late too early. Fix: analyze lateness distribution; set allowed_lateness to cover 95–99% of cases. Self-check: percent very-late should be small and stable.
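
One way to ground that analysis, as a sketch (PostgreSQL-style percentile_cont, assuming load_ts and event_time columns on clicks_raw): measure the lateness distribution and set allowed_lateness just above the percentile your SLA needs to cover.

-- Lateness distribution in minutes: p50/p95/p99 of (load_ts - event_time).
SELECT percentile_cont(0.50) WITHIN GROUP (ORDER BY lateness_min) AS p50_min,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY lateness_min) AS p95_min,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY lateness_min) AS p99_min
FROM (
  SELECT EXTRACT(EPOCH FROM (load_ts - event_time)) / 60.0 AS lateness_min
  FROM clicks_raw
) t;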

Backfills that double count

Mistake: additive backfills without clearing prior state. Fix: recompute from raw or use idempotent MERGE with version guards. Self-check: totals unchanged after backfill replay.

Practical projects

  • Build a mini pipeline for mobile events: simulate offline bursts, implement watermark split, dedup, and daily aggregates.
  • IoT stream: upsert latest sensor states by seq_num; produce hourly averages with 1-hour allowed lateness.
  • Partner CSV drops: SCD2 customer dimension + late orders fact; produce daily revenue as-of correct segment.

Who this is for

  • ETL Developers and Data Engineers working with batch or streaming pipelines
  • Analytics Engineers dealing with historical correctness
  • ML feature engineers needing stable, event-time features

Prerequisites

  • Comfort with SQL joins, window functions, and MERGE/UPSERT
  • Understanding of SCD2 dimensions and surrogate keys
  • Basic knowledge of batch windows or streaming concepts

Learning path

  • Review SCD types and temporal modeling
  • Learn deduplication and idempotent patterns
  • Implement watermarks and quarantine flows
  • Practice backfills on raw, immutable data

Next steps

  • Complete the exercises and run the quick test
  • Add lateness monitoring to a current pipeline
  • Schedule a safe backfill run and verify no double counting

Mini challenge

Your partner delivers a file 48 hours late containing last week’s orders, with some duplicates and missing customers. Draft a plan that covers: dedup keys, watermark threshold, inferred dimension rows, as-of join, and a backfill that won’t double count. Keep it to 8–10 bullet points.

Quick Test

Ready? Take the Quick Test below to check your understanding.

Practice Exercises

2 exercises to complete

Exercise 1: As-of join for late facts

Instructions

You receive late orders that must link to the correct historical customer segment. Use an as-of join.

Tables:

-- dim_customer_scd2
customer_sk | customer_id | segment | effective_from       | effective_to
----------- | ----------- | ------- | -------------------- | ----------------
1           | 100         | Standard| 2023-01-01 00:00:00  | 2023-07-01 00:00:00
2           | 100         | Gold    | 2023-07-01 00:00:00  | 9999-12-31 00:00:00

-- fact_order_staging (late arrivals)
order_id | customer_id | order_ts              | amount
-------- | ----------- | --------------------- | ------
2001     | 100         | 2023-05-10 10:00:00   | 50.00
2002     | 100         | 2023-07-15 09:00:00   | 90.00

Task: Write a SQL SELECT that returns order_id, customer_sk, amount by joining fact_order_staging to dim_customer_scd2 as-of order_ts.

Expected Output
Rows: (2001, 1, 50.00) and (2002, 2, 90.00) — correct surrogate keys per event_time.

Handling Late Arriving Data — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

