Handling Late Arriving Data

A practical guide for data engineers: explanations, worked examples, exercises, and a quick test.

Published: January 8, 2026 | Updated: January 8, 2026

Who this is for

  • Data engineers building batch or streaming pipelines.
  • Analytics engineers/BI developers who need accurate, backfillable metrics.
  • Platform engineers designing ingestion and warehouse models.

Prerequisites

  • Comfort with SQL (SELECT, JOIN, GROUP BY, MERGE/UPSERT).
  • Basic understanding of partitioned data and incremental loads.
  • Familiarity with event timestamps vs load time.

Why this matters

In real systems, data often arrives late due to mobile offline usage, partner delays, retries, or upstream outages. As a Data Engineer, you will:

  • Produce accurate daily/weekly metrics even when events come days late.
  • Design idempotent upserts and backfills without double counting.
  • Set watermarks so reports finalize on time but still accept legitimate late data.
  • Maintain dimensional history when updates appear out of order.

Concept explained simply

Late-arriving data is any record whose event time is in the past relative to when it is processed. Think of sending birthday cards: some arrive on the day, others arrive days later. You still want the count of cards per day to be correct—eventually.

Key terms
  • Event time: when the event actually happened.
  • Processing time: when your pipeline sees the event.
  • Watermark: a moving threshold of event time; data older than it is considered final.
  • Allowed lateness: how far behind the watermark you still accept updates.
  • Idempotent load: running it multiple times yields the same result.
  • Backfill: reprocessing historical partitions to incorporate late/corrected data.
  • Upsert/MERGE: insert new rows, update existing ones deterministically.

Mental model

Picture a conveyor belt of dated envelopes (events). You sort them into bins by the envelope date (event_date), not when they arrived. You keep the bin for the recent days open (allowed lateness) and seal older bins (watermark passed). If something rare arrives far later, you run a special re-sort (backfill) for that specific bin.
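
The conveyor-belt picture can be sketched in a few lines of Python. This is a toy sketch, not a real scheduler: `route_event` and the 3-day `allowed_lateness_days` default are illustrative assumptions.

```python
from datetime import date, timedelta

def route_event(event_date, today, allowed_lateness_days=3):
    """Route an event to its bin, or flag it for an explicit backfill."""
    watermark = today - timedelta(days=allowed_lateness_days)
    if event_date >= watermark:
        return "open_bin"        # recent bin: still accepting updates
    return "needs_backfill"      # sealed bin: requires a deliberate re-sort

today = date(2025, 1, 10)
print(route_event(date(2025, 1, 9), today))   # one day late: open_bin
print(route_event(date(2025, 1, 2), today))   # far too late: needs_backfill
```

Events inside the allowed-lateness window flow through the normal path; older ones are routed to an explicit backfill instead of silently mutating sealed bins.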

Design patterns for late data

1) Watermarks and allowed lateness

  • Streaming: compute windows by event time; set allowed lateness (e.g., 3 days). Emit updates until the watermark passes, then finalize.
  • Batch: on each daily run, reprocess the last N days of event_date partitions (sliding window) to capture late arrivals.

2) Upserts with a deterministic rule

  • Use a stable business key (e.g., order_id) plus event_ts to decide the winner.
  • Rule of thumb: keep the row with the greatest event_ts (or highest version).

3) Deduplication

  • Exact duplicates: same business key and event_ts; keep one.
  • Near-duplicates: choose a tie-breaker (ingestion_id, sequence, or latest event_ts).
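
Patterns 2 and 3 can be combined in one deduplication pass. A minimal Python sketch, assuming rows carry an `ingestion_id` that breaks exact `event_ts` ties (field names are illustrative):

```python
def pick_winners(rows):
    """Collapse staging rows to one winner per business key.

    Tie-breaker: greatest event_ts, then greatest ingestion_id on exact
    event_ts ties, so the result is deterministic across re-runs.
    """
    winners = {}
    for r in rows:
        key = r["order_id"]
        best = winners.get(key)
        if best is None or (r["event_ts"], r["ingestion_id"]) > (best["event_ts"], best["ingestion_id"]):
            winners[key] = r
    return list(winners.values())

staged = [
    {"order_id": 1, "event_ts": "2025-01-05T10:00", "ingestion_id": 7, "status": "placed"},
    {"order_id": 1, "event_ts": "2025-01-06T09:00", "ingestion_id": 8, "status": "shipped"},
    {"order_id": 1, "event_ts": "2025-01-06T09:00", "ingestion_id": 9, "status": "shipped"},
]
print(pick_winners(staged))  # a single row for order 1
```

ISO-8601 timestamp strings compare correctly as plain strings, which keeps the tuple comparison simple here.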

4) Backfills and reprocessing

  • Partition downstream fact tables by event_date.
  • When late data arrives, re-run only affected partitions (e.g., last 7–14 days).
  • Keep backfills idempotent and repeatable.

5) Slowly Changing Dimensions (SCD)

  • Type 2: use valid_from/valid_to and a current flag. A late update effective in the past should close the overlapping prior version and insert a corrected one.
  • Type 1: overwrite if history isn’t needed.

6) Append-only versioned facts

  • Store corrections as new rows with a correction_type (original, late, cancel, replace) and compute the current truth in views.
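
One way to "compute the current truth in views" can be sketched in Python. Assumptions (not from the source): a later `event_ts` supersedes earlier rows for the same key, and a `cancel` row removes the order from the current view.

```python
def current_truth(rows):
    """Reduce append-only correction rows to the current state per order."""
    latest = {}
    for r in sorted(rows, key=lambda r: r["event_ts"]):
        latest[r["order_id"]] = r                 # later rows overwrite earlier
    return {k: v for k, v in latest.items()
            if v["correction_type"] != "cancel"}  # cancels drop out entirely

facts = [
    {"order_id": 1, "amount": 100, "event_ts": 1, "correction_type": "original"},
    {"order_id": 1, "amount": 120, "event_ts": 2, "correction_type": "replace"},
    {"order_id": 2, "amount": 50,  "event_ts": 1, "correction_type": "original"},
    {"order_id": 2, "amount": 0,   "event_ts": 3, "correction_type": "cancel"},
]
print(current_truth(facts))  # order 1 survives with its replacement amount
```

In a warehouse this logic typically lives in a view or window function over the append-only fact table, so the raw history stays immutable.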

7) Reconciliation and monitoring

  • Maintain audit counts per event_date (source vs processed) to spot gaps.
  • Alert when lateness exceeds the agreed threshold.
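
A reconciliation check reduces to comparing per-day counts. A minimal sketch; the dictionaries stand in for audit-table queries, and `tolerance` is an assumed policy knob:

```python
def reconcile(source_counts, processed_counts, tolerance=0):
    """Compare per-event_date counts; return dates whose gap exceeds tolerance."""
    gaps = {}
    for day, expected in source_counts.items():
        diff = expected - processed_counts.get(day, 0)
        if abs(diff) > tolerance:
            gaps[day] = diff   # positive: missing rows; negative: double-counted
    return gaps

source = {"2025-01-01": 1000, "2025-01-02": 980}
warehouse = {"2025-01-01": 1000, "2025-01-02": 950}
print(reconcile(source, warehouse))  # {'2025-01-02': 30}
```

An alerting job would run this per event_date and page when the gap persists past the allowed-lateness window.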

Worked examples

Example 1 — Streaming window revenue with allowed lateness

Goal: compute daily revenue by event time, accepting events up to 3 days late.

// Pseudocode
window = tumble(event_time, 1 day)
allowed_lateness = interval '3' day

// Emit updates when late events arrive within 3 days
// After watermark passes (event_time < current_time - 3d), close the day.

Outcome: dashboards update for three days, then the day is finalized.
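
The pseudocode above can be made concrete with a toy event-time aggregator. This sketch assumes daily bins and a 3-day allowed lateness; a real streaming engine derives watermarks from observed event times rather than wall-clock `now`.

```python
from datetime import date, timedelta

class DailyRevenue:
    """Toy event-time aggregator with allowed lateness."""

    def __init__(self, allowed_lateness_days=3):
        self.lateness = timedelta(days=allowed_lateness_days)
        self.totals = {}        # event_date -> revenue
        self.finalized = set()  # sealed days

    def on_event(self, event_date, amount, now):
        if event_date < now - self.lateness:
            return "dropped"    # beyond allowed lateness; handle via backfill
        self.totals[event_date] = self.totals.get(event_date, 0) + amount
        return "applied"

    def advance_watermark(self, now):
        for day in list(self.totals):
            if day < now - self.lateness:
                self.finalized.add(day)  # no further in-band updates

agg = DailyRevenue()
today = date(2025, 1, 10)
agg.on_event(date(2025, 1, 9), 40, today)          # on time: applied
agg.on_event(date(2025, 1, 8), 10, today)          # 2 days late: applied
late = agg.on_event(date(2025, 1, 1), 99, today)   # 9 days late: dropped
agg.advance_watermark(today)
print(agg.totals, late)
```

The late event is rejected rather than silently mutating a finalized day, mirroring the "emit updates, then close the day" behavior described above.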

Example 2 — Batch sliding window backfill

Goal: Each night, recompute the last 7 event_date partitions.

-- Pseudocode for scheduler
for d in dates(today - 6, today):
  rebuild_partition('fact_orders', event_date = d)

Outcome: any late events within a week are included automatically.
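
The scheduler pseudocode above, expressed as runnable Python. `rebuild_partition` is a hypothetical stand-in for whatever your orchestrator actually calls:

```python
from datetime import date, timedelta

def partitions_to_rebuild(today, window_days=7):
    """Return the sliding window of event_date partitions, oldest first."""
    return [today - timedelta(days=i) for i in range(window_days - 1, -1, -1)]

def nightly_run(today, rebuild_partition, window_days=7):
    # Safe to re-run as long as rebuild_partition fully replaces each
    # partition (an idempotent rebuild, not an append).
    for d in partitions_to_rebuild(today, window_days):
        rebuild_partition("fact_orders", d)

rebuilt = []
nightly_run(date(2025, 1, 10), lambda table, d: rebuilt.append(d))
print(rebuilt[0], rebuilt[-1])  # 2025-01-04 2025-01-10
```

Widening the window is then a one-parameter change, which matters for incidents like the mini challenge below.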

Example 3 — Idempotent MERGE for late updates

Goal: keep the latest event per order_id based on event_ts.

MERGE INTO warehouse.orders t
USING (
  -- one row per order_id: keep the latest event_ts (dedup before MERGE)
  SELECT order_id, amount, status, event_ts
  FROM (
    SELECT s.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_ts DESC) AS rn
    FROM staging.orders_incr s
  ) d
  WHERE rn = 1
) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.event_ts > t.event_ts THEN
  UPDATE SET amount = s.amount,
             status = s.status,
             event_ts = s.event_ts
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status, event_ts)
  VALUES (s.order_id, s.amount, s.status, s.event_ts);

Outcome: late records correctly update prior rows; re-running is safe.

Example 4 — SCD Type 2 with a late attribute change

-- A customer's segment changed effective 2024-12-15, but the update arrived on 2024-12-20.
-- Close the currently open version, then insert the corrected one.
-- Intervals are half-open [valid_from, valid_to), so the old valid_to equals the new valid_from.
UPDATE dim_customer
SET valid_to = '2024-12-15', is_current = false
WHERE customer_id = :id AND is_current = true AND valid_from < '2024-12-15';

INSERT INTO dim_customer (customer_id, segment, valid_from, valid_to, is_current)
VALUES (:id, :new_segment, '2024-12-15', null, true);

Outcome: queries by event_date see the correct segment historically.

Step-by-step: choose a strategy

  1. Map lateness. Measure arrival lag percentiles (P50/P90/P99) between event_time and ingestion_time.
  2. Pick an allowed lateness. Choose N days that balances completeness and timeliness (e.g., P99).
  3. Partition by event_date. Enables selective backfills.
  4. Define upsert rule. Decide conflict resolution (latest event_ts or version).
  5. Automate sliding reprocessing. Rebuild last N partitions every run.
  6. Add reconciliation. Compare source vs target counts and amounts per day.
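
Step 1 is just a percentile computation over observed arrival lags. A nearest-rank sketch; the lag sample is fabricated, and in practice you would query `ingestion_time - event_time` per event:

```python
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list (0 < p <= 100)."""
    rank = math.ceil(p / 100 * len(sorted_vals))
    return sorted_vals[rank - 1]

# Arrival lag in hours per event (toy sample, already sorted).
lags = sorted([1, 2, 2, 3, 5, 8, 12, 26, 50, 120])
p50, p90, p99 = (percentile(lags, p) for p in (50, 90, 99))
print(p50, p90, p99)  # P99 in hours then suggests allowed lateness in days
```

Rounding P99 up to whole days (here, 120 hours → 5 days) gives a defensible default for the allowed-lateness and sliding-window settings in steps 2 and 5.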

Exercises

Do these to practice, then take the quick test at the end.

Exercise 1 — Write an idempotent MERGE for late orders (ex1)

Tables:

  • stg_orders(order_id, amount, status, event_ts)
  • fact_orders(order_id, amount, status, event_ts)

Task: Upsert so the row with the greatest event_ts wins. Deduplicate exact duplicates in staging by keeping the latest event_ts per order_id before the MERGE.

Exercise 2 — Plan a sliding window backfill (ex2)

Scenario: 95% of events arrive within 2 days, 99% within 5 days. Pick a default window for daily reprocessing and explain how you’d handle the rare >5 day arrivals.

Exercise checklist

  • Upsert is idempotent and uses event_ts to resolve conflicts.
  • Staging dedup selected the latest event per business key.
  • Backfill window aligns with observed lateness (P99).
  • Plan includes a path for extremely late records (manual or scheduled deep backfill).

Common mistakes and self-check

  • Using processing time instead of event time. Self-check: do your partitions and aggregations key off event_time?
  • No dedup before MERGE. Self-check: do you group staging by key to one winner?
  • Unbounded backfills. Self-check: are you limiting reprocessing to a sliding window, unless an explicit deep backfill ticket exists?
  • Non-deterministic updates. Self-check: is your tie-breaker stable (event_ts, version)?
  • Finalizing too early. Self-check: does your watermark match observed P99 lateness?

Practical projects

  • Build a daily revenue fact with a 7-day sliding backfill; publish a metric showing how much each day changed after late arrivals.
  • Implement SCD Type 2 for customer segments; write queries that return the correct segment for any event_date.
  • Create an audit table logging counts and sums per event_date for source vs warehouse; raise alerts when gaps exceed a threshold.

Mini challenge

You discover that a partner batch is delayed by 4 days this week. Your pipeline currently reprocesses the last 3 days and finalizes days older than that. What minimal change would you make to avoid undercounting this week while keeping cost in check? Write your answer as a short plan (1–3 steps).

Tip: Think about temporarily adjusting allowed lateness or scheduling a one-off backfill for the impacted dates.

Learning path

  • Start: Idempotent upserts and dedup by business key + event_ts.
  • Add: Sliding window backfills (N-day reprocessing) based on observed lateness.
  • Next: Watermarks and allowed lateness in streaming jobs.
  • Then: SCD Type 2 for dimensions with late updates.
  • Finally: Reconciliation dashboards and automated alerts.

Next steps

  • Complete the exercises, then take the quick test below.
  • Apply one pattern in a sandbox project (MERGE + 7-day backfill).
  • Review your organization’s lateness percentiles and propose an allowed lateness policy.

Practice Exercises

Instructions

Write SQL that upserts rows from stg_orders into fact_orders so the record with the greatest event_ts per order_id wins. First deduplicate stg_orders to one row per order_id with the max(event_ts). Then MERGE into the target using event_ts as the tie-breaker.

Tables:

  • stg_orders(order_id, amount, status, event_ts)
  • fact_orders(order_id, amount, status, event_ts)

Expected Output
After running, fact_orders contains one row per order_id reflecting the latest event_ts, with correct amount/status. Re-running yields the same result.

Handling Late Arriving Data — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.
