Who this is for
- Data engineers building batch or streaming pipelines.
- Analytics engineers/BI developers who need accurate, backfillable metrics.
- Platform engineers designing ingestion and warehouse models.
Prerequisites
- Comfort with SQL (SELECT, JOIN, GROUP BY, MERGE/UPSERT).
- Basic understanding of partitioned data and incremental loads.
- Familiarity with event timestamps vs load time.
Why this matters
In real systems, data often arrives late due to mobile offline usage, partner delays, retries, or upstream outages. As a Data Engineer, you will:
- Produce accurate daily/weekly metrics even when events come days late.
- Design idempotent upserts and backfills without double counting.
- Set watermarks so reports finalize on time but still accept legitimate late data.
- Maintain dimensional history when updates appear out of order.
Concept explained simply
Late-arriving data is any record that turns up after the period it belongs to has already been processed, i.e., its event time falls in a day or window you have already computed. Think of sending birthday cards: some arrive on the day, others arrive days later. You still want the count of cards per day to be correct, eventually.
Key terms
- Event time: when the event actually happened.
- Processing time: when your pipeline sees the event.
- Watermark: a moving threshold of event time; data older than it is considered final.
- Allowed lateness: how far behind the watermark you still accept updates.
- Idempotent load: running it multiple times yields the same result.
- Backfill: reprocessing historical partitions to incorporate late/corrected data.
- Upsert/MERGE: insert new rows, update existing ones deterministically.
Mental model
Picture a conveyor belt of dated envelopes (events). You sort them into bins by the envelope date (event_date), not by when they arrived. You keep the bins for recent days open (allowed lateness) and seal older bins (the watermark has passed them). If something rare arrives far later, you run a special re-sort (backfill) for that specific bin.
Design patterns for late data
1) Watermarks and allowed lateness
- Streaming: compute windows by event time; set allowed lateness (e.g., 3 days). Emit updates until the watermark passes, then finalize.
- Batch: on each daily run, reprocess the last N days of event_date partitions (sliding window) to capture late arrivals.
2) Upserts with a deterministic rule
- Use a stable business key (e.g., order_id) plus event_ts to decide the winner.
- Rule of thumb: keep the row with the greatest event_ts (or highest version).
3) Deduplication
- Exact duplicates: same business key and event_ts; keep one.
- Near-duplicates: choose a tie-breaker (ingestion_id, sequence, or latest event_ts).
4) Backfills and reprocessing
- Partition downstream fact tables by event_date.
- When late data arrives, re-run only affected partitions (e.g., last 7–14 days).
- Keep backfills idempotent and repeatable.
5) Slowly Changing Dimensions (SCD)
- Type 2: use valid_from/valid_to and a current flag. A late update that takes effect in the past should close the overlapping prior version and insert a corrected version.
- Type 1: overwrite if history isn’t needed.
6) Append-only versioned facts
- Store corrections as new rows with a correction_type (original, late, cancel, replace) and compute the current truth in views.
7) Reconciliation and monitoring
- Maintain audit counts per event_date (source vs processed) to spot gaps; see the query sketch after this list.
- Alert when lateness exceeds the agreed threshold.
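For the reconciliation pattern, the audit can be a per-day comparison of counts and sums between source and warehouse. A sketch in generic SQL (source_orders is an assumed table name, and IS DISTINCT FROM may need a dialect-specific equivalent):
-- Flag event_dates where source and warehouse disagree
SELECT COALESCE(s.event_date, t.event_date) AS event_date,
       s.cnt   AS source_rows,   t.cnt   AS warehouse_rows,
       s.total AS source_amount, t.total AS warehouse_amount
FROM (SELECT event_date, COUNT(*) AS cnt, SUM(amount) AS total
      FROM source_orders GROUP BY event_date) s
FULL OUTER JOIN
     (SELECT event_date, COUNT(*) AS cnt, SUM(amount) AS total
      FROM fact_orders GROUP BY event_date) t
  ON s.event_date = t.event_date
WHERE s.cnt IS DISTINCT FROM t.cnt
   OR s.total IS DISTINCT FROM t.total;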
Worked examples
Example 1 — Streaming window revenue with allowed lateness
Goal: daily revenue by event_time, accept events up to 3 days late.
// Pseudocode
window = tumble(event_time, 1 day)
allowed_lateness = interval '3' day
// Emit updates when late events arrive within 3 days
// Once the watermark passes the window end (roughly the max observed event_time minus 3 days), close the day.
Outcome: dashboards update for three days, then the day is finalized.
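The same pipeline sketched concretely in Flink SQL (connector details are elided, and the table and column names are assumptions, not prescribed):
-- Sketch in Flink SQL; connector settings and names are illustrative
CREATE TABLE orders (
  order_id STRING,
  amount   DECIMAL(10, 2),
  event_ts TIMESTAMP(3),
  -- The watermark trails the max observed event time by 3 days,
  -- so events up to 3 days late still update their window.
  WATERMARK FOR event_ts AS event_ts - INTERVAL '3' DAY
) WITH (
  'connector' = 'kafka'
  -- topic, brokers, and format omitted
);

SELECT window_start AS event_date, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(event_ts), INTERVAL '1' DAY))
GROUP BY window_start, window_end;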
Example 2 — Batch sliding window backfill
Goal: Each night, recompute the last 7 event_date partitions.
-- Pseudocode for scheduler
for d in dates(today - 6, today):
    rebuild_partition('fact_orders', event_date = d)
Outcome: any late events within a week are included automatically.
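A minimal rebuild_partition is a transactional delete-and-insert per partition. A sketch in generic SQL, assuming a stg_orders staging table and an event_date derived from event_ts:
-- Rebuild one event_date partition; safe to re-run (idempotent)
BEGIN;
DELETE FROM fact_orders
WHERE event_date = :d;
INSERT INTO fact_orders (order_id, amount, status, event_ts, event_date)
SELECT order_id, amount, status, event_ts, CAST(event_ts AS DATE)
FROM stg_orders
WHERE CAST(event_ts AS DATE) = :d;
COMMIT;
Engines that support INSERT OVERWRITE on a partition can collapse the delete-and-insert pair into one atomic statement.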
Example 3 — Idempotent MERGE for late updates
Goal: keep the latest event per order_id based on event_ts.
-- Assumes staging holds at most one row per order_id (dedup first if not)
MERGE INTO warehouse.orders t
USING staging.orders_incr s
ON t.order_id = s.order_id
-- A late-arriving record wins only if it is newer than the stored row
WHEN MATCHED AND s.event_ts > t.event_ts THEN
  UPDATE SET amount   = s.amount,
             status   = s.status,
             event_ts = s.event_ts
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status, event_ts)
  VALUES (s.order_id, s.amount, s.status, s.event_ts);
Outcome: late records correctly update prior rows; re-running is safe.
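One caveat: most MERGE implementations fail if the source has more than one row per order_id, so collapse staging to a single winner first. A sketch using ROW_NUMBER (ingestion_id is an assumed tie-breaker column; any stable, monotonically increasing column works):
-- Keep only the latest row per order_id before merging
WITH ranked AS (
  SELECT s.*,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY event_ts DESC, ingestion_id DESC
         ) AS rn
  FROM staging.orders_incr s
)
SELECT order_id, amount, status, event_ts
FROM ranked
WHERE rn = 1;
Use this query as the USING source in place of staging.orders_incr.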
Example 4 — SCD Type 2 with a late attribute change
-- A segment change effective 2024-12-15 arrives late on 2024-12-20.
-- Close the prior version and insert the corrected one. Intervals are
-- half-open (valid_to equals the next version's valid_from); assumes no
-- newer version already exists for this customer.
UPDATE dim_customer
SET valid_to = '2024-12-15', is_current = false
WHERE customer_id = :id AND is_current = true AND valid_from < '2024-12-15';
INSERT INTO dim_customer (customer_id, segment, valid_from, valid_to, is_current)
VALUES (:id, :new_segment, '2024-12-15', null, true);
Outcome: queries by event_date see the correct segment historically.
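The historical lookup itself is a point-in-time join on the validity interval. A sketch, assuming half-open intervals (valid_to is NULL on the current row) and an illustrative fact_events table:
-- Resolve each fact row to the segment effective on its event_date
SELECT f.event_date, f.customer_id, d.segment
FROM fact_events f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND f.event_date >= d.valid_from
 AND (f.event_date < d.valid_to OR d.valid_to IS NULL);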
Step-by-step: choose a strategy
- Map lateness. Measure arrival lag percentiles (P50/P90/P99) between event_time and ingestion_time; a query sketch follows this list.
- Pick an allowed lateness. Choose N days that balances completeness and timeliness (e.g., P99).
- Partition by event_date. Enables selective backfills.
- Define upsert rule. Decide conflict resolution (latest event_ts or version).
- Automate sliding reprocessing. Rebuild last N partitions every run.
- Add reconciliation. Compare source vs target counts and amounts per day.
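For the first step, a starting point for measuring lag; a sketch assuming PostgreSQL-style PERCENTILE_CONT and EXTRACT(EPOCH ...), with raw_orders and ingestion_ts as assumed names:
-- Arrival lag percentiles (hours) per event_date
SELECT CAST(event_ts AS DATE) AS event_date,
       PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY lag_hours) AS p50,
       PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY lag_hours) AS p90,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY lag_hours) AS p99
FROM (
  SELECT event_ts,
         EXTRACT(EPOCH FROM (ingestion_ts - event_ts)) / 3600.0 AS lag_hours
  FROM raw_orders
) lags
GROUP BY CAST(event_ts AS DATE);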
Exercises
Work through these to practice the patterns above.
Exercise 1 — Write an idempotent MERGE for late orders
Tables:
stg_orders(order_id, amount, status, event_ts)
fact_orders(order_id, amount, status, event_ts)
Task: Upsert so the row with the greatest event_ts wins. Deduplicate exact duplicates in staging by keeping the latest event_ts per order_id before the MERGE.
Exercise 2 — Plan a sliding window backfill
Scenario: 95% of events arrive within 2 days, 99% within 5 days. Pick a default window for daily reprocessing and explain how you’d handle the rare >5 day arrivals.
Exercise checklist
- Upsert is idempotent and uses event_ts to resolve conflicts.
- Staging dedup selected the latest event per business key.
- Backfill window aligns with observed lateness (P99).
- Plan includes a path for extremely late records (manual or scheduled deep backfill).
Common mistakes and self-check
- Using processing time instead of event time. Self-check: do your partitions and aggregations key off event_time?
- No dedup before MERGE. Self-check: do you group staging by key to one winner?
- Unbounded backfills. Self-check: are you limiting reprocessing to a sliding window, unless an explicit deep backfill ticket exists?
- Non-deterministic updates. Self-check: is your tie-breaker stable (event_ts, version)?
- Finalizing too early. Self-check: does your watermark match observed P99 lateness?
Practical projects
- Build a daily revenue fact with a 7-day sliding backfill; publish a metric showing how much each day changed after late arrivals.
- Implement SCD Type 2 for customer segments; write queries that return the correct segment for any event_date.
- Create an audit table logging counts and sums per event_date for source vs warehouse; raise alerts when gaps exceed a threshold.
Mini challenge
You discover that a partner batch is delayed by 4 days this week. Your pipeline currently reprocesses the last 3 days and finalizes days older than that. What minimal change would you make to avoid undercounting this week while keeping cost in check? Write your answer as a short plan (1–3 steps).
Tip: Think about temporarily adjusting allowed lateness or scheduling a one-off backfill for the impacted dates.
Learning path
- Start: Idempotent upserts and dedup by business key + event_ts.
- Add: Sliding window backfills (N-day reprocessing) based on observed lateness.
- Next: Watermarks and allowed lateness in streaming jobs.
- Then: SCD Type 2 for dimensions with late updates.
- Finally: Reconciliation dashboards and automated alerts.
Next steps
- Complete the exercises, then take the quick test below.
- Apply one pattern in a sandbox project (MERGE + 7-day backfill).
- Review your organization’s lateness percentiles and propose an allowed lateness policy.