Who this is for
- Data engineers building batch or streaming pipelines.
- Analytics engineers/BI developers who need accurate, backfillable metrics.
- Platform engineers designing ingestion and warehouse models.
Prerequisites
- Comfort with SQL (SELECT, JOIN, GROUP BY, MERGE/UPSERT).
- Basic understanding of partitioned data and incremental loads.
- Familiarity with event timestamps vs load time.
Why this matters
In real systems, data often arrives late due to mobile offline usage, partner delays, retries, or upstream outages. As a Data Engineer, you will:
- Produce accurate daily/weekly metrics even when events come days late.
- Design idempotent upserts and backfills without double counting.
- Set watermarks so reports finalize on time but still accept legitimate late data.
- Maintain dimensional history when updates appear out of order.
Concept explained simply
Late-arriving data is any record that turns up after the period it belongs to has already been processed, i.e., its event time falls in a day or window you have already computed. Think of sending birthday cards: some arrive on the day, others arrive days later. You still want the count of cards per day to be correct, eventually.
Key terms
- Event time: when the event actually happened.
- Processing time: when your pipeline sees the event.
- Watermark: a moving threshold of event time; data older than it is considered final.
- Allowed lateness: how far behind the watermark you still accept updates.
- Idempotent load: running it multiple times yields the same result.
- Backfill: reprocessing historical partitions to incorporate late/corrected data.
- Upsert/MERGE: insert new rows, update existing ones deterministically.
Mental model
Picture a conveyor belt of dated envelopes (events). You sort them into bins by the envelope date (event_date), not by when they arrived. You keep the bins for recent days open (allowed lateness) and seal older bins (the watermark has passed them). If something rare arrives far later, you run a special re-sort (backfill) for that specific bin.
Design patterns for late data
1) Watermarks and allowed lateness
- Streaming: compute windows by event time; set allowed lateness (e.g., 3 days). Emit updates until the watermark passes, then finalize.
- Batch: on each daily run, reprocess the last N days of event_date partitions (sliding window) to capture late arrivals.
2) Upserts with a deterministic rule
- Use a stable business key (e.g., order_id) plus event_ts to decide the winner.
- Rule of thumb: keep the row with the greatest event_ts (or highest version).
3) Deduplication
- Exact duplicates: same business key and event_ts; keep one.
- Near-duplicates: choose a tie-breaker (ingestion_id, sequence, or latest event_ts).
4) Backfills and reprocessing
- Partition downstream fact tables by event_date.
- When late data arrives, re-run only affected partitions (e.g., last 7–14 days).
- Keep backfills idempotent and repeatable.
5) Slowly Changing Dimensions (SCD)
- Type 2: use valid_from/valid_to and a current flag. A late update that takes effect in the past should close the overlapping prior version and insert a corrected version.
- Type 1: overwrite if history isn’t needed.
6) Append-only versioned facts
- Store corrections as new rows with a correction_type (original, late, cancel, replace) and compute the current truth in views.
7) Reconciliation and monitoring
- Maintain audit counts per event_date (source vs processed) to spot gaps; see the query sketch after this list.
- Alert when lateness exceeds the agreed threshold.
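For the reconciliation pattern, the audit can be a per-day comparison of counts and sums between source and warehouse. A sketch in generic SQL (source_orders is an assumed table name, and IS DISTINCT FROM may need a dialect-specific equivalent):
-- Flag event_dates where source and warehouse disagree
SELECT COALESCE(s.event_date, t.event_date) AS event_date,
       s.cnt   AS source_rows,   t.cnt   AS warehouse_rows,
       s.total AS source_amount, t.total AS warehouse_amount
FROM (SELECT event_date, COUNT(*) AS cnt, SUM(amount) AS total
      FROM source_orders GROUP BY event_date) s
FULL OUTER JOIN
     (SELECT event_date, COUNT(*) AS cnt, SUM(amount) AS total
      FROM fact_orders GROUP BY event_date) t
  ON s.event_date = t.event_date
WHERE s.cnt IS DISTINCT FROM t.cnt
   OR s.total IS DISTINCT FROM t.total;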
Worked examples
Example 1 — Streaming window revenue with allowed lateness
Goal: daily revenue by event_time, accept events up to 3 days late.
// Pseudocode
window = tumble(event_time, 1 day)
allowed_lateness = interval '3' day
// Emit updates when late events arrive within 3 days
// Once the watermark passes the window end (roughly the max observed event_time minus 3 days), close the day.
Outcome: dashboards update for three days, then the day is finalized.
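The same pipeline sketched concretely in Flink SQL (connector details are elided, and the table and column names are assumptions, not prescribed):
-- Sketch in Flink SQL; connector settings and names are illustrative
CREATE TABLE orders (
  order_id STRING,
  amount   DECIMAL(10, 2),
  event_ts TIMESTAMP(3),
  -- The watermark trails the max observed event time by 3 days,
  -- so events up to 3 days late still update their window.
  WATERMARK FOR event_ts AS event_ts - INTERVAL '3' DAY
) WITH (
  'connector' = 'kafka'
  -- topic, brokers, and format omitted
);

SELECT window_start AS event_date, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(event_ts), INTERVAL '1' DAY))
GROUP BY window_start, window_end;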
Example 2 — Batch sliding window backfill
Goal: Each night, recompute the last 7 event_date partitions.
-- Pseudocode for scheduler
for d in dates(today - 6, today):
    rebuild_partition('fact_orders', event_date = d)
Outcome: any late events within a week are included automatically.
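A minimal rebuild_partition is a transactional delete-and-insert per partition. A sketch in generic SQL, assuming a stg_orders staging table and an event_date derived from event_ts:
-- Rebuild one event_date partition; safe to re-run (idempotent)
BEGIN;
DELETE FROM fact_orders
WHERE event_date = :d;
INSERT INTO fact_orders (order_id, amount, status, event_ts, event_date)
SELECT order_id, amount, status, event_ts, CAST(event_ts AS DATE)
FROM stg_orders
WHERE CAST(event_ts AS DATE) = :d;
COMMIT;
Engines that support INSERT OVERWRITE on a partition can collapse the delete-and-insert pair into one atomic statement.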
Example 3 — Idempotent MERGE for late updates
Goal: keep the latest event per order_id based on event_ts.
-- Assumes staging holds at most one row per order_id (dedup first if not)
MERGE INTO warehouse.orders t
USING staging.orders_incr s
ON t.order_id = s.order_id
-- A late-arriving record wins only if it is newer than the stored row
WHEN MATCHED AND s.event_ts > t.event_ts THEN
  UPDATE SET amount   = s.amount,
             status   = s.status,
             event_ts = s.event_ts
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status, event_ts)
  VALUES (s.order_id, s.amount, s.status, s.event_ts);
Outcome: late records correctly update prior rows; re-running is safe.
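One caveat: most MERGE implementations fail if the source has more than one row per order_id, so collapse staging to a single winner first. A sketch using ROW_NUMBER (ingestion_id is an assumed tie-breaker column; any stable, monotonically increasing column works):
-- Keep only the latest row per order_id before merging
WITH ranked AS (
  SELECT s.*,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY event_ts DESC, ingestion_id DESC
         ) AS rn
  FROM staging.orders_incr s
)
SELECT order_id, amount, status, event_ts
FROM ranked
WHERE rn = 1;
Use this query as the USING source in place of staging.orders_incr.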
Example 4 — SCD Type 2 with a late attribute change
-- A segment change effective 2024-12-15 arrives late on 2024-12-20.
-- Close the prior version and insert the corrected one. Intervals are
-- half-open (valid_to equals the next version's valid_from); assumes no
-- newer version already exists for this customer.
UPDATE dim_customer
SET valid_to = '2024-12-15', is_current = false
WHERE customer_id = :id AND is_current = true AND valid_from < '2024-12-15';
INSERT INTO dim_customer (customer_id, segment, valid_from, valid_to, is_current)
VALUES (:id, :new_segment, '2024-12-15', null, true);
Outcome: queries by event_date see the correct segment historically.
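The historical lookup itself is a point-in-time join on the validity interval. A sketch, assuming half-open intervals (valid_to is NULL on the current row) and an illustrative fact_events table:
-- Resolve each fact row to the segment effective on its event_date
SELECT f.event_date, f.customer_id, d.segment
FROM fact_events f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND f.event_date >= d.valid_from
 AND (f.event_date < d.valid_to OR d.valid_to IS NULL);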
Step-by-step: choose a strategy
- Map lateness. Measure arrival lag percentiles (P50/P90/P99) between event_time and ingestion_time; a query sketch follows this list.
- Pick an allowed lateness. Choose N days that balances completeness and timeliness (e.g., P99).
- Partition by event_date. Enables selective backfills.
- Define upsert rule. Decide conflict resolution (latest event_ts or version).
- Automate sliding reprocessing. Rebuild last N partitions every run.
- Add reconciliation. Compare source vs target counts and amounts per day.
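For the first step, a starting point for measuring lag; a sketch assuming PostgreSQL-style PERCENTILE_CONT and EXTRACT(EPOCH ...), with raw_orders and ingestion_ts as assumed names:
-- Arrival lag percentiles (hours) per event_date
SELECT CAST(event_ts AS DATE) AS event_date,
       PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY lag_hours) AS p50,
       PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY lag_hours) AS p90,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY lag_hours) AS p99
FROM (
  SELECT event_ts,
         EXTRACT(EPOCH FROM (ingestion_ts - event_ts)) / 3600.0 AS lag_hours
  FROM raw_orders
) lags
GROUP BY CAST(event_ts AS DATE);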
Exercises
Work through these to practice the patterns above.
Exercise 1 — Write an idempotent MERGE for late orders
Tables:
stg_orders(order_id, amount, status, event_ts)
fact_orders(order_id, amount, status, event_ts)
Task: Upsert so the row with the greatest event_ts wins. Deduplicate exact duplicates in staging by keeping the latest event_ts per order_id before the MERGE.
Exercise 2 — Plan a sliding window backfill
Scenario: 95% of events arrive within 2 days, 99% within 5 days. Pick a default window for daily reprocessing and explain how you’d handle the rare >5 day arrivals.
Exercise checklist
- Upsert is idempotent and uses event_ts to resolve conflicts.
- Staging dedup selected the latest event per business key.
- Backfill window aligns with observed lateness (P99).
- Plan includes a path for extremely late records (manual or scheduled deep backfill).
Common mistakes and self-check
- Using processing time instead of event time. Self-check: do your partitions and aggregations key off event_time?
- No dedup before MERGE. Self-check: do you group staging by key to one winner?
- Unbounded backfills. Self-check: are you limiting reprocessing to a sliding window, unless an explicit deep backfill ticket exists?
- Non-deterministic updates. Self-check: is your tie-breaker stable (event_ts, version)?
- Finalizing too early. Self-check: does your watermark match observed P99 lateness?
Practical projects
- Build a daily revenue fact with a 7-day sliding backfill; publish a metric showing how much each day changed after late arrivals.
- Implement SCD Type 2 for customer segments; write queries that return the correct segment for any event_date.
- Create an audit table logging counts and sums per event_date for source vs warehouse; raise alerts when gaps exceed a threshold.
Mini challenge
You discover that a partner batch is delayed by 4 days this week. Your pipeline currently reprocesses the last 3 days and finalizes days older than that. What minimal change would you make to avoid undercounting this week while keeping cost in check? Write your answer as a short plan (1–3 steps).
Tip: Think about temporarily adjusting allowed lateness or scheduling a one-off backfill for the impacted dates.
Learning path
- Start: Idempotent upserts and dedup by business key + event_ts.
- Add: Sliding window backfills (N-day reprocessing) based on observed lateness.
- Next: Watermarks and allowed lateness in streaming jobs.
- Then: SCD Type 2 for dimensions with late updates.
- Finally: Reconciliation dashboards and automated alerts.
Next steps
- Complete the exercises, then take the quick test below.
- Apply one pattern in a sandbox project (MERGE + 7-day backfill).
- Review your organization’s lateness percentiles and propose an allowed lateness policy.