Why this matters
In real pipelines, events and facts do not always arrive on time or in order. Payments can post hours later, mobile events sync days after a user comes back online, and upstream backfills can replay old data. As an Analytics Engineer, you must keep metrics correct when data is late by writing SQL that deduplicates records, merges safely, and recomputes aggregates correctly.
- Daily revenue should update when a late payment arrives.
- User funnels should count each user once, even if events come out of order.
- Dimensions must maintain correct history when facts land late.
Concept explained simply
Late-arriving data is data that belongs to an earlier point in time (event_time) but is processed later (ingest_time). It may also arrive out of order or be replayed multiple times. Your job is to make transformations idempotent (safe to run repeatedly) and time-aware.
Mental model
Think of your pipeline like a train station:
- event_time: the scheduled departure printed on the ticket.
- ingest_time: when the passenger actually walks into the station.
- watermark: the time after which you consider a train's schedule closed (e.g., we accept events up to 7 days late).
- idempotency: if a passenger shows the same ticket twice, you still count them once.
Key definitions
- Deduplication: keep the best row per business key (e.g., order_id), usually the one with the latest event_time or a higher version number.
- Upsert/MERGE: insert new keys; update existing keys when late data modifies them.
- Partition repair: when late data arrives for 2025-01-10, recompute just that day (or a rolling window) instead of the whole table.
- Watermark window: the maximum lateness you support (e.g., 7 days). Older events go to an adjustments table or a special backfill process.
Core SQL patterns you will use
The snippets below use generic SQL; MERGE, INTERVAL, and date syntax differ slightly across warehouses, so adjust the details to your dialect.
1) Deduplicate late-arriving events
WITH ranked AS (
SELECT
e.*,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY event_time DESC, load_ts DESC
) AS rn
FROM staging_events e
)
SELECT *
FROM ranked
WHERE rn = 1;
Why it works: For each business key (order_id), you keep only the most authoritative row. Order by event_time first; break ties with load_ts or a version field so the result is deterministic.
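In dialects that support the QUALIFY clause (Snowflake and DuckDB, for example), the same dedup can be written without a CTE. This is just an alternative form of the pattern above:
SELECT e.*
FROM staging_events e
-- QUALIFY filters on the window function directly, no CTE needed
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY event_time DESC, load_ts DESC
) = 1;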
2) Idempotent upsert with MERGE
MERGE INTO fact_orders f
USING deduped_events s
ON f.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
f.status = s.status,
f.amount = s.amount,
f.event_time = s.event_time,
f.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (order_id, status, amount, event_time, updated_at)
VALUES (s.order_id, s.status, s.amount, s.event_time, CURRENT_TIMESTAMP);
This handles new orders and fixes existing ones if late rows change their attributes.
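One caveat worth guarding against: the staged batch can contain a late row that is older than what already sits in fact_orders (for example, a replayed 'created' event arriving after the 'paid' row was merged). A hedged variant adds a condition so older rows never overwrite newer ones; the extra predicate assumes event_time is a reliable ordering column for your source:
MERGE INTO fact_orders f
USING deduped_events s
ON f.order_id = s.order_id
-- only update when the incoming row is at least as new as the stored one
WHEN MATCHED AND s.event_time >= f.event_time THEN UPDATE SET
  f.status = s.status,
  f.amount = s.amount,
  f.event_time = s.event_time,
  f.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (order_id, status, amount, event_time, updated_at)
VALUES (s.order_id, s.status, s.amount, s.event_time, CURRENT_TIMESTAMP);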
3) Aggregate with lateness in mind
-- Recompute only the days that could be affected by new late data
WITH affected_days AS (
SELECT DISTINCT CAST(event_time AS DATE) AS d
FROM deduped_events
WHERE event_time >= CURRENT_DATE - INTERVAL '7' DAY -- watermark
),
base AS (
SELECT CAST(event_time AS DATE) AS d, SUM(amount) AS revenue
FROM fact_orders
WHERE CAST(event_time AS DATE) IN (SELECT d FROM affected_days)
GROUP BY 1
)
-- Replace partitions/days in target aggregate
MERGE INTO agg_daily_revenue a
USING base b
ON a.dt = b.d
WHEN MATCHED THEN UPDATE SET a.revenue = b.revenue
WHEN NOT MATCHED THEN INSERT (dt, revenue) VALUES (b.d, b.revenue);
Effect: Only the last 7 days are recalculated. Older days are frozen unless a special backfill runs.
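If MERGE is awkward in your warehouse, or you prefer full partition replacement, the same repair can be done with a delete-and-reinsert of the affected days. A sketch using the same tables; wrap it in a transaction where the engine supports one:
-- Drop the days that will be recomputed
DELETE FROM agg_daily_revenue
WHERE dt >= CURRENT_DATE - INTERVAL '7' DAY;
-- Reinsert them from the canonical fact table
INSERT INTO agg_daily_revenue (dt, revenue)
SELECT CAST(event_time AS DATE) AS dt, SUM(amount) AS revenue
FROM fact_orders
WHERE CAST(event_time AS DATE) >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY 1;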
Worked examples (3)
Example 1 — Late mobile events
Scenario: App is offline; events sync 3 days later. You deduplicate by (user_id, session_id, event_name) keeping the latest event_time.
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY user_id, session_id, event_name
ORDER BY event_time DESC, load_ts DESC
) AS rn
FROM app_events_raw
)
SELECT user_id, session_id, event_name, event_time
FROM ranked
WHERE rn = 1;
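A follow-up sketch: once the events are deduped, funnel counts stay stable after late arrivals because each (user_id, session_id, event_name) combination survives only once. The table name app_events_deduped below is an assumed materialization of the query above:
SELECT
  CAST(event_time AS DATE) AS event_date,
  event_name,
  COUNT(DISTINCT user_id) AS users
FROM app_events_deduped  -- assumed: the deduped output above, persisted as a table or view
GROUP BY 1, 2;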
Example 2 — Late change to an SCD Type 2 dimension
Scenario: A late address change arrives for 2025-02-01; current dimension range is 2025-01-01..9999-12-31.
-- Close the previous row
UPDATE dim_customer d
SET valid_to = DATE '2025-01-31'
WHERE d.customer_id = :cid
AND d.valid_from <= DATE '2025-02-01'
AND d.valid_to > DATE '2025-02-01';
-- Insert the late row
INSERT INTO dim_customer (customer_id, address, valid_from, valid_to, is_current)
VALUES (:cid, :new_address, DATE '2025-02-01', DATE '9999-12-31', TRUE);
-- Ensure only one current row
UPDATE dim_customer
SET is_current = (valid_to = DATE '9999-12-31')
WHERE customer_id = :cid;
Key: Use date ranges, and guard the insert so the whole block is safe to rerun (a rerun-safe variant is sketched below); otherwise a second run would insert a duplicate 2025-02-01 row.
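As written, the two UPDATE statements are rerun-safe but the plain INSERT is not. One way to make it idempotent, sketched here against the same schema (the SELECT without FROM works in most warehouses; add FROM dual or similar where required):
-- Insert the late row only if it is not already present
INSERT INTO dim_customer (customer_id, address, valid_from, valid_to, is_current)
SELECT :cid, :new_address, DATE '2025-02-01', DATE '9999-12-31', TRUE
WHERE NOT EXISTS (
  SELECT 1
  FROM dim_customer d
  WHERE d.customer_id = :cid
    AND d.valid_from = DATE '2025-02-01'
);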
Example 3 — Daily revenue that corrects itself
Rebuild only recent days using a 10-day watermark.
WITH recent_orders AS (
SELECT * FROM fact_orders
WHERE CAST(event_time AS DATE) >= CURRENT_DATE - INTERVAL '10' DAY
),
agg AS (
SELECT CAST(event_time AS DATE) AS dt, SUM(amount) AS revenue
FROM recent_orders
GROUP BY 1
)
MERGE INTO agg_daily_revenue t
USING agg s
ON t.dt = s.dt
WHEN MATCHED THEN UPDATE SET t.revenue = s.revenue
WHEN NOT MATCHED THEN INSERT (dt, revenue) VALUES (s.dt, s.revenue);
Step-by-step: implement in your warehouse
- Create a staging table with raw loads including a load_ts or ingestion_id.
- Build a deduped CTE using ROW_NUMBER() over your business key(s) ordered by event_time, load_ts.
- MERGE deduped rows into a canonical fact table.
- Aggregate into daily/weekly tables by event_time, not load_ts.
- Repair only affected partitions/dates within your watermark window.
- Route data older than the watermark to an adjustments or backfill process (a routing sketch follows this list).
- Schedule incremental runs; ensure queries are idempotent.
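A minimal sketch of that routing step, assuming a late_event_adjustments table that mirrors the staging columns (the table name is illustrative, not part of the pipeline above):
-- Park rows that are older than the watermark for a separate reconciliation job
INSERT INTO late_event_adjustments (order_id, status, amount, event_time, load_ts)
SELECT order_id, status, amount, event_time, load_ts
FROM deduped_events
WHERE CAST(event_time AS DATE) < CURRENT_DATE - INTERVAL '7' DAY;
-- Feed only in-window rows into the normal MERGE (use this as the USING source)
SELECT *
FROM deduped_events
WHERE CAST(event_time AS DATE) >= CURRENT_DATE - INTERVAL '7' DAY;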
Checklist: production readiness
- Business keys clearly defined (e.g., order_id).
- Both event_time and load_ts available in staging.
- Dedup logic deterministic, tested with ties.
- MERGE statements safe to rerun.
- Aggregates recompute only affected dates.
- Watermark agreed with stakeholders (e.g., 7 days); the lateness-profile query after this list can inform the choice.
- Audit columns (inserted_at, updated_at) present.
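To ground the watermark in observed behavior, profile how late rows actually land. A sketch assuming event_time and load_ts exist in staging_events; percentile and date-difference syntax varies by warehouse:
-- How late do rows arrive, in days? (worst case and 99th percentile)
SELECT
  MAX(CAST(load_ts AS DATE) - CAST(event_time AS DATE)) AS max_lateness_days,
  PERCENTILE_CONT(0.99) WITHIN GROUP (
    ORDER BY CAST(load_ts AS DATE) - CAST(event_time AS DATE)
  ) AS p99_lateness_days
FROM staging_events;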
Exercises
You can practice directly below. The quick test at the end will check your understanding. Everyone can take it; if you log in, we'll save your progress.
Exercise ex1 — Deduplicate and upsert late orders
Tables:
- events_raw(order_id, status, amount, event_time, load_ts)
- fact_orders(order_id, status, amount, event_time, updated_at)
- agg_daily_revenue(dt, revenue)
Task:
- Write a CTE that picks one row per order_id using event_time DESC, then load_ts DESC.
- MERGE the deduped results into fact_orders so late rows update existing orders.
- Rebuild agg_daily_revenue for affected days within the last 7 days.
Sample input rows
events_raw
1, 'created', 50.00, '2025-03-01 09:00', '2025-03-01 09:01'
1, 'paid', 50.00, '2025-03-01 09:05', '2025-03-01 10:00' -- ingested late; latest event_time for order 1
2, 'paid', 12.00, '2025-02-27 12:00', '2025-03-03 08:00' -- very late (4 days after event_time)
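If your dedup, MERGE, and rebuild are correct (and assuming both dates fall inside the 7-day watermark on the run date), you should end up with:
fact_orders
1, 'paid', 50.00, '2025-03-01 09:05' -- latest event_time for order 1 wins
2, 'paid', 12.00, '2025-02-27 12:00'
agg_daily_revenue
2025-03-01, 50.00
2025-02-27, 12.00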
Common mistakes and self-check
- Mistake: Aggregating by load_ts instead of event_time. Self-check: Does yesterday's revenue change when a late payment arrives? It should.
- Mistake: Not handling duplicates. Self-check: For a known order_id, count rows after dedup; should be 1.
- Mistake: Recomputing the whole history daily. Self-check: Confirm your aggregate query filters to a watermark window.
- Mistake: Using non-deterministic tie-breakers. Self-check: Ties must break on load_ts or version, never randomly.
- Mistake: Forgetting to update dimension validity ranges. Self-check: No overlapping SCD Type 2 ranges for the same key.
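A concrete version of the last self-check: with inclusive valid_from/valid_to dates (as in Example 2), overlapping SCD Type 2 ranges can be flagged with a window function. A minimal sketch against dim_customer:
WITH ordered AS (
  SELECT
    customer_id,
    valid_from,
    valid_to,
    LEAD(valid_from) OVER (
      PARTITION BY customer_id ORDER BY valid_from
    ) AS next_valid_from
  FROM dim_customer
)
SELECT *
FROM ordered
WHERE next_valid_from IS NOT NULL
  AND next_valid_from <= valid_to;  -- overlap: the next row starts before this one ends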
Practical projects
- Build a late-safe orders mart: raw -> deduped staging -> fact_orders (MERGE) -> agg_daily_revenue with a 7-day watermark.
- Create a funnel table where each user-session pair keeps only the latest event sequence; verify conversion rates stay stable after late arrivals.
- Implement an adjustments table for events older than 30 days and a monthly job that reconciles historical aggregates.
Who this is for, prerequisites, and learning path
Who this is for
- Analytics Engineers who own ELT models and BI extracts.
- Data Analysts maintaining metric quality with evolving data.
Prerequisites
- Comfort with SQL joins, window functions, and aggregates.
- Basic understanding of MERGE/UPSERT patterns.
Learning path
- Master deduplication with window functions.
- Practice idempotent MERGE for facts and SCD Type 2 for dimensions.
- Design aggregate rebuilds using watermarks and partition repair.
- Add audits and monitoring for late-arrival impact.
Next steps
- Complete the exercise and verify outputs.
- Take the quick test to confirm your understanding.
- Apply the patterns to your current pipeline with a small 7–10 day watermark.
Mini challenge
Your web analytics events can arrive up to 5 days late. Draft SQL that rebuilds only impacted dates for sessions and pageviews while ensuring each session_id is unique. Include: a dedup CTE, a MERGE into the canonical events table, and an aggregate refresh for recent days.
Quick Test
Take the short quiz to check your understanding. The quick test is available to everyone; if you log in, we’ll save your progress.