
Handling Data Latency And Cutoffs

Learn how to handle data latency and cutoffs for free, with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Who this is for

  • ETL/ELT developers defining load windows and data freshness guarantees.
  • Analysts or data product owners who need reliable cutoffs for reports and dashboards.
  • Engineers building batch or streaming jobs with late-arriving data.

Prerequisites

  • Basic SQL (filters, window functions, upserts/merge).
  • Familiarity with batch schedules and/or micro-batch streaming.
  • Understanding of source data timestamps and partitions.

Learning path

  1. Learn key concepts: latency, cutoff, watermark, grace period.
  2. Apply patterns: time/partition/sequence cutoffs and late-data strategies.
  3. Work through the 3 examples below and implement the exercises.
  4. Add monitoring and reconciliation checks.
  5. Practice with the Quick Test to confirm understanding.

Why this matters

Real-world tasks you will face:

  • Defining the exact time a daily report “closes” so numbers don’t change unexpectedly.
  • Choosing a safe delay (watermark) so streaming dashboards show near-real-time but stable metrics.
  • Handling files that arrive late without double-counting or missing records.
  • Re-running jobs safely after source outages.

Concept explained simply

  • Data latency: how long it takes for data to show up where you need it (e.g., from event creation to warehouse availability).
  • Cutoff: a boundary that says “this run includes data up to X, anything later goes to the next run.”
  • Watermark: a moving bound on event_time up to which you trust the data to be complete. Often set to now minus a safety lag (e.g., 45 minutes).
  • Grace period: time you allow late data to be accepted into a closed window via upserts without breaking users’ expectations.
  • SLA/SLO: the freshness and delivery times you commit to (e.g., daily by 03:00 UTC, or dashboard ≤15 minutes stale).
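
To make these numbers concrete, latency can be measured per record as the gap between the business timestamp and the time the row landed in the warehouse. A minimal sketch, assuming a raw_orders table with order_id, event_time, and ingest_time columns:

-- Observed latency per record over the last day
select order_id,
       event_time,
       ingest_time,
       ingest_time - event_time as observed_latency
from raw_orders
where ingest_time >= current_timestamp - interval '1 day';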

Mental model

Imagine a conveyor belt of events. You place a gate (cutoff) that lets items through until a certain time or sequence. You move the gate forward each run. A watermark is how far back you look to make sure most items have arrived. Some stragglers still come; you handle them with upserts or a small correction job.

Core strategies

  • Time-based cutoff: include records with event_time ≤ cutoff_time.
  • Partition-based cutoff: include partitions up to a specific date/hour (e.g., all partitions ≤ 2026-01-10/23); see the sketch after this list.
  • Sequence/offset cutoff: include CDC changes up to LSN/OFFSET N; next run resumes from N+1.
  • Watermarking: choose a safety lag from processing time to event time (e.g., now − 60m) based on lateness percentiles and SLA.
  • Late data handling:
    • Upsert into fact tables within a grace period.
    • Quarantine extremely late or corrupt records for manual review.
    • Scheduled backfills for known incidents.
  • Idempotency: design loads so re-runs don’t duplicate data (use natural keys, dedupe windows, or merge semantics).
  • Reconciliation: row counts, hash totals, sums by key, and “completeness” vs expected manifests.
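
For the partition-based variant, the cutoff is simply a predicate on the partition columns. A minimal sketch, assuming a raw_events source partitioned by event_date and event_hour and a matching fact_events target:

-- Partition-based cutoff: load all partitions up to and including 2026-01-10 hour 23
insert into fact_events
select *
from raw_events
where event_date < date '2026-01-10'
   or (event_date = date '2026-01-10' and event_hour <= 23);
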
How to pick a cutoff and watermark (quick guide)
  • Get lateness distribution (P95, P99, max) for the last 2–4 weeks; a query sketch follows this list.
  • Map business freshness need: is correctness or speed more important?
  • Pick watermark lag to cover at least P99 if accuracy is critical; P95 if speed matters and you accept minor corrections.
  • Define grace period: how long you allow late updates to change metrics.
  • Document rerun rules and idempotency safeguards.
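
For the first step, the lateness distribution can be pulled straight from the landing table. A minimal sketch, assuming raw_orders carries both event_time and ingest_time (percentile syntax varies by engine; PostgreSQL's percentile_cont is shown):

-- Lateness distribution over the last 28 days
select percentile_cont(0.95) within group (order by ingest_time - event_time) as p95_lateness,
       percentile_cont(0.99) within group (order by ingest_time - event_time) as p99_lateness,
       max(ingest_time - event_time) as max_lateness
from raw_orders
where ingest_time >= current_timestamp - interval '28 days';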

Worked examples

Example 1 — Daily sales cutoff (batch)

Scenario: POS systems upload sales throughout the night. Historically, all files arrive by 02:20 UTC. SLA: publish daily report by 03:00 UTC.

  • Cutoff: 02:30 UTC (10-minute buffer beyond observed latest arrival).
  • Schedule: load starts 02:35; publish by 03:00.
  • Late arrivals: sales with sale_time ≤ 23:59:59 yesterday whose files arrive after 02:30 are upserted within a 24-hour grace window.
SQL sketch
-- Load fact rows for day D (yesterday), honoring the ingest cutoff
insert into fact_sales (sale_id, sale_time, amount, store_id)
select s.sale_id, s.sale_time, s.amount, s.store_id
from staging_sales s
where date(s.sale_time) = date_add('day', -1, current_date)
  and s.ingest_time <= :cutoff_ts; -- run parameter set to 02:30 UTC on the run date

-- Late corrections within the 24-hour grace window
merge into fact_sales f using staging_sales_late l
  on f.sale_id = l.sale_id
when matched then update set amount = l.amount
when not matched then insert (sale_id, sale_time, amount, store_id)
values (l.sale_id, l.sale_time, l.amount, l.store_id);
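
Note that re-running the insert above would double-count day D. A minimal way to keep the re-run safe, assuming the fact table can be rebuilt one day at a time, is to clear the day in the same transaction before inserting (a MERGE keyed on sale_id works as well):

-- Make the daily load safe to re-run: clear day D before re-inserting it
delete from fact_sales
where date(sale_time) = date_add('day', -1, current_date);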

Example 2 — Streaming dashboard with watermark

Scenario: order events stream in with variable delay. P95 lateness = 12 minutes; P99 = 55 minutes. Dashboard needs stable hourly KPIs with ≥99% completeness.

  • Watermark: now − 60 minutes (covers P99 with small buffer).
  • Aggregation window: compute metrics for last hour using event_time ≤ watermark.
  • Late arrivals: upsert corrections in the next micro-batch; track delta corrections so users know changes are small.
SQL sketch
with params as (
  select (current_timestamp - interval '60 minutes') as watermark  -- safety lag sized to cover P99 lateness
), filtered as (
  select * from raw_orders r
  where r.event_time <= (select watermark from params)  -- only include events old enough to be trusted
)
select date_trunc('hour', event_time) as hour_bucket,
       count(*) as orders,
       sum(amount) as revenue
from filtered
where event_time > (current_timestamp - interval '2 hours')  -- restrict the scan to the hours being refreshed
group by 1;
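
To make the "track delta corrections" point concrete, one option is to log how much each published bucket changed when late events arrived. A minimal sketch, where kpi_hourly (previously published aggregates), kpi_corrections (the log), and new_kpis (the query above saved as a view or CTE) are all assumed names:

-- Log how much each hour changed after late events were applied
insert into kpi_corrections (hour_bucket, orders_delta, revenue_delta, corrected_at)
select n.hour_bucket,
       n.orders  - coalesce(k.orders, 0)  as orders_delta,
       n.revenue - coalesce(k.revenue, 0) as revenue_delta,
       current_timestamp
from new_kpis n
left join kpi_hourly k on k.hour_bucket = n.hour_bucket
where n.orders <> coalesce(k.orders, 0)
   or n.revenue <> coalesce(k.revenue, 0);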

Example 3 — CDC cutoff by sequence

Scenario: you ingest customer updates via CDC with a monotonically increasing LSN/OFFSET.

  • Cutoff: read changes up to max_lsn_seen in this run; store it. Next run starts after it.
  • Idempotency: MERGE by business key; stage by run_id to avoid partial commits.
SQL sketch
-- Stage increment up to cutoff_lsn
insert into stage_customer_changes
select * from cdc_source
where lsn > :last_committed_lsn and lsn <= :cutoff_lsn;

-- Apply idempotent upsert (illustrative columns; substitute your attribute list)
-- Assumes one staged row per customer_id; dedupe to the latest lsn first if not
merge into dim_customer d using stage_customer_changes s
  on d.customer_id = s.customer_id
when matched then update set name = s.name, email = s.email, updated_at = s.updated_at
when not matched then insert (customer_id, name, email, updated_at)
values (s.customer_id, s.name, s.email, s.updated_at);

-- Persist new last_committed_lsn = :cutoff_lsn
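
The last step can be implemented with a small control table that each run reads at start and updates at the end. A minimal sketch, assuming a table etl_watermarks(pipeline_name, last_committed_lsn, updated_at):

-- Persist the new boundary so the next run resumes after it
update etl_watermarks
set last_committed_lsn = :cutoff_lsn,
    updated_at = current_timestamp
where pipeline_name = 'dim_customer_cdc';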

Practical checklist

  • Have you measured lateness percentiles (P95, P99) for the last 2–4 weeks?
  • Is the cutoff documented in local and UTC times to avoid DST confusion?
  • Is your watermark lag justified by data and SLA?
  • Do you have a grace period and upsert path for late data?
  • Is the load idempotent (merge/dedupe) and safe to re-run?
  • Do you log the exact boundaries used: run_id, cutoff_time, watermark, or offset?
  • Are reconciliation checks in place (counts, sums, expected files/partitions)?
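
For the reconciliation item, a minimal sketch comparing staging and fact totals for one load date (table and column names follow Example 1):

-- Row counts and amount sums should match between staging and fact for day D
select 'staging' as side, count(*) as row_count, sum(amount) as total_amount
from staging_sales
where date(sale_time) = date '2026-01-10'
union all
select 'fact' as side, count(*) as row_count, sum(amount) as total_amount
from fact_sales
where date(sale_time) = date '2026-01-10';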

Exercises

Exercise 1 (ex1) — Choose a watermark and write a filter

Scenario: order events arrive with P95 lateness of 12 minutes, P99 of 55 minutes, and rare outliers up to 2 hours. Requirement: hourly KPIs should be ≥99% complete and stable. Choose a watermark lag and write SQL that filters events using event_time ≤ watermark.

  • Deliverables: watermark lag (minutes) and a SELECT with the filter.
Need a hint?
  • Cover at least P99; add a small buffer.
  • Compute watermark as now minus lag.

Exercise 2 (ex2) — Define a daily batch cutoff and rerun rules

Scenario: A vendor drops CSV files between 00:45–02:15 UTC; historically all by 02:20. SLA: “Publish daily by 03:00.” Design: cutoff time, job start, late-data policy, rerun window, and two reconciliation checks.

  • Deliverables: list the times and policies; include how you prevent double loads.
Need a hint?
  • Cutoff should exceed observed latest arrival.
  • Use a manifest or checksum to avoid re-loading the same file.

Common mistakes and self-check

  • Mistake: Using processing_time instead of event_time for cutoffs in event data. Fix: Always filter by event_time (or business timestamp) when correctness matters.
  • Mistake: No buffer beyond observed latest arrival. Fix: Add 5–20 minutes buffer; revisit monthly.
  • Mistake: No idempotency. Fix: Use MERGE, natural keys, or dedupe windows (see the sketch after this list); store run boundaries.
  • Mistake: Silent late data. Fix: Track late-arrival counts, P95/P99 lateness, and alert on spikes.
  • Mistake: DST/timezone confusion. Fix: Standardize to UTC and document local interpretations.
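
For the idempotency fix, a dedupe window can be expressed with a window function. A minimal sketch, assuming staging rows carry an ingest_time alongside the natural key sale_id:

-- Keep only the latest staged version of each sale_id before loading
select sale_id, sale_time, amount, store_id
from (
  select s.*,
         row_number() over (partition by sale_id order by ingest_time desc) as rn
  from staging_sales s
) ranked
where rn = 1;
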
Self-check prompts
  • Can you state the exact cutoff and watermark your job used last run?
  • If you re-run the job, will counts and sums remain consistent?
  • How do you detect and handle a file arriving after cutoff?

Practical projects

  • Build a daily ETL with a 02:30 UTC cutoff, MERGE-based late correction, and a reconciliation table logging counts and sums.
  • Create a micro-batch pipeline that uses a 60-minute watermark and produces hourly aggregates with late-arrival corrections.
  • Implement a backfill script that reprocesses a date range with fixed cutoffs and validates before/after counts.

Quick Test and progress

Take the Quick Test below to check your understanding. Available to everyone. If you are logged in, your progress will be saved automatically.

Next steps

  • Add automated alerts for lateness P95/P99 and missing partitions.
  • Extend logic to slowly changing dimensions and snapshot facts.
  • Document rerun and backfill procedures with example commands and expected outcomes.

Mini challenge

Marketing ad clicks arrive within 5 minutes P95, but conversions (server-side) lag 45 minutes P95. You need a near-real-time ROAS dashboard updated every 10 minutes and a stable daily report. Propose: a dashboard watermark lag, a daily cutoff, grace periods, and how late conversions update ROAS without double-counting. Write your plan in 5–7 bullet points.

Practice Exercises

2 exercises to complete

Instructions

Scenario: order events arrive with P95 lateness of 12 minutes, P99 of 55 minutes, and rare outliers up to 2 hours. Requirement: hourly KPIs should be ≥99% complete and stable.

  • Pick a watermark lag (in minutes) that meets the requirement with a small buffer.
  • Write a SQL snippet that filters events using event_time ≤ watermark.
Expected Output
A chosen lag (e.g., 60 minutes) and a SELECT/WHERE clause that filters by event_time <= current_timestamp - interval 'X minutes'.

Handling Data Latency And Cutoffs — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

