Why this matters
Example 3: Watermark and streaming offsets
Goal: Ensure all events with event_time up to 10:00 have been processed into the lake table events_raw.
-- Watermark completeness
SELECT MAX(event_time) AS max_event_time, COUNT(*) AS cnt
FROM lake.events_raw
WHERE event_time <= '2026-01-18T10:00:00Z';
-- Expected watermark = 10:00; raise alert if max_event_time < 10:00.
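If some lateness is expected, build the allowance into the alert condition so it does not fire on events that are merely late. A minimal Postgres-style variant of the same check; the 5-minute allowance is an illustrative assumption:
-- Alert only when the watermark lags beyond an allowed lateness (5 minutes here, illustrative)
SELECT MAX(event_time) AS max_event_time
FROM lake.events_raw
WHERE event_time <= '2026-01-18T10:00:00Z'
HAVING MAX(event_time) < TIMESTAMP WITH TIME ZONE '2026-01-18T10:00:00Z' - INTERVAL '5 minutes';
-- A returned row means processing is behind the expected watermark by more than the allowance.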
-- Kafka offset coverage (pseudo)
-- Source: topic X partition 0..N committed offsets
-- Target: persisted last_ingested_offset per partition
-- Check: target_offset >= source_committed_offset for all partitions
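Where both sides persist their offsets as tables, the pseudo-check above can be written as a join. A sketch assuming hypothetical tables kafka_committed_offsets(topic, partition_id, committed_offset) and ingestion_state(topic, partition_id, last_ingested_offset):
-- Partitions where ingestion lags the committed source offsets (hypothetical table/column names)
SELECT s.topic, s.partition_id, s.committed_offset, t.last_ingested_offset
FROM kafka_committed_offsets AS s
LEFT JOIN ingestion_state AS t
  ON t.topic = s.topic AND t.partition_id = s.partition_id
WHERE t.last_ingested_offset IS NULL
   OR t.last_ingested_offset < s.committed_offset;
-- Any returned row is a partition that has not been fully ingested for the window.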
Example 4: Balance check for ledger
Goal: Verify debits equal credits per day and currency in the warehouse.
SELECT txn_date, currency,
ROUND(SUM(CASE WHEN side='DEBIT' THEN amount ELSE 0 END), 2) AS debits,
ROUND(SUM(CASE WHEN side='CREDIT' THEN amount ELSE 0 END), 2) AS credits
FROM dwh.ledger
WHERE txn_date = CURRENT_DATE - INTERVAL '1 day'
GROUP BY txn_date, currency
HAVING ABS(SUM(CASE WHEN side='DEBIT' THEN amount ELSE 0 END)
         - SUM(CASE WHEN side='CREDIT' THEN amount ELSE 0 END)) > 0.01; -- rows returned indicate mismatches
Designing reconciliation in your architecture
- Pick invariants per dataset. Start with counts + key sums, then add hashes or domain checks.
- Define windows. Daily for batch; minute/hour for streaming. Include lateness policy.
- Capture controls at source and target. Store them in a reconciliation_log with fields: dataset, window_start, window_end, record_count, numeric_sums, window_hash, watermark, offsets, status, run_id, computed_at (see the sketch after this list).
- Compare and alert. Tolerances for rounding and expected lateness. Escalate only after X consecutive failures.
- Remediate. Support idempotent replay and backfill; track retries and final state.
- Observe over time. Trend counts and sums to spot drift, not just point failures.
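One possible shape for that control log, sketched as generic SQL DDL; column types, the key choice, and the text fields standing in for JSON are assumptions to adapt to your warehouse:
CREATE TABLE reconciliation_log (
  dataset       VARCHAR(200) NOT NULL,   -- logical dataset name, e.g. 'dwh.orders_fact'
  window_start  TIMESTAMP    NOT NULL,   -- reconciliation window, event-time based
  window_end    TIMESTAMP    NOT NULL,
  record_count  BIGINT,                  -- rows observed in the window
  numeric_sums  VARCHAR(4000),           -- e.g. JSON text: {"total_amount": 3487912.57}
  window_hash   VARCHAR(64),             -- stable hash over normalized rows in the window
  watermark     TIMESTAMP,               -- max event_time observed
  offsets       VARCHAR(4000),           -- e.g. JSON text of per-partition offsets
  status        VARCHAR(20)  NOT NULL,   -- PASS / FAIL / PENDING
  run_id        VARCHAR(64)  NOT NULL,   -- idempotency key for safe re-runs
  computed_at   TIMESTAMP    NOT NULL,
  PRIMARY KEY (dataset, window_start, window_end, run_id)
);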
What good looks like
- Automated controls for every critical dataset.
- Fast, cheap checks run on every load.
- Human-readable failure messages with next actions.
- Re-run is safe and deterministic.
Exercises
These exercises match the tasks listed below. You can complete them in any SQL-capable environment or by writing pseudo-SQL.
Exercise 1 (ex1): Control totals
You ingested orders for 2026-01-15. Source landing shows 125,000 rows and total_amount sum of 3,487,912.57. Your warehouse table orders_fact for the same date shows 124,998 rows and total_amount sum of 3,487,912.57.
- Compute differences and decide if the load passes or fails.
- Write the minimal SQL or pseudo-logic to compare and set a status.
- Propose a remediation step if it fails.
Checklist
- Compared counts
- Compared sums with tolerance
- Clear pass/fail rule
- Remediation idea
Exercise 2 (ex2): Hash-based window check
For customer_dim on 2026-01-15, compute a stable window hash using customer_id, normalized name, and status. The source and target hashes differ.
- List two likely causes.
- Write a key-based diff query outline to find mismatched rows.
- Suggest a guard to make the hash stable.
Checklist
- Two plausible causes
- Diff outline by key
- Stability guard noted
Common mistakes and how to self-check
- Relying on counts only. Add at least one numeric sum or hash.
- Ignoring null/whitespace normalization in hashes. Always standardize (see the sketch after this list).
- Full table row-by-row compares on large data. Use windows and partitions.
- No lateness policy. Watermark checks will false-alarm without it.
- Mixing event time and processing time. Reconcile on event time.
- Unbounded tolerances. Define small, explicit thresholds per metric.
- Not checking deletes. Include tombstones or compare key sets.
- One-off reconciliation scripts. Centralize into a reusable control log.
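To illustrate the normalization point above, a stable window hash should standardize values before hashing so that nulls, casing, and stray whitespace cannot change the result. A Postgres-style sketch; customer_dim, updated_date, and the use of MD5/STRING_AGG are illustrative assumptions, not a prescribed implementation:
-- Stable window hash: trim, lowercase, and null-replace each value before hashing
SELECT MD5(STRING_AGG(row_hash, '' ORDER BY customer_id)) AS window_hash
FROM (
    SELECT customer_id,
           MD5(
               CAST(customer_id AS VARCHAR)      || '|' ||
               COALESCE(LOWER(TRIM(name)), '')   || '|' ||
               COALESCE(LOWER(TRIM(status)), '')
           ) AS row_hash
    FROM customer_dim
    WHERE updated_date = DATE '2026-01-15'
) AS per_row;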
Self-check prompts
- Can I explain which invariant catches which failure mode?
- Do I have windowed checks for both batch and streaming?
- Are my checks cheap enough to run every load?
- Can I re-run a failed window safely?
Practical projects
- Build a reconciliation_log table and instrument two critical datasets with counts, sums, and hashes.
- Add a watermark check for one streaming topic and store committed offsets vs ingested offsets.
- Implement a daily balance check for a ledger-like dataset and auto-create a remediation backfill job.
Mini tasks
- Create a stable hash UDF that lowercases strings and replaces nulls.
- Write a query that lists the top 5 days with the largest count deltas (one outline follows this list).
- Design an alert message template with precise next actions.
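One possible outline for the count-delta task; source_daily_counts and target_daily_counts are hypothetical control tables holding per-day row counts for each side:
-- Top 5 days with the largest absolute count deltas (hypothetical control tables)
SELECT s.load_date,
       s.cnt AS source_count,
       t.cnt AS target_count,
       ABS(s.cnt - t.cnt) AS count_delta
FROM source_daily_counts AS s
JOIN target_daily_counts AS t ON t.load_date = s.load_date
ORDER BY count_delta DESC
LIMIT 5;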
Who this is for
- Data Architects defining quality standards.
- Data/Analytics Engineers building pipelines.
- Platform teams offering observability as a service.
Prerequisites
- Intermediate SQL (aggregation, joins, windowing).
- Familiarity with batch/streaming concepts.
- Basic understanding of CDC and partitions.
Learning path
- Start with control totals on 1–2 datasets.
- Add stable window hashes and key-based diffs for critical tables.
- Introduce watermark and offset checks for streaming.
- Centralize into a reconciliation_log with alert rules.
- Automate remediation (safe replays, backfills).
- Review thresholds quarterly with domain owners.
Next steps
- Extend to data contracts and schema evolution checks.
- Add referential coverage checks to star schemas.
- Trend metrics to detect silent drift early.
Mini challenge
You have a daily inventory snapshot (by product_id, warehouse_id) and a real-time stream of adjustments. Pick two reconciliation patterns to guarantee accuracy at end-of-day and explain how you would implement them with a window of the last 24 hours.