Why this matters
Duplicate records and sudden spikes in nulls quietly break KPIs, dashboards, and machine learning features. As an Analytics Engineer, you are expected to ship reliable data models; these validations help you catch issues early, before stakeholders notice.
- Real task: Ensure fact tables have unique business keys before they feed revenue reports.
- Real task: Alert when email or user_id fields suddenly become null due to upstream schema changes.
- Real task: Deduplicate event streams where replays or late-arriving data cause double counts.
Who this is for
- Analytics Engineers and BI Developers maintaining production data models.
- Data Analysts who own critical dashboards and need trustworthy numbers.
- Data Engineers adding guardrails to ingestion layers.
Prerequisites
- Comfort with SQL (GROUP BY, window functions, CASE).
- Basic understanding of dimensional models (facts, dimensions, keys).
- Familiarity with daily batch or streaming ingestion patterns.
Concept explained simply
Duplicates: when the same "thing" appears more than once. You define what a "thing" is with a key (order_id, or user_id + event_time, etc.). Null spikes: when important fields become null much more often than usual.
Mental model
- Define: What is the unique key? What columns must be non-null?
- Measure: Count duplicates; compute null rate by time period or partition.
- Compare: Compare today to a baseline (yesterday, last 7 days avg, rolling window).
- Decide: Use thresholds to avoid false alarms.
- Act: Fail builds or alert, and route to a runbook.
Typical thresholds that work in practice
- Duplicates: ANY duplicates in a primary key should fail the model.
- Null spikes: Flag if null_rate > max(5%, 2x last 7-day average). Adjust for seasonality if needed.
What to validate (checklist)
- Primary/compound key uniqueness for facts and slowly changing dimensions.
- No duplicate natural keys per partition (e.g., per dt for append-only logs).
- Critical columns non-null: user_id, order_id, event_timestamp, price, email, etc.
- Null-rate stability over time for critical columns.
- No sudden drops in total row counts, and no joins that silently inflate null rates (a row-count drift check is sketched below).
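The last item can use the same rolling-baseline idea as the null-spike check. A minimal sketch, assuming a raw.events table partitioned by dt and a 50% drop threshold (both are assumptions to adapt to your data):
-- Row-count drift check: flag days whose volume is far below the recent average
WITH daily AS (
  SELECT dt, COUNT(*) AS total_rows
  FROM raw.events
  GROUP BY dt
),
with_baseline AS (
  SELECT
    dt,
    total_rows,
    AVG(total_rows) OVER (
      ORDER BY dt
      ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS avg_prev_7
  FROM daily
)
SELECT *
FROM with_baseline
WHERE total_rows < 0.5 * avg_prev_7;  -- 0.5 = "less than half the usual volume"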
Worked examples
Example 1: Primary key duplicates in orders
Goal: Ensure order_id is unique in analytics.orders.
SQL pattern
-- Duplicate order IDs (should return zero rows)
SELECT order_id, COUNT(*) AS duplicate_count
FROM analytics.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
Alternative (window function):
SELECT *
FROM (
  SELECT o.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM analytics.orders o
) t
WHERE t.rn > 1;
Action: If any rows return, fail the job and investigate source dedupe logic.
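One way to make that action automatable, as a sketch: wrap the duplicate query so it returns a single pass/fail row that your scheduler or test framework can assert on (the status wording is an assumption, not a standard):
-- Single-row result: fail the build when duplicate_keys > 0
SELECT
  COUNT(*) AS duplicate_keys,
  CASE WHEN COUNT(*) = 0 THEN 'pass' ELSE 'fail' END AS status
FROM (
  SELECT order_id
  FROM analytics.orders
  GROUP BY order_id
  HAVING COUNT(*) > 1
) dup;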
Example 2: Partition-level duplicates for event logs
Goal: In an append-only table, (user_id, event_time, event_type) should be unique per day (per dt partition).
SQL pattern
SELECT dt, user_id, event_time, event_type, COUNT(*) AS c
FROM raw.events
GROUP BY dt, user_id, event_time, event_type
HAVING COUNT(*) > 1;
Action: If duplicates are expected from replays, keep latest by ingestion_time and drop older rows.
Keep-latest pattern
SELECT * EXCEPT(rn)
FROM (
  SELECT
    e.*,
    ROW_NUMBER() OVER (
      PARTITION BY dt, user_id, event_time, event_type
      ORDER BY ingestion_time DESC
    ) AS rn
  FROM raw.events e
) x
WHERE rn = 1;
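If your warehouse supports a QUALIFY clause (BigQuery, Snowflake, and DuckDB do; treat this as an assumption about your platform), the same keep-latest logic can be written without the outer subquery:
-- Keep-latest with QUALIFY; equivalent to the pattern above
SELECT *
FROM raw.events
WHERE TRUE  -- BigQuery requires a WHERE, GROUP BY, or HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY dt, user_id, event_time, event_type
  ORDER BY ingestion_time DESC
) = 1;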
Example 3: Detect a null spike in email
Goal: Flag days where email null-rate is unusually high.
SQL pattern (rolling baseline)
WITH daily AS (
  SELECT
    dt,
    COUNT(*) AS total_rows,
    SUM(CASE WHEN email IS NULL OR email = '' THEN 1 ELSE 0 END) AS null_rows,
    SAFE_DIVIDE(
      SUM(CASE WHEN email IS NULL OR email = '' THEN 1 ELSE 0 END),
      COUNT(*)
    ) AS null_rate
  FROM analytics.users_snapshot
  GROUP BY dt
),
with_baseline AS (
  SELECT
    dt,
    null_rate,
    AVG(null_rate) OVER (
      ORDER BY dt
      ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS avg_prev_7
  FROM daily
)
SELECT *
FROM with_baseline
WHERE null_rate > GREATEST(0.05, 2 * COALESCE(avg_prev_7, 0));
Note: SAFE_DIVIDE is BigQuery-specific; on other warehouses use CASE WHEN total_rows = 0 THEN 0 ELSE null_rows * 1.0 / total_rows END (or your engine's null-safe division function).
How to implement in your pipeline
- List critical keys and columns. For each model, write down business keys and required non-null columns.
- Write SQL checks. One query per check. Keep them small and explicit.
- Choose thresholds. Start strict for duplicates; start conservative for null spikes (5% or 2x baseline).
- Schedule and alert. Run checks after loads. If a check fails, alert and stop downstream models that depend on it (one way to consolidate check results is sketched after this list).
- Document runbooks. For each failed check, include who to contact and common fixes.
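A minimal sketch of consolidated check results, assuming the tables from the worked examples and a scheduler that can alert whenever failed_rows > 0:
-- Each check contributes one row; alert on any failed_rows > 0
SELECT 'orders_pk_unique' AS check_name, COUNT(*) AS failed_rows
FROM (
  SELECT order_id
  FROM analytics.orders
  GROUP BY order_id
  HAVING COUNT(*) > 1
) dup_orders
UNION ALL
SELECT 'users_email_not_null' AS check_name, COUNT(*) AS failed_rows
FROM analytics.users_snapshot
WHERE dt = CURRENT_DATE()  -- assumes dt is a DATE; adjust to your partition filter
  AND (email IS NULL OR email = '');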
Runbook template
- Symptom: Which check failed?
- Impact: Which dashboards/models rely on this table?
- Likely causes: Source schema change, upstream filter, replays.
- Immediate mitigation: Re-run ingestion, apply dedupe, or hotfix mapping.
- Root cause fix: Update mapping/transform, add permanent dedupe rule.
Exercises
Do these in your warehouse or a SQL scratchpad. See the Exercises section below for full instructions and solutions.
- Exercise 1: Detect duplicate order_id rows and output duplicate_count per order_id.
- Exercise 2: Compute daily null-rate for email and flag days where rate exceeds both 5% and 2x the average of the previous 7 days.
Self-check before you run
- Did you define the correct key(s) for uniqueness?
- Are you grouping by the right partition (e.g., dt) when needed?
- Do you guard against divide-by-zero when computing rates?
- Is your baseline window large enough (7–14 days) but not too large to miss recent changes?
Common mistakes and how to self-check
- Checking the wrong key. Self-check: Confirm the business key with domain experts or documentation.
- Ignoring partitions. Self-check: For append-only logs, validate uniqueness per partition (dt) not whole-table.
- Overreacting to small data days. Self-check: Require a minimum row threshold before flagging (e.g., total_rows > 1000); see the sketch after this list.
- Using fixed thresholds only. Self-check: Compare against a rolling baseline to adapt to seasonality.
- Silencing alerts. Self-check: Track alert volume and fix root causes rather than raising thresholds indefinitely.
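A minimal sketch of the minimum-row guard, as a variant of the Example 3 query (the 1000-row cutoff is an assumption; tune it to your volumes). Here the null rate is computed as an average over a 0/1 flag, which avoids division entirely:
-- Null-spike check that ignores tiny days
WITH daily AS (
  SELECT
    dt,
    COUNT(*) AS total_rows,
    AVG(CASE WHEN email IS NULL OR email = '' THEN 1.0 ELSE 0.0 END) AS null_rate
  FROM analytics.users_snapshot
  GROUP BY dt
),
with_baseline AS (
  SELECT
    dt,
    total_rows,
    null_rate,
    AVG(null_rate) OVER (
      ORDER BY dt
      ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS avg_prev_7
  FROM daily
)
SELECT *
FROM with_baseline
WHERE total_rows > 1000  -- skip low-volume days that produce noisy rates
  AND null_rate > GREATEST(0.05, 2 * COALESCE(avg_prev_7, 0));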
Practical projects
- Project A: Add duplicate-key checks to your top 3 fact tables. Make the job fail on any duplicate.
- Project B: Build a daily null-rate dashboard for 5 critical columns and add alert thresholds.
- Project C: Create a deduped event stream using row_number() with keep-latest logic, and compare metrics before/after.
Learning path
- Start here: Duplicate and null-spike checks on one critical model.
- Next: Add referential integrity checks (foreign keys) and row count drift checks (an orphan-key check is sketched after this list).
- Then: Add anomaly detection (rolling averages, z-scores) and incident runbooks.
- Finally: Automate quality gates in CI and prod schedules with alerts.
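For the referential integrity step, a minimal orphan-key sketch; analytics.customers and the customer_id column are assumptions about your model:
-- Orders whose customer_id has no matching customer row (should return zero rows)
SELECT o.customer_id, COUNT(*) AS orphan_rows
FROM analytics.orders o
LEFT JOIN analytics.customers c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
GROUP BY o.customer_id;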
Mini challenge
Your events table occasionally replays the last hour of data. Design a dedupe rule that keeps only the latest record per (user_id, event_time, event_type). Add a validation that fails if >0.5% of rows are dropped by dedupe (indicating a larger problem).
Hint
Use ROW_NUMBER() over the compound key and compare input vs. output row counts to compute the drop rate.
Next steps
- Apply these checks to two more models.
- Tune thresholds after 1–2 weeks of observations.
- Write a one-page runbook per check, including likely causes and owners.
About saving your progress
The quick test below is available to everyone. If you are logged in, your progress is saved automatically.