Why this matters
Duplicate records and sudden spikes in nulls quietly break KPIs, dashboards, and machine learning features. As an Analytics Engineer, you are expected to ship reliable data models; these validations help you catch issues early, before stakeholders notice.
- Real task: Ensure fact tables have unique business keys before they feed revenue reports.
- Real task: Alert when email or user_id fields suddenly become null due to upstream schema changes.
- Real task: Deduplicate event streams where replays or late-arriving data cause double counts.
Who this is for
- Analytics Engineers and BI Developers maintaining production data models.
- Data Analysts who own critical dashboards and need trustworthy numbers.
- Data Engineers adding guardrails to ingestion layers.
Prerequisites
- Comfort with SQL (GROUP BY, window functions, CASE).
- Basic understanding of dimensional models (facts, dimensions, keys).
- Familiarity with daily batch or streaming ingestion patterns.
Concept explained simply
Duplicates: when the same "thing" appears more than once. You define what a "thing" is with a key (order_id, or user_id + event_time, etc.). Null spikes: when important fields become null much more often than usual.
Mental model
- Define: What is the unique key? What columns must be non-null?
- Measure: Count duplicates; compute null rate by time period or partition.
- Compare: Compare today to a baseline (yesterday, last 7 days avg, rolling window).
- Decide: Use thresholds to avoid false alarms.
- Act: Fail builds or alert, and route to a runbook.
Typical thresholds that work in practice
- Duplicates: ANY duplicates in a primary key should fail the model.
- Null spikes: Flag if null_rate > max(5%, 2x last 7-day average). Adjust for seasonality if needed.
What to validate (checklist)
- Primary/compound key uniqueness for facts and slowly changing dimensions.
- No duplicate natural keys per partition (e.g., per dt for append-only logs).
- Critical columns non-null: user_id, order_id, event_timestamp, price, email, etc.
- Null-rate stability over time for critical columns.
- No sudden drops in total row counts, and no joins that silently inflate null rates (a row-count drift check is sketched below).
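The last item can use the same rolling-baseline idea as the null-spike check. A minimal sketch, assuming a raw.events table partitioned by dt and a 50% drop threshold (both are assumptions to adapt to your data):
-- Row-count drift check: flag days whose volume is far below the recent average
WITH daily AS (
  SELECT dt, COUNT(*) AS total_rows
  FROM raw.events
  GROUP BY dt
),
with_baseline AS (
  SELECT
    dt,
    total_rows,
    AVG(total_rows) OVER (
      ORDER BY dt
      ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS avg_prev_7
  FROM daily
)
SELECT *
FROM with_baseline
WHERE total_rows < 0.5 * avg_prev_7;  -- 0.5 = "less than half the usual volume"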
Worked examples
Example 1: Primary key duplicates in orders
Goal: Ensure order_id is unique in analytics.orders.
SQL pattern
-- Duplicate order IDs (should return zero rows)
SELECT order_id, COUNT(*) AS duplicate_count
FROM analytics.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
Alternative (window function):
SELECT *
FROM (
  SELECT o.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM analytics.orders o
) t
WHERE t.rn > 1;
Action: If any rows return, fail the job and investigate source dedupe logic.
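One way to make that action automatable, as a sketch: wrap the duplicate query so it returns a single pass/fail row that your scheduler or test framework can assert on (the status wording is an assumption, not a standard):
-- Single-row result: fail the build when duplicate_keys > 0
SELECT
  COUNT(*) AS duplicate_keys,
  CASE WHEN COUNT(*) = 0 THEN 'pass' ELSE 'fail' END AS status
FROM (
  SELECT order_id
  FROM analytics.orders
  GROUP BY order_id
  HAVING COUNT(*) > 1
) dup;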
Example 2: Partition-level duplicates for event logs
Goal: In an append-only table, (user_id, event_time, event_type) should be unique per day (per dt partition).
SQL pattern
SELECT dt, user_id, event_time, event_type, COUNT(*) AS c
FROM raw.events
GROUP BY dt, user_id, event_time, event_type
HAVING COUNT(*) > 1;
Action: If duplicates are expected from replays, keep latest by ingestion_time and drop older rows.
Keep-latest pattern
SELECT * EXCEPT(rn)
FROM (
  SELECT
    e.*,
    ROW_NUMBER() OVER (
      PARTITION BY dt, user_id, event_time, event_type
      ORDER BY ingestion_time DESC
    ) AS rn
  FROM raw.events e
) x
WHERE rn = 1;
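If your warehouse supports a QUALIFY clause (BigQuery, Snowflake, and DuckDB do; treat this as an assumption about your platform), the same keep-latest logic can be written without the outer subquery:
-- Keep-latest with QUALIFY; equivalent to the pattern above
SELECT *
FROM raw.events
WHERE TRUE  -- BigQuery requires a WHERE, GROUP BY, or HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY dt, user_id, event_time, event_type
  ORDER BY ingestion_time DESC
) = 1;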
Example 3: Detect a null spike in email
Goal: Flag days where email null-rate is unusually high.
SQL pattern (rolling baseline)
WITH daily AS (
  SELECT
    dt,
    COUNT(*) AS total_rows,
    SUM(CASE WHEN email IS NULL OR email = '' THEN 1 ELSE 0 END) AS null_rows,
    SAFE_DIVIDE(
      SUM(CASE WHEN email IS NULL OR email = '' THEN 1 ELSE 0 END),
      COUNT(*)
    ) AS null_rate
  FROM analytics.users_snapshot
  GROUP BY dt
),
with_baseline AS (
  SELECT
    dt,
    null_rate,
    AVG(null_rate) OVER (
      ORDER BY dt
      ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS avg_prev_7
  FROM daily
)
SELECT *
FROM with_baseline
WHERE null_rate > GREATEST(0.05, 2 * COALESCE(avg_prev_7, 0));
Note: SAFE_DIVIDE is BigQuery-specific; on other warehouses use CASE WHEN total_rows = 0 THEN 0 ELSE null_rows * 1.0 / total_rows END (or your engine's null-safe division function).
How to implement in your pipeline
- List critical keys and columns. For each model, write down business keys and required non-null columns.
- Write SQL checks. One query per check. Keep them small and explicit.
- Choose thresholds. Start strict for duplicates; start conservative for null spikes (5% or 2x baseline).
- Schedule and alert. Run checks after loads. If a check fails, alert and stop downstream models that depend on it (one way to consolidate check results is sketched after this list).
- Document runbooks. For each failed check, include who to contact and common fixes.
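A minimal sketch of consolidated check results, assuming the tables from the worked examples and a scheduler that can alert whenever failed_rows > 0:
-- Each check contributes one row; alert on any failed_rows > 0
SELECT 'orders_pk_unique' AS check_name, COUNT(*) AS failed_rows
FROM (
  SELECT order_id
  FROM analytics.orders
  GROUP BY order_id
  HAVING COUNT(*) > 1
) dup_orders
UNION ALL
SELECT 'users_email_not_null' AS check_name, COUNT(*) AS failed_rows
FROM analytics.users_snapshot
WHERE dt = CURRENT_DATE()  -- assumes dt is a DATE; adjust to your partition filter
  AND (email IS NULL OR email = '');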
Runbook template
- Symptom: Which check failed?
- Impact: Which dashboards/models rely on this table?
- Likely causes: Source schema change, upstream filter, replays.
- Immediate mitigation: Re-run ingestion, apply dedupe, or hotfix mapping.
- Root cause fix: Update mapping/transform, add permanent dedupe rule.
Exercises
Do these in your warehouse or a SQL scratchpad. See the Exercises section below for full instructions and solutions.
- Exercise 1: Detect duplicate order_id rows and output duplicate_count per order_id.
- Exercise 2: Compute daily null-rate for email and flag days where rate exceeds both 5% and 2x the average of the previous 7 days.
Self-check before you run
- Did you define the correct key(s) for uniqueness?
- Are you grouping by the right partition (e.g., dt) when needed?
- Do you guard against divide-by-zero when computing rates?
- Is your baseline window large enough (7–14 days) but not too large to miss recent changes?
Common mistakes and how to self-check
- Checking the wrong key. Self-check: Confirm the business key with domain experts or documentation.
- Ignoring partitions. Self-check: For append-only logs, validate uniqueness per partition (dt) not whole-table.
- Overreacting to small data days. Self-check: Require a minimum row threshold before flagging (e.g., total_rows > 1000); see the sketch after this list.
- Using fixed thresholds only. Self-check: Compare against a rolling baseline to adapt to seasonality.
- Silencing alerts. Self-check: Track alert volume and fix root causes rather than raising thresholds indefinitely.
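A minimal sketch of the minimum-row guard, as a variant of the Example 3 query (the 1000-row cutoff is an assumption; tune it to your volumes). Here the null rate is computed as an average over a 0/1 flag, which avoids division entirely:
-- Null-spike check that ignores tiny days
WITH daily AS (
  SELECT
    dt,
    COUNT(*) AS total_rows,
    AVG(CASE WHEN email IS NULL OR email = '' THEN 1.0 ELSE 0.0 END) AS null_rate
  FROM analytics.users_snapshot
  GROUP BY dt
),
with_baseline AS (
  SELECT
    dt,
    total_rows,
    null_rate,
    AVG(null_rate) OVER (
      ORDER BY dt
      ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS avg_prev_7
  FROM daily
)
SELECT *
FROM with_baseline
WHERE total_rows > 1000  -- skip low-volume days that produce noisy rates
  AND null_rate > GREATEST(0.05, 2 * COALESCE(avg_prev_7, 0));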
Practical projects
- Project A: Add duplicate-key checks to your top 3 fact tables. Make the job fail on any duplicate.
- Project B: Build a daily null-rate dashboard for 5 critical columns and add alert thresholds.
- Project C: Create a deduped event stream using row_number() with keep-latest logic, and compare metrics before/after.
Learning path
- Start here: Duplicate and null-spike checks on one critical model.
- Next: Add referential integrity checks (foreign keys) and row count drift checks (an orphan-key check is sketched after this list).
- Then: Add anomaly detection (rolling averages, z-scores) and incident runbooks.
- Finally: Automate quality gates in CI and prod schedules with alerts.
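For the referential integrity step, a minimal orphan-key sketch; analytics.customers and the customer_id column are assumptions about your model:
-- Orders whose customer_id has no matching customer row (should return zero rows)
SELECT o.customer_id, COUNT(*) AS orphan_rows
FROM analytics.orders o
LEFT JOIN analytics.customers c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
GROUP BY o.customer_id;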
Mini challenge
Your events table occasionally replays the last hour of data. Design a dedupe rule that keeps only the latest record per (user_id, event_time, event_type). Add a validation that fails if >0.5% of rows are dropped by dedupe (indicating a larger problem).
Hint
Use ROW_NUMBER() over the compound key and compare input vs. output row counts to compute the drop rate.
Next steps
- Apply these checks to two more models.
- Tune thresholds after 1–2 weeks of observations.
- Write a one-page runbook per check, including likely causes and owners.
About saving your progress
The quick test below is available to everyone. If you are logged in, your progress is saved automatically.