Why this matters
As a Product Analyst, your charts, experiments, and business decisions are only as good as the data behind them. Data quality checks help you catch issues early: missing rows after a pipeline change, unexpected spikes of nulls, duplicate events inflating metrics, or broken joins causing orphaned records.
- Daily dashboards: Confirm counts and key metrics haven’t silently drifted.
- Experiment analysis: Ensure assignment and events are complete and deduplicated.
- Revenue reporting: Validate totals, currencies, and outliers before leadership sees them.
Real-world examples you’ll face
- New ETL step halves yesterday’s orders — detect with row-count reconciliation.
- Client-side bug doubles event sends — catch via duplicate-key checks.
- Tracking change introduces null user_id — spot via null-rate thresholds.
Concept explained simply
Data quality checks are short SQL queries that verify assumptions about your data: uniqueness, completeness, validity, and consistency. Think of them as small unit tests for your warehouse tables.
Mental model
- Uniqueness: “Each primary key appears once.”
- Completeness: “We didn’t lose rows vs. source.”
- Validity: “Values are in expected ranges/enums.”
- Consistency: “Foreign keys match parent tables.”
Pro tip
Prefer stable checks (counts, distinct IDs, null rates) over fragile ones that break on every routine data change. Start small and automate the most business-critical checks first.
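One way to keep checks stable is a single per-day summary that tracks counts, distinct IDs, and null rates together. A minimal sketch against the events table used later; event_time is an assumed column name, so adjust it to your schema:
-- Per-day stability summary (sketch; assumes events has event_time and user_id columns)
SELECT
  CAST(event_time AS DATE) AS d,
  COUNT(*) AS total_events,
  COUNT(DISTINCT user_id) AS distinct_users,
  100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) AS null_user_pct
FROM events
GROUP BY 1
ORDER BY 1;
If any of these three numbers moves sharply day over day, you have a lead worth investigating before it reaches a dashboard.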
Worked examples
1) Duplicate primary keys
-- Find duplicate order IDs
SELECT id AS duplicate_id, COUNT(*) AS cnt
FROM orders_raw
GROUP BY id
HAVING COUNT(*) > 1;
Why: Duplicates can inflate metrics. This query lists any IDs appearing more than once.
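Once duplicates are confirmed, a common follow-up is to isolate the extra copies before deciding how to deduplicate. A sketch using ROW_NUMBER(), assuming created_at is a reasonable tie-breaker:
-- List the extra copies, keeping the earliest row per id (sketch; created_at as tie-breaker is an assumption)
SELECT *
FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS rn
  FROM orders_raw o
) t
WHERE rn > 1;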
2) Row-count reconciliation by day
-- Compare daily row counts between the raw and clean tables
WITH raw_daily AS (
    SELECT CAST(created_at AS DATE) AS d, COUNT(*) AS raw_count
    FROM orders_raw
    GROUP BY 1
), clean_daily AS (
    SELECT order_date AS d, COUNT(*) AS clean_count
    FROM orders_clean
    GROUP BY 1
)
SELECT COALESCE(r.d, c.d) AS d,
       COALESCE(r.raw_count, 0) AS raw_count,
       COALESCE(c.clean_count, 0) AS clean_count,
       COALESCE(c.clean_count, 0) - COALESCE(r.raw_count, 0) AS diff
FROM raw_daily r
FULL OUTER JOIN clean_daily c USING (d)
ORDER BY d;
Why: Ensures your clean table didn’t lose or gain rows compared to the source.
3) Null-rate and value-range checks
-- Null-rate check for user_id
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS null_user_id,
100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*),0) AS null_pct
FROM events;
-- Value range check for order amount
SELECT COUNT(*) AS out_of_range
FROM orders_clean
WHERE amount_usd < 0 OR amount_usd > 100000; -- adjust expected range
Why: Sudden null spikes or impossible values often signal ingestion or transformation issues.
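4) Orphaned records (referential integrity)
A fourth check worth running alongside the three above; this is a minimal sketch, assuming every non-null events.user_id should have a matching users.id.
-- Count events whose user_id has no matching parent row in users (sketch)
SELECT COUNT(*) AS orphaned_events
FROM events e
LEFT JOIN users u ON e.user_id = u.id
WHERE e.user_id IS NOT NULL
  AND u.id IS NULL;
Why: Orphaned records usually point to a broken join key or out-of-order loads, and they silently disappear from INNER JOIN reports.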
Who this is for
- Product Analysts who own dashboards and experiments.
- Anyone writing SQL for reports or metrics who needs reliable data.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
- Comfort with reading table schemas and identifying keys.
Learning path
- Identify critical tables: orders, events, users.
- Define expectations: primary keys unique, no orphaned FKs, stable daily counts.
- Write checks: duplicates, nulls, ranges, referential integrity, reconciliations.
- Make thresholds explicit: e.g., null user_id rate < 1% (see the sketch after this list).
- Automate: schedule daily queries; review failures in a simple report.
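A threshold becomes actionable when the check itself returns rows only on failure. A sketch of the 1% null user_id rule against the events table from the worked examples:
-- Returns a row only when the null user_id rate exceeds 1% (sketch; threshold is an example value)
SELECT
  100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) AS null_pct
FROM events
HAVING 100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) > 1.0;
An empty result means the check passed, which makes it easy to wire into a scheduled job or alert.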
Practice (mirrors the exercises below)
Exercise 1 — Detect duplicates and nulls in a primary key column
Tables: orders_raw(id, user_id, created_at, amount, status)
- Write a summary query returning total rows, distinct ids, and null ids.
- List duplicate ids with their counts.
Tip
Use COUNT(*), COUNT(DISTINCT id), and GROUP BY id HAVING COUNT(*) > 1.
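If you want a starting point, here is a skeleton for the summary part (the duplicate listing mirrors worked example 1):
-- Summary of key integrity stats for the primary key column (starter sketch)
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT id) AS distinct_ids,
  SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS null_ids
FROM orders_raw;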
Exercise 2 — Reconcile row counts by day between source and clean
Tables: orders_raw(id, created_at), orders_clean(order_id, order_date)
- Produce daily counts from both tables and the difference.
- Return only days where diff != 0.
Tip
Cast or truncate timestamps to the day, then FULL OUTER JOIN and COALESCE.
Checklist before you say “done”
- I can find duplicates with GROUP BY + HAVING COUNT(*) > 1.
- I can reconcile daily row counts between two tables.
- I can compute null rates and set thresholds.
- I can detect orphaned records with LEFT JOIN ... IS NULL.
- I can validate value ranges and enums.
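For the last item, an enum check mirrors the range check in worked example 3. A sketch against orders_raw; the allowed status values here are assumptions to replace with your own list:
-- Rows whose status is outside the expected set (sketch; allowed values are assumed)
SELECT status, COUNT(*) AS cnt
FROM orders_raw
WHERE status IS NULL
   OR status NOT IN ('created', 'paid', 'shipped', 'canceled')
GROUP BY status;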
Common mistakes and self-check
- Only checking totals: Totals can hide problems. Add daily breakdowns and distinct-ID checks.
- Forgetting COALESCE in reconciliations: Missing dates vanish. Always COALESCE counts to 0.
- Comparing timestamps at full precision: Truncate to day when validating daily counts.
- Not filtering soft-deleted rows: Apply business rules (e.g., status NOT IN ('canceled')).
- Ignoring keys: Define PK/FK expectations before writing checks.
How to self-check fast
- Run your checks for the last 7–30 days and scan for non-zero diffs.
- Compare before/after a known pipeline change.
- Validate one day manually in both source and target.
Practical projects
- Daily quality dashboard: counts, distinct IDs, null rates, and diffs for top-5 tables.
- Funnel integrity report: step-wise drop-offs plus duplicate and null checks per step.
- Revenue sanity checks: sum(amount), negative value scan, extreme outlier detection.
Mini challenge
Write a single query to find event days where the share of null user_id is above 2% and the total events increased by over 10% vs. the 7-day average (use a simple average of the previous 7 days). Return day, null_pct, total_events, and pct_change_vs_7d_avg.
Hint
Compute daily aggregates, then add the average of the previous 7 days with a window function ordered by date.
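The rolling-average piece is the only pattern not shown earlier. A sketch of the window frame, assuming your daily aggregates already live in a CTE named daily with columns d and total_events:
-- Average of the previous 7 days' totals, excluding the current day (sketch; daily CTE is assumed)
SELECT
  d,
  total_events,
  AVG(total_events) OVER (
    ORDER BY d
    ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
  ) AS avg_prev_7d
FROM daily;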
Next steps
- Extend checks to referential integrity (users to events, orders to users).
- Add anomaly thresholds for key KPIs (e.g., conversion, AOV) alongside raw counts.
- Automate: schedule queries and review a daily “green/red” status report.