Why this matters
As a Product Analyst, your charts, experiments, and business decisions are only as good as the data behind them. Data quality checks help you catch issues early: missing rows after a pipeline change, unexpected spikes of nulls, duplicate events inflating metrics, or broken joins causing orphaned records.
- Daily dashboards: Confirm counts and key metrics haven’t silently drifted.
- Experiment analysis: Ensure assignment and events are complete and deduplicated.
- Revenue reporting: Validate totals, currencies, and outliers before leadership sees them.
Real-world examples you’ll face
- New ETL step halves yesterday’s orders — detect with row-count reconciliation.
- Client-side bug doubles event sends — catch via duplicate-key checks.
- Tracking change introduces null user_id — spot via null-rate thresholds.
Concept explained simply
Data quality checks are short SQL queries that verify assumptions about your data: uniqueness, completeness, validity, and consistency. Think of them as small unit tests for your warehouse tables.
Mental model
- Uniqueness: “Each primary key appears once.”
- Completeness: “We didn’t lose rows vs. source.”
- Validity: “Values are in expected ranges/enums.”
- Consistency: “Foreign keys match parent tables.”
Pro tip
Prefer stable checks (counts, distinct IDs, null rates) over fragile ones that break on every routine data change. Start small and automate the most business-critical checks first.
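One way to keep checks stable is a single per-day summary that tracks counts, distinct IDs, and null rates together. A minimal sketch against the events table used later; event_time is an assumed column name, so adjust it to your schema:
-- Per-day stability summary (sketch; assumes events has event_time and user_id columns)
SELECT
  CAST(event_time AS DATE) AS d,
  COUNT(*) AS total_events,
  COUNT(DISTINCT user_id) AS distinct_users,
  100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) AS null_user_pct
FROM events
GROUP BY 1
ORDER BY 1;
If any of these three numbers moves sharply day over day, you have a lead worth investigating before it reaches a dashboard.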
Worked examples
1) Duplicate primary keys
-- Find duplicate order IDs
SELECT id AS duplicate_id, COUNT(*) AS cnt
FROM orders_raw
GROUP BY id
HAVING COUNT(*) > 1;
Why: Duplicates can inflate metrics. This query lists any IDs appearing more than once.
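Once duplicates are confirmed, a common follow-up is to isolate the extra copies before deciding how to deduplicate. A sketch using ROW_NUMBER(), assuming created_at is a reasonable tie-breaker:
-- List the extra copies, keeping the earliest row per id (sketch; created_at as tie-breaker is an assumption)
SELECT *
FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS rn
  FROM orders_raw o
) t
WHERE rn > 1;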
2) Row-count reconciliation by day
-- Compare daily row counts between the raw and clean tables
WITH raw_daily AS (
    SELECT CAST(created_at AS DATE) AS d, COUNT(*) AS raw_count
    FROM orders_raw
    GROUP BY 1
), clean_daily AS (
    SELECT order_date AS d, COUNT(*) AS clean_count
    FROM orders_clean
    GROUP BY 1
)
SELECT COALESCE(r.d, c.d) AS d,
       COALESCE(r.raw_count, 0) AS raw_count,
       COALESCE(c.clean_count, 0) AS clean_count,
       COALESCE(c.clean_count, 0) - COALESCE(r.raw_count, 0) AS diff
FROM raw_daily r
FULL OUTER JOIN clean_daily c USING (d)
ORDER BY d;
Why: Ensures your clean table didn’t lose or gain rows compared to the source.
3) Null-rate and value-range checks
-- Null-rate check for user_id
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS null_user_id,
100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*),0) AS null_pct
FROM events;
-- Value range check for order amount
SELECT COUNT(*) AS out_of_range
FROM orders_clean
WHERE amount_usd < 0 OR amount_usd > 100000; -- adjust expected range
Why: Sudden null spikes or impossible values often signal ingestion or transformation issues.
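4) Orphaned records (referential integrity)
A fourth check worth running alongside the three above; this is a minimal sketch, assuming every non-null events.user_id should have a matching users.id.
-- Count events whose user_id has no matching parent row in users (sketch)
SELECT COUNT(*) AS orphaned_events
FROM events e
LEFT JOIN users u ON e.user_id = u.id
WHERE e.user_id IS NOT NULL
  AND u.id IS NULL;
Why: Orphaned records usually point to a broken join key or out-of-order loads, and they silently disappear from INNER JOIN reports.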
Who this is for
- Product Analysts who own dashboards and experiments.
- Anyone writing SQL for reports or metrics who needs reliable data.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
- Comfort with reading table schemas and identifying keys.
Learning path
- Identify critical tables: orders, events, users.
- Define expectations: primary keys unique, no orphaned FKs, stable daily counts.
- Write checks: duplicates, nulls, ranges, referential integrity, reconciliations.
- Make thresholds explicit: e.g., null user_id rate < 1% (see the sketch after this list).
- Automate: schedule daily queries; review failures in a simple report.
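A threshold becomes actionable when the check itself returns rows only on failure. A sketch of the 1% null user_id rule against the events table from the worked examples:
-- Returns a row only when the null user_id rate exceeds 1% (sketch; threshold is an example value)
SELECT
  100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) AS null_pct
FROM events
HAVING 100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) > 1.0;
An empty result means the check passed, which makes it easy to wire into a scheduled job or alert.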
Practice (mirrors the exercises below)
Exercise 1 — Detect duplicates and nulls in a primary key column
Tables: orders_raw(id, user_id, created_at, amount, status)
- Write a summary query returning total rows, distinct ids, and null ids.
- List duplicate ids with their counts.
Tip
Use COUNT(*), COUNT(DISTINCT id), and GROUP BY id HAVING COUNT(*) > 1.
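If you want a starting point, here is a skeleton for the summary part (the duplicate listing mirrors worked example 1):
-- Summary of key integrity stats for the primary key column (starter sketch)
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT id) AS distinct_ids,
  SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS null_ids
FROM orders_raw;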
Exercise 2 — Reconcile row counts by day between source and clean
Tables: orders_raw(id, created_at), orders_clean(order_id, order_date)
- Produce daily counts from both tables and the difference.
- Return only days where diff != 0.
Tip
Cast or truncate timestamps to the day, then FULL OUTER JOIN and COALESCE.
Checklist before you say “done”
- I can find duplicates with GROUP BY + HAVING COUNT(*) > 1.
- I can reconcile daily row counts between two tables.
- I can compute null rates and set thresholds.
- I can detect orphaned records with LEFT JOIN ... IS NULL.
- I can validate value ranges and enums.
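For the last item, an enum check mirrors the range check in worked example 3. A sketch against orders_raw; the allowed status values here are assumptions to replace with your own list:
-- Rows whose status is outside the expected set (sketch; allowed values are assumed)
SELECT status, COUNT(*) AS cnt
FROM orders_raw
WHERE status IS NULL
   OR status NOT IN ('created', 'paid', 'shipped', 'canceled')
GROUP BY status;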
Common mistakes and self-check
- Only checking totals: Totals can hide problems. Add daily breakdowns and distinct-ID checks.
- Forgetting COALESCE in reconciliations: Missing dates vanish. Always COALESCE counts to 0.
- Comparing timestamps at full precision: Truncate to day when validating daily counts.
- Not filtering soft-deleted rows: Apply business rules (e.g., status NOT IN ('canceled')).
- Ignoring keys: Define PK/FK expectations before writing checks.
How to self-check fast
- Run your checks for the last 7–30 days and scan for non-zero diffs.
- Compare before/after a known pipeline change.
- Validate one day manually in both source and target.
Practical projects
- Daily quality dashboard: counts, distinct IDs, null rates, and diffs for top-5 tables.
- Funnel integrity report: step-wise drop-offs plus duplicate and null checks per step.
- Revenue sanity checks: sum(amount), negative value scan, extreme outlier detection.
Mini challenge
Write a single query to find event days where the share of null user_id is above 2% and the total events increased by over 10% vs. the 7-day average (use a simple average of the previous 7 days). Return day, null_pct, total_events, and pct_change_vs_7d_avg.
Hint
Compute daily aggregates, then add the average of the previous 7 days with a window function ordered by date.
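The rolling-average piece is the only pattern not shown earlier. A sketch of the window frame, assuming your daily aggregates already live in a CTE named daily with columns d and total_events:
-- Average of the previous 7 days' totals, excluding the current day (sketch; daily CTE is assumed)
SELECT
  d,
  total_events,
  AVG(total_events) OVER (
    ORDER BY d
    ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
  ) AS avg_prev_7d
FROM daily;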
Next steps
- Extend checks to referential integrity (users to events, orders to users).
- Add anomaly thresholds for key KPIs (e.g., conversion, AOV) alongside raw counts.
- Automate: schedule queries and review a daily “green/red” status report.