Why Data Quality Checks matter for BI Analysts
- Catch issues early: identify missing events, stale data, broken joins, and unexpected nulls.
- Explain variance: reconcile metrics with source systems to pinpoint where differences arise.
- Safeguard consistency: keep the same metric equal across dashboards and teams.
- Scale responsibly: automate checks and alerts as data volume and scope grow.
Who this is for
- Aspiring and current BI Analysts who build or maintain dashboards and reports.
- Data-savvy PMs or Operations Analysts who own KPIs and need reliable numbers.
- Analytics Engineers who partner with BI to harden metric pipelines.
Prerequisites
- SQL basics: SELECT, WHERE, GROUP BY, JOIN, and window functions (e.g., ROW_NUMBER, COUNT(*) OVER).
- Familiarity with your warehouse and BI tool (e.g., dashboards, data refresh settings).
- Basic statistics: median vs mean, percentiles, outliers.
Learning path (practical roadmap)
Milestone 1 — Map critical metrics and sources
List your top 5 business metrics (e.g., Orders, Revenue, Active Users). For each metric, note the source system, ingestion path, warehouse tables, and dashboard owners.
- Deliverable: a one-page metric-to-source map.
- Mini task: identify the primary keys and time columns per metric.
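For the mini task, one quick way to surface candidate keys and time columns is to scan the information schema. A minimal sketch, assuming a Postgres-style warehouse and the illustrative table names fact_orders and fact_events; adjust the filters to your naming conventions.
-- Sketch: list likely key and timestamp columns per fact table
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_name IN ('fact_orders', 'fact_events')
  AND (column_name LIKE '%_id'
       OR data_type IN ('date', 'timestamp without time zone', 'timestamp with time zone'))
ORDER BY table_name, column_name;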
Milestone 2 — Profile and baseline
Establish normal behavior so you can detect anomalies later.
- Compute daily row counts, null rates for important columns, and unique-key coverage.
- Deliverable: a “baseline sheet” with typical ranges (e.g., weekday vs weekend).
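A baseline sheet can be fed by one profiling query per fact table. The sketch below is one possible shape, assuming fact_orders with order_date, order_id, and customer_id columns; swap in your own table and column names.
-- Sketch: daily row counts, null rate, and unique-key coverage for the baseline sheet
SELECT order_date::date AS d,
       COUNT(*) AS row_count,
       100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS customer_null_pct,
       COUNT(DISTINCT order_id) AS distinct_orders,
       100.0 * COUNT(DISTINCT order_id) / COUNT(*) AS key_uniqueness_pct
FROM fact_orders
GROUP BY 1
ORDER BY 1;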
Milestone 3 — Freshness and completeness
Ensure data arrives on time and fully.
- Freshness: compare max event time with current time or schedule time.
- Completeness: compare counts vs recent average or upstream logs.
Milestone 4 — Duplicates and outliers
Find and handle duplicates, and detect outliers without over-alerting.
- Use window functions for duplicates by business key + time.
- Use percentiles/IQR for robust outlier checks.
Milestone 5 — Join integrity and null monitoring
Identify broken joins and missing dimension coverage.
- Monitor null rates for foreign keys and important attributes.
- Check join cardinalities and orphaned records.
Milestone 6 — Reconcile across systems and reports
Compare warehouse facts to source-of-truth, and keep dashboards aligned.
- Drill down by date, product, and channel to localize differences.
- Escalate with clear evidence and reproducible queries.
Milestone 7 — Automate checks and alerts
Start with the essentials; tune thresholds to reduce noise.
- Create scheduled queries that write check results to a checks table (see the sketch after this list).
- Set conditional alerts: freshness delay, null spikes, count deltas.
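A minimal sketch of the checks-table pattern, assuming a Postgres-style warehouse; the dq_checks table, its columns, and the 2-hour threshold are illustrative, so adapt the layout to your scheduler and tooling.
-- Sketch: a simple checks table plus one scheduled freshness check writing into it
CREATE TABLE IF NOT EXISTS dq_checks (
  check_date date,
  check_name text,
  value numeric,
  threshold numeric,
  status text,
  note text
);
INSERT INTO dq_checks
SELECT CURRENT_DATE,
       'fact_events_freshness_hours',
       EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - MAX(event_time))) / 3600,
       2,  -- placeholder: acceptable delay in hours
       CASE WHEN CURRENT_TIMESTAMP - MAX(event_time) <= INTERVAL '2 hours' THEN 'pass' ELSE 'fail' END,
       'latest event_time vs now'
FROM fact_events;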
Milestone 8 — Root cause analysis and documentation
Build a repeatable incident flow and document known data limitations.
- RCA checklist: scope, reproduce, isolate layer (source, ETL, model, viz), fix, prevent.
- Write a “data caveats” section per metric.
Worked examples
Example 1 — Metric reconciliation with source
Goal: Compare daily revenue between source and warehouse to localize differences.
-- Warehouse metric (fact_orders)
WITH wh AS (
SELECT order_date::date AS d, SUM(total_amount) AS wh_rev
FROM fact_orders
GROUP BY 1
),
-- Source metric (raw.sales_orders)
src AS (
SELECT created_at::date AS d, SUM(amount) AS src_rev
FROM raw.sales_orders
GROUP BY 1
)
SELECT COALESCE(wh.d, src.d) AS date,
wh.wh_rev,
src.src_rev,
(wh.wh_rev - src.src_rev) AS diff
FROM wh
FULL OUTER JOIN src USING(d)
ORDER BY date;
How to use
If diff spikes on a specific date, drill by channel, product, or currency to isolate the scope. Check for late-arriving updates or currency conversion differences.
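To drill down, one option is to repeat the reconciliation with an extra grouping column. A sketch, assuming both tables carry a channel column; swap in product or currency as needed.
-- Sketch: localize the revenue gap by date and channel
WITH wh AS (
  SELECT order_date::date AS d, channel, SUM(total_amount) AS wh_rev
  FROM fact_orders
  GROUP BY 1, 2
),
src AS (
  SELECT created_at::date AS d, channel, SUM(amount) AS src_rev
  FROM raw.sales_orders
  GROUP BY 1, 2
)
SELECT COALESCE(wh.d, src.d) AS date,
       COALESCE(wh.channel, src.channel) AS channel,
       wh.wh_rev,
       src.src_rev,
       (wh.wh_rev - src.src_rev) AS diff
FROM wh
FULL OUTER JOIN src
  ON wh.d = src.d AND wh.channel = src.channel
ORDER BY date, channel;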
Example 2 — Event completeness and freshness
Goal: Ensure events are up-to-date and daily counts look normal.
-- Freshness: latest event timestamp
SELECT MAX(event_time) AS latest_event_time
FROM fact_events;
-- Completeness: compare today vs the average of the same weekday over prior weeks
WITH daily AS (
  SELECT event_time::date AS d, COUNT(*) AS cnt
  FROM fact_events
  GROUP BY 1
),
base AS (
  SELECT d,
         cnt,
         AVG(cnt) OVER (
           PARTITION BY EXTRACT(DOW FROM d)
           ORDER BY d ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
         ) AS dow_avg
  FROM daily
)
SELECT * FROM base WHERE d = CURRENT_DATE;
How to use
Alert if latest_event_time is beyond your acceptable delay (e.g., > 2 hours). Alert if today’s cnt is < 70% of dow_avg after the expected load time.
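If you want a scheduler to evaluate these conditions directly, they can be expressed as boolean flags in one query. A sketch that reuses the completeness CTEs above, with the 2-hour delay and 70% threshold as placeholders to tune.
-- Sketch: freshness and completeness alert conditions as pass/fail flags
WITH daily AS (
  SELECT event_time::date AS d, COUNT(*) AS cnt
  FROM fact_events
  GROUP BY 1
),
base AS (
  SELECT d, cnt,
         AVG(cnt) OVER (
           PARTITION BY EXTRACT(DOW FROM d)
           ORDER BY d ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
         ) AS dow_avg
  FROM daily
)
SELECT (SELECT MAX(event_time) >= CURRENT_TIMESTAMP - INTERVAL '2 hours' FROM fact_events) AS freshness_ok,
       (SELECT cnt >= 0.7 * dow_avg FROM base WHERE d = CURRENT_DATE) AS completeness_ok;
       -- completeness_ok stays NULL until today's partition has rows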
Example 3 — Detecting duplicates
Goal: Identify duplicate events by (event_id) or by a composite key.
-- Find duplicate event_ids
WITH marked AS (
SELECT event_id,
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time) AS rn
FROM fact_events
)
SELECT * FROM marked WHERE rn > 1;
-- For composite key (user_id, session_id, event_name, event_time::date)
WITH marked AS (
SELECT user_id, session_id, event_name, event_time::date AS d,
ROW_NUMBER() OVER (
PARTITION BY user_id, session_id, event_name, event_time::date
ORDER BY event_time
) AS rn
FROM fact_events
)
SELECT * FROM marked WHERE rn > 1;
How to use
Decide the business key that should be unique. If duplicates arise from replays, deduplicate at the model stage using rn = 1, and coordinate with ingestion teams.
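A minimal sketch of the model-stage dedupe, assuming event_id is the key that should be unique and the earliest event_time should win; the view name fact_events_dedup is illustrative.
-- Sketch: deduped view keeping one row per event_id
CREATE OR REPLACE VIEW fact_events_dedup AS
SELECT event_id, user_id, session_id, event_name, event_time
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time) AS rn
  FROM fact_events
) ranked
WHERE rn = 1;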
Example 4 — Null rates and broken joins
Goal: Monitor missing dimension coverage and join issues.
-- Foreign key null rate over time
SELECT order_date::date AS d,
       COUNT(*) AS row_count,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_count,
       100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate_pct
FROM fact_orders
GROUP BY 1
ORDER BY 1;
-- Broken join check: missing dimension keys
WITH lefted AS (
SELECT f.order_id, f.customer_id, d.customer_id AS dim_key
FROM fact_orders f
LEFT JOIN dim_customer d ON f.customer_id = d.customer_id
)
SELECT COUNT(*) AS missing_customers
FROM lefted
WHERE dim_key IS NULL AND customer_id IS NOT NULL;
How to use
Alert on sudden spikes in null_rate_pct. Investigate: Did dimension keys change? Did the upstream stop populating customer_id?
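To make "sudden spike" checkable, compare each day's rate to its trailing average. A sketch, with the 2x multiplier as a placeholder threshold to tune.
-- Sketch: flag days whose FK null rate is far above the trailing 7-day average
WITH daily AS (
  SELECT order_date::date AS d,
         100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate_pct
  FROM fact_orders
  GROUP BY 1
),
trended AS (
  SELECT d,
         null_rate_pct,
         AVG(null_rate_pct) OVER (ORDER BY d ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS trailing_avg
  FROM daily
)
SELECT *
FROM trended
WHERE null_rate_pct > 2 * trailing_avg;  -- 2x is a placeholder multiplier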
Example 5 — Outlier detection with percentiles
Goal: Flag unusually large order amounts using robust thresholds.
-- Percentile-based cutoffs (requires percentile_cont or equivalent)
WITH stats AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) AS p25,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) AS p75
FROM fact_orders
),
calc AS (
SELECT f.*, s.p25, s.p75,
(s.p75 - s.p25) AS iqr
FROM fact_orders f CROSS JOIN stats s
)
SELECT *
FROM calc
WHERE total_amount > (p75 + 1.5 * iqr)
OR total_amount < (p25 - 1.5 * iqr);
Notes
If your warehouse lacks PERCENTILE_CONT, use its equivalent (e.g., APPROX_PERCENTILE) or approximate the cutoffs with NTILE buckets, as sketched below. Tune the 1.5 multiplier to your business tolerance.
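A bucket-based approximation works in most dialects; a sketch using NTILE(100), where the cutoffs land on bucket edges, which is usually close enough for alerting.
-- Sketch: approximate P25/P75 with NTILE buckets when exact percentile functions are unavailable
WITH buckets AS (
  SELECT total_amount,
         NTILE(100) OVER (ORDER BY total_amount) AS pct_bucket
  FROM fact_orders
)
SELECT MAX(CASE WHEN pct_bucket <= 25 THEN total_amount END) AS approx_p25,
       MAX(CASE WHEN pct_bucket <= 75 THEN total_amount END) AS approx_p75
FROM buckets;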
Example 6 — Cross-report consistency
Goal: Ensure a metric matches across two curated datasets feeding different dashboards.
WITH a AS (
SELECT order_date::date AS d, SUM(total_amount) AS rev_a
FROM mart_finance_revenue
GROUP BY 1
), b AS (
SELECT order_date::date AS d, SUM(total_amount) AS rev_b
FROM mart_ops_revenue
GROUP BY 1
)
SELECT COALESCE(a.d, b.d) AS date,
rev_a, rev_b,
(rev_a - rev_b) AS delta
FROM a FULL OUTER JOIN b USING(d)
ORDER BY date;
How to use
If delta appears only for certain channels, align business rules (refund handling, FX rates, late corrections) and document the agreed definition.
Drills and micro-exercises
- Write a query to compute daily null rates for two key columns in your main fact table.
- Find duplicate rows by your business key and produce a deduped view with ROW_NUMBER.
- Create a daily count trend for the last 30 days and annotate typical weekday/weekend ranges.
- Compare two versions of the same metric and produce a table of deltas by date and channel.
- Compute percentile cutoffs for a metric and list top 50 outliers.
- Build a small checks table (date, check_name, status, value, threshold, note) and insert results from two checks.
Common mistakes and debugging tips
Mistake: Alerting on raw variance without context
Tip: Compare vs rolling averages or same-weekday baselines. Include expected load windows before alerting.
Mistake: Ignoring surrogate key changes
Tip: Track business identifiers (e.g., external_id) and ensure consistent joins across slowly changing dimensions.
Mistake: Overreliance on mean + 3σ for skewed data
Tip: Use percentiles or IQR for heavy-tailed distributions to reduce false alarms.
Mistake: One-off fixes without documenting
Tip: Add a “Known Data Limitations” note with the scope, impact, and workaround. Keep it close to the metric definition.
Mistake: Not localizing discrepancies
Tip: Slice by date, product, and channel to pinpoint the segment causing gaps before escalating.
Mini project: Data Trust Pack for the Sales Funnel
- Define funnel metrics: sessions, adds-to-cart, checkouts, purchases.
- Create baseline queries: daily counts and null rates for key IDs (user_id, session_id, order_id).
- Freshness check: latest event_time and acceptable delay per table.
- Completeness: today’s counts vs rolling 7-day same-weekday baseline.
- Duplicates: find duplicates by (order_id) and dedupe view.
- Outliers: flag abnormally high order_amount using IQR.
- Join integrity: percent of orders missing customers in dim_customer.
- Reconciliation: compare purchase count with source system per day.
- Automation: write check outputs into a checks table and mark pass/fail.
- Documentation: a one-pager listing caveats and what each alert means.
Success criteria
- At least 7 checks implemented with clear thresholds.
- One page of caveats and owner contacts.
- Evidence of a simulated incident and a short RCA note.
Practical projects
- Marketing Pipeline Guardrails: build completeness and freshness checks for ad spend, clicks, and attributed conversions. Include a weekly reconciliation vs ad platform totals.
- Finance Revenue Assurance: validate gross vs net revenue across refund logic; document currency conversion assumptions and test edge cases.
- Product Events Quality Board: track event schema changes, required property coverage, and session stitching health; create an alert for sudden property null spikes.
Subskills
Master these focused subskills to solidify your data quality practice:
- Metric Reconciliation With Source Systems — Confirm warehouse metrics match the system of record; localize gaps by date/channel.
- Validating Event Completeness And Freshness — Ensure data arrives on time and fully; baseline expected volumes.
- Checking Duplicates And Outliers — Keep unique records clean and flag abnormal values safely.
- Monitoring Null Rates And Broken Joins — Surface missing keys and orphaned rows early.
- Data Consistency Across Reports — Align definitions and verify equal numbers across dashboards.
- Automated Checks And Alerts Basics — Schedule queries and trigger targeted, low-noise alerts.
- Root Cause Analysis For Data Issues — Reproduce, isolate, fix, and prevent incidents methodically.
- Documenting Known Data Limitations — Communicate caveats, scope, and workarounds to stakeholders.
Next steps
- Practice: implement two checks in your environment and review results after a full day of data.
- Take the skill exam below. Everyone can take it for free; if you log in, your progress is saved.
- Pick one practical project and deliver a concise readme summarizing your checks and thresholds.