Why this matters
As a BI Analyst, you are the person people turn to when a metric looks wrong. Root Cause Analysis (RCA) takes you from “the number looks off” to a clear, verified answer plus a prevention plan. You’ll use RCA to:
- Explain sudden metric changes in dashboards (spikes, drops, flatlines).
- Uncover join duplication, stale pipelines, and broken business rules.
- Align definitions across teams to stop recurring issues.
- Protect trust in reports before leadership decisions are made.
Concept explained simply
RCA finds the underlying reason a data symptom exists. The symptom is what you see (e.g., revenue doubled). The root cause is the mechanism that produced it (e.g., a join change that doubled rows).
Core tools
- 5 Whys: ask “why?” repeatedly until you reach a fixable cause.
- Fishbone thinking: check categories like Data, Process, Tools, People, Timing, and Definitions.
- Change-first mindset: most issues follow a change (code, source, schedule, definition, upstream vendor).
- Compare-and-contrast: where does the issue appear vs. not appear? (time, segments, sources).
Mental model
Think of a “data factory” with 6S checkpoints:
- Signal: What exactly is wrong? Define the symptom precisely.
- Scope: When did it start, who/what is affected, and where does it not happen?
- Segment: Slice by time, source, country, product, environment.
- Source: Validate at the earliest reliable source (raw logs, source tables).
- SQL: Reproduce, isolate joins/filters/aggregations.
- Safeguard: Fix and prevent (tests, alerts, docs).
Step-by-step RCA workflow
- Verify the symptom: Recompute the metric from source data. Confirm data freshness (load times, max timestamps, row counts).
- Scope the impact: Identify first bad time, affected dashboards, and unaffected controls.
- Segment to isolate: Split by dimension (country, product, channel). Look for where it breaks/works.
- Trace upstream: Check model layers (staging → marts → dashboards). Compare counts and key metrics at each layer.
- Inspect changes: Review recent changes to code, definitions, pipeline schedules, and source systems.
- Prove the cause: Create a minimal, reproducible query demonstrating the break and the fix.
- Fix and prevent: Implement the correction, add tests and monitors, and document the decision.
Worked examples
Example 1: Revenue doubled overnight
Symptom: Revenue doubled on the dashboard.
Quick checks: Data is fresh; spike starts after a release.
Trace: In the mart, orders joined to order_items at the row level.
Root cause: Join created a row explosion; revenue summed per item instead of per order.
Fix: Aggregate order_items to one row per order_id before joining, or compute revenue with distinct-safe measures at order granularity.
Prevent: Add tests for join multiplicity and revenue continuity.
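The join-explosion scenario above can be reproduced in a few lines. This is a minimal sketch using Python’s built-in sqlite3 in place of the warehouse; the table names mirror the example, but the rows and revenue figures are invented.

```python
# Sketch of the row-explosion bug: joining orders to order_items at the
# item level repeats each order's revenue once per item.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, revenue REAL);
CREATE TABLE order_items (order_id INTEGER, sku TEXT);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
INSERT INTO order_items VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# Broken: order 1 has two items, so its revenue is summed twice.
broken = con.execute("""
    SELECT SUM(o.revenue)
    FROM orders o
    JOIN order_items i ON o.order_id = i.order_id
""").fetchone()[0]  # 250.0 instead of the true 150.0

# Fix: aggregate order_items to one row per order_id before joining.
fixed = con.execute("""
    SELECT SUM(o.revenue)
    FROM orders o
    JOIN (SELECT order_id, COUNT(*) AS n_items
          FROM order_items GROUP BY order_id) i
      ON o.order_id = i.order_id
""").fetchone()[0]  # 150.0

print(broken, fixed)
```

The same pre-aggregation pattern works in any warehouse SQL: collapse the many-side of the join to the grain of the measure before summing.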
Example 2: Daily active users (DAU) dropped 20% yesterday
Symptom: DAU down 20% for one day.
Quick checks: Max event_time is 16 hours old; ETL ran late.
Root cause: Pipeline delay caused partial day data.
Fix: Re-run the job or backfill the missing partition.
Prevent: Freshness alerts and dashboard banner when data is incomplete.
Example 3: Country shares shifted suddenly
Symptom: US share falls; “Unknown” rises.
Trace: Upstream geo mapping updated; some IPs now unclassified.
Root cause: Vendor mapping change increased “Unknown”.
Fix: Update dim_geo to the new mapping; add logic to reclassify or redistribute “Unknown” traffic.
Prevent: Monitor dimension cardinality and top values.
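The top-values monitor suggested under Prevent can be sketched as a simple share comparison. The country shares below are invented for illustration; in practice both distributions would come from a GROUP BY query.

```python
# Sketch of a dimension-shift monitor: flag dimension values whose share
# of traffic moved more than a tolerance versus a baseline period.
def shifted_values(baseline: dict, current: dict, tol: float = 0.05) -> list:
    """Return dimension values whose share changed by more than tol."""
    keys = set(baseline) | set(current)
    return sorted(k for k in keys
                  if abs(current.get(k, 0.0) - baseline.get(k, 0.0)) > tol)

baseline = {"US": 0.40, "DE": 0.20, "Unknown": 0.05}
current  = {"US": 0.28, "DE": 0.21, "Unknown": 0.16}
print(shifted_values(baseline, current))  # ['US', 'Unknown']
```

A shift of share from a named value into “Unknown”, as here, is the classic signature of an upstream mapping change.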
Example 4: Conversion rate flatlines at 0%
Symptom: Conversion 0% after deployment.
Trace: Filter changed from “purchase” to “Purchase” (case-sensitive).
Root cause: Case mismatch dropped all purchase events.
Fix: Normalize case in the filter logic (e.g., compare on LOWER(event_name)).
Prevent: Standardize enums and add tests for key event presence.
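The case-mismatch bug is easy to demonstrate. This sketch uses plain Python filters in place of the dashboard’s SQL WHERE clause; the event names are hypothetical.

```python
# Sketch of the case-sensitivity bug: an exact-match filter on the old
# lowercase value drops every "Purchase" event after the rename.
events = ["Purchase", "page_view", "Purchase", "signup"]

broken = [e for e in events if e == "purchase"]          # [] -> conversion 0%
fixed  = [e for e in events if e.lower() == "purchase"]  # both purchases kept

print(len(broken), len(fixed))  # 0 2
```

The SQL equivalent of the fix is filtering on LOWER(event_name) = 'purchase', paired with a test that key events are present every day.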
Diagnostic techniques and useful queries
Check data freshness
-- Replace table/column names for your stack
SELECT MAX(event_time) AS last_event_time, COUNT(*) AS rows_today
FROM fact_events
WHERE event_date = CURRENT_DATE;
Detect join duplication
-- Compare base vs joined row counts
SELECT COUNT(*) AS joined_rows, COUNT(DISTINCT o.order_id) AS distinct_orders
FROM orders o
LEFT JOIN order_items i ON o.order_id = i.order_id;
Find first bad day
SELECT dt, SUM(metric) AS m
FROM mart_daily
GROUP BY dt
ORDER BY dt;
-- Look for the first date where m deviates sharply
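That visual scan can be automated. Here is a minimal sketch that flags the first date whose value deviates sharply from the prior day; the daily series and the 50% threshold are invented for illustration.

```python
# Sketch of "find the first bad day": return the first date whose value
# changes by more than a threshold ratio versus the previous day.
def first_bad_day(series: list, threshold: float = 0.5):
    """series: [(date, value)] sorted by date."""
    for (prev_d, prev_v), (d, v) in zip(series, series[1:]):
        if prev_v and abs(v - prev_v) / prev_v > threshold:
            return d
    return None

daily = [("2024-05-01", 100), ("2024-05-02", 104),
         ("2024-05-03", 210), ("2024-05-04", 205)]
print(first_bad_day(daily))  # 2024-05-03 -- the day the metric roughly doubled
```

In SQL the same idea is a LAG() window function comparing each day to the previous one.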
Compare across segments
SELECT country, dt, SUM(metric) AS m
FROM mart_daily
WHERE dt BETWEEN CURRENT_DATE - 14 AND CURRENT_DATE -- date arithmetic varies by dialect
GROUP BY country, dt
ORDER BY country, dt;
Exercises
Do these to practice.
Exercise 1: Diagnose doubled order counts
Scenario: The “Sales Summary” dashboard shows order_count doubled week-over-week.
Available tables: orders(order_id, created_at, customer_id), order_items(order_id, sku, quantity), customers(customer_id, country).
- Write diagnostic SQL to compare order counts by day using: (a) COUNT(*), (b) COUNT(DISTINCT order_id).
- Test the join between orders and order_items. Check if row counts explode.
- State the most likely root cause and a safe fix.
Hints
- Start from orders only. Then add the join and compare counts.
- If counts double only after the join, it’s a multiplicity problem.
- Aggregate order_items to order_id before joining.
Expected outcome
- Evidence that COUNT(*) rises but COUNT(DISTINCT order_id) does not.
- Conclusion: join duplication inflated counts.
- Fix: aggregate first or use distinct-safe measures.
Exercise 2: Investigate a DAU drop
Scenario: DAU dropped by ~20% yesterday on the “Engagement” dashboard.
Available table: events(user_id, event_name, event_time, event_date).
- Check data freshness: max(event_time) for yesterday and today.
- Compare partition counts vs the prior weekday.
- Decide: delay vs. definition change vs. real user behavior.
Hints
- Look at MAX(event_time) and row counts per day.
- If yesterday has fewer rows and a late max timestamp, it’s likely a delay.
- Check recent code or source changes only if freshness looks normal.
Expected outcome
- Identify whether the issue is partial data from a late pipeline.
- Propose re-run/backfill and add freshness alerting.
Checklist: before declaring the root cause
- Reproduced the metric from source data.
- Confirmed data freshness and load status.
- Scoped the first bad time and unaffected controls.
- Segmented to isolate where it fails vs works.
- Compared row counts and distinct keys across model layers.
- Checked recent changes (code, schedule, definitions, vendors).
- Produced a minimal query that demonstrates the break and the fix.
Common mistakes and self-checks
- Stopping at the first plausible cause: Self-check: can you prove it with a minimal query?
- Ignoring freshness: Self-check: show last load time and row counts for the day.
- Confusing symptoms and causes: Self-check: ask “why” until you reach a changeable mechanism.
- Not validating at the source layer: Self-check: compare raw vs mart values.
- Fixing without preventing: Self-check: what test or monitor ensures it won’t recur?
Practical projects
- Build a “Data Health” dashboard: freshness, row counts, key dimension cardinality, and anomaly flags for top metrics.
- Create an RCA playbook: a template with steps, queries, owners, and a prevention checklist.
- Join-safety toolkit: macros or snippets to check multiplicity, COUNT vs COUNT(DISTINCT), and pre-aggregations.
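The join-safety toolkit idea can start as one small helper. This sketch checks whether a join key is unique on one side before you trust a 1:1 join; sqlite3 stands in for the warehouse and the table is hypothetical.

```python
# Sketch of a join-multiplicity check: the largest number of rows sharing
# one key value. A result of 1 means the key is unique and join-safe.
import sqlite3

def max_multiplicity(con, table: str, key: str) -> int:
    """Return the maximum row count per key value in `table`."""
    row = con.execute(
        f"SELECT MAX(c) FROM (SELECT COUNT(*) AS c FROM {table} GROUP BY {key})"
    ).fetchone()
    return row[0] or 0

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE order_items (order_id INTEGER, sku TEXT);
INSERT INTO order_items VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")
print(max_multiplicity(con, "order_items", "order_id"))  # 2 -> would explode a 1:1 join
```

Run a check like this in tests before any join that assumes one row per key.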
Who this is for
- BI Analysts who own dashboards and stakeholder trust.
- Analytics Engineers and Data Analysts supporting reporting models.
Prerequisites
- Comfort with SQL (joins, aggregations, window functions).
- Basic understanding of your data pipeline layers.
- Knowledge of business metric definitions (e.g., order, DAU, conversion).
Learning path
- Start: Freshness checks → Row count continuity → Segment comparisons.
- Intermediate: Join safety, dimensional consistency, reproducible RCA notes.
- Advanced: Automated monitors, anomaly detection thresholds, SLA/alerting.
Next steps
- Apply the workflow to one live dashboard this week.
- Add at least two monitors (freshness and join multiplicity) to a critical model.
- Document one RCA with cause, fix, and prevention steps.
Mini challenge
A marketing dashboard shows “Leads” stable but “Qualified Leads” down 30% starting Monday. No outages reported. In 15 minutes, outline your RCA steps, which segments you’ll test, and one query to validate the definition.