
Root Cause Analysis For Data Issues

Learn Root Cause Analysis For Data Issues for free with explanations, exercises, and a quick test (for BI Analysts).

Published: December 22, 2025 | Updated: December 22, 2025

Why this matters

As a BI Analyst, people rely on you when metrics look wrong. Root Cause Analysis (RCA) helps you go from “the number looks off” to a clear, verified answer plus a prevention plan. You’ll use RCA to:

  • Explain sudden metric changes in dashboards (spikes, drops, flatlines).
  • Uncover join duplication, stale pipelines, and broken business rules.
  • Align definitions across teams to stop recurring issues.
  • Protect trust in reports before leadership decisions are made.

Concept explained simply

RCA finds the underlying reason a data symptom exists. The symptom is what you see (e.g., revenue doubled). The root cause is the mechanism that produced it (e.g., a join change that doubled rows).

Core tools (open)
  • 5 Whys: ask “why?” repeatedly until you reach a fixable cause.
  • Fishbone thinking: check categories like Data, Process, Tools, People, Timing, and Definitions.
  • Change-first mindset: most issues follow a change (code, source, schedule, definition, upstream vendor).
  • Compare-and-contrast: where does the issue appear vs. not appear? (time, segments, sources).

Mental model

Think of a “data factory” with 6S checkpoints:

  • Signal: What exactly is wrong? Define the symptom precisely.
  • Scope: When did it start, who/what is affected, and where does it not happen?
  • Segment: Slice by time, source, country, product, environment.
  • Source: Validate at the earliest reliable source (raw logs, source tables).
  • SQL: Reproduce, isolate joins/filters/aggregations.
  • Safeguard: Fix and prevent (tests, alerts, docs).

Step-by-step RCA workflow

  1. Verify the symptom: Recompute the metric from source data. Confirm data freshness (load times, max timestamps, row counts).
  2. Scope the impact: Identify first bad time, affected dashboards, and unaffected controls.
  3. Segment to isolate: Split by dimension (country, product, channel). Look for where it breaks/works.
  4. Trace upstream: Check model layers (staging → marts → dashboards). Compare counts and key metrics at each layer (see the sketch after this list).
  5. Inspect changes: Review recent changes to code, definitions, pipeline schedules, and source systems.
  6. Prove the cause: Create a minimal, reproducible query demonstrating the break and the fix.
  7. Fix and prevent: Implement the correction, add tests and monitors, and document the decision.
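
A minimal sketch of step 4, comparing row counts and a key metric between two layers; stg_orders, mart_orders, order_date, and revenue are placeholder names for your own staging and mart models.

-- Compare counts and a key metric across layers (placeholder names)
SELECT 'staging' AS layer, COUNT(*) AS row_count, SUM(revenue) AS revenue
FROM stg_orders
WHERE order_date = CURRENT_DATE - 1
UNION ALL
SELECT 'mart' AS layer, COUNT(*) AS row_count, SUM(revenue) AS revenue
FROM mart_orders
WHERE order_date = CURRENT_DATE - 1;
-- A divergence between layers points to the transformation step where the break happens.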

Worked examples

Example 1: Revenue doubled overnight

Symptom: Revenue doubled on the dashboard.

Quick checks: Data is fresh; spike starts after a release.

Trace: In the mart, orders is joined to order_items at the item level, so each order appears once per item.

Root cause: Join created a row explosion; revenue summed per item instead of per order.

Fix: Aggregate order_items to order_id before joining, or compute measures at order granularity (COUNT(DISTINCT order_id), order-level SUM). A sketch follows below.

Prevent: Add tests for join multiplicity and revenue continuity.
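
A minimal sketch of the pre-aggregation fix, assuming illustrative columns orders(order_id, order_date) and order_items(order_id, quantity, price):

-- Aggregate order_items to one row per order_id before joining (column names are illustrative)
WITH item_totals AS (
  SELECT order_id, SUM(quantity * price) AS order_revenue
  FROM order_items
  GROUP BY order_id
)
SELECT o.order_date,
       COUNT(DISTINCT o.order_id) AS orders,
       SUM(i.order_revenue) AS revenue
FROM orders o
LEFT JOIN item_totals i ON o.order_id = i.order_id
GROUP BY o.order_date;
-- The join is now at most 1:1, so revenue is counted once per order.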

Example 2: Daily active users (DAU) dropped 20% yesterday

Symptom: DAU down 20% for one day.

Quick checks: Max event_time is 16 hours old; ETL ran late.

Root cause: A pipeline delay left only a partial day of data loaded (see the detection sketch below).

Fix: Re-run the job or backfill the missing partition.

Prevent: Freshness alerts and dashboard banner when data is incomplete.
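
A minimal sketch for spotting a partial day, assuming an events(user_id, event_name, event_time, event_date) table like the one used in the exercises:

-- Compare yesterday against the same weekday last week
SELECT event_date,
       COUNT(DISTINCT user_id) AS dau,
       COUNT(*) AS event_rows,
       MAX(event_time) AS last_event_time
FROM events
WHERE event_date IN (CURRENT_DATE - 1, CURRENT_DATE - 8)
GROUP BY event_date;
-- Fewer rows and an early last_event_time for yesterday point to a late or partial load.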

Example 3: Country shares shifted suddenly

Symptom: US share falls; “Unknown” rises.

Trace: Upstream geo mapping updated; some IPs now unclassified.

Root cause: Vendor mapping change increased “Unknown”.

Fix: Update dim_geo to the new mapping, then reclassify the affected rows or adjust the logic that redistributes “Unknown” traffic.

Prevent: Monitor dimension cardinality and top values.
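
A minimal sketch of that monitor, assuming a fact_events table with event_date and country columns (adjust names and date syntax for your warehouse):

-- Daily share per country; a jump in 'Unknown' flags an upstream mapping change
SELECT event_date,
       country,
       COUNT(*) AS row_count,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY event_date) AS pct_of_day
FROM fact_events
WHERE event_date BETWEEN CURRENT_DATE - 14 AND CURRENT_DATE
GROUP BY event_date, country
ORDER BY event_date, pct_of_day DESC;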

Example 4: Conversion rate flatlines at 0%

Symptom: Conversion 0% after deployment.

Trace: Filter changed from ‘purchase’ to ‘Purchase’ (case-sensitive).

Root cause: Case mismatch dropped all purchase events.

Fix: Normalize case in the filter logic (e.g., compare on LOWER(event_name)); see the sketch below.

Prevent: Standardize enums and add tests for key event presence.
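
A minimal sketch of the case-insensitive fix and a presence check, assuming the same events table:

-- Normalize case when filtering on event names
SELECT event_date, COUNT(*) AS purchases
FROM events
WHERE LOWER(event_name) = 'purchase'
GROUP BY event_date;

-- Which spellings actually arrive from the source?
SELECT event_name, COUNT(*) AS row_count
FROM events
GROUP BY event_name
ORDER BY row_count DESC;
-- Look for variants such as 'purchase' vs 'Purchase'.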

Diagnostic techniques and useful queries

Check data freshness

-- Replace table/column names for your stack
SELECT MAX(event_time) AS last_event_time, COUNT(*) AS rows_today
FROM fact_events
WHERE event_date = CURRENT_DATE;

Detect join duplication

-- Compare joined row counts vs distinct base orders
SELECT COUNT(*) AS order_rows, COUNT(DISTINCT o.order_id) AS distinct_orders
FROM orders o
LEFT JOIN order_items i ON o.order_id = i.order_id;

Find first bad day

SELECT dt, SUM(metric) AS m
FROM mart_daily
GROUP BY dt
ORDER BY dt;
-- Look for the first date where m deviates sharply

Compare across segments

-- Date arithmetic syntax varies by warehouse; adjust as needed
SELECT country, dt, SUM(metric) AS m
FROM mart_daily
WHERE dt BETWEEN CURRENT_DATE - 14 AND CURRENT_DATE
GROUP BY country, dt
ORDER BY country, dt;

Exercises

Work through these to practice the full workflow.

Exercise 1: Diagnose doubled order counts

Scenario: The “Sales Summary” dashboard shows order_count doubled week-over-week.

Available tables: orders(order_id, created_at, customer_id), order_items(order_id, sku, quantity), customers(customer_id, country).

  • Write diagnostic SQL to compare order counts by day using: (a) COUNT(*), (b) COUNT(DISTINCT order_id). A starting sketch follows after this exercise.
  • Test the join between orders and order_items. Check if row counts explode.
  • State the most likely root cause and a safe fix.
Hints
  • Start from orders only. Then add the join and compare counts.
  • If counts double only after the join, it’s a multiplicity problem.
  • Aggregate order_items to order_id before joining.
Expected outcome
  • Evidence that COUNT(*) rises but COUNT(DISTINCT order_id) does not.
  • Conclusion: join duplication inflated counts.
  • Fix: aggregate first or use distinct-safe measures.
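
One possible starting point for the diagnostic SQL, using the table names from the scenario (a sketch, not the only valid answer):

-- Daily counts from orders alone
SELECT CAST(o.created_at AS DATE) AS order_day,
       COUNT(*) AS row_count,
       COUNT(DISTINCT o.order_id) AS distinct_orders
FROM orders o
GROUP BY CAST(o.created_at AS DATE);

-- Same counts after the join; if row_count jumps but distinct_orders does not, the join multiplies rows
SELECT CAST(o.created_at AS DATE) AS order_day,
       COUNT(*) AS row_count,
       COUNT(DISTINCT o.order_id) AS distinct_orders
FROM orders o
LEFT JOIN order_items i ON o.order_id = i.order_id
GROUP BY CAST(o.created_at AS DATE);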

Exercise 2: Investigate a DAU drop

Scenario: DAU dropped by ~20% yesterday on the “Engagement” dashboard.

Available table: events(user_id, event_name, event_time, event_date).

  • Check data freshness: MAX(event_time) for yesterday and today. A starting sketch follows after this exercise.
  • Compare partition counts vs the prior weekday.
  • Decide: delay vs. definition change vs. real user behavior.
Hints
  • Look at MAX(event_time) and row counts per day.
  • If yesterday has fewer rows and late max timestamp, it’s likely a delay.
  • Check recent code or source changes only if freshness looks normal.
Expected outcome
  • Identify whether the issue is partial data from a late pipeline.
  • Propose re-run/backfill and add freshness alerting.
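
One possible freshness query to start from, using the events table from the scenario:

-- Row counts, distinct users, and latest event per day for the last two weeks
SELECT event_date,
       COUNT(*) AS row_count,
       COUNT(DISTINCT user_id) AS dau,
       MAX(event_time) AS last_event_time
FROM events
WHERE event_date >= CURRENT_DATE - 14
GROUP BY event_date
ORDER BY event_date;
-- Compare yesterday with the same weekday in prior weeks before attributing the drop to user behavior.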

Checklist: before declaring the root cause

  • Reproduced the metric from source data.
  • Confirmed data freshness and load status.
  • Scoped the first bad time and unaffected controls.
  • Segmented to isolate where it fails vs works.
  • Compared row counts and distinct keys across model layers.
  • Checked recent changes (code, schedule, definitions, vendors).
  • Produced a minimal query that demonstrates the break and the fix.

Common mistakes and self-checks

  • Stopping at the first plausible cause: Self-check: can you prove it with a minimal query?
  • Ignoring freshness: Self-check: show last load time and row counts for the day.
  • Confusing symptoms and causes: Self-check: ask “why” until you reach a changeable mechanism.
  • Not validating at the source layer: Self-check: compare raw vs mart values.
  • Fixing without preventing: Self-check: what test or monitor ensures it won’t recur?

Practical projects

  • Build a “Data Health” dashboard: freshness, row counts, key dimension cardinality, and anomaly flags for top metrics.
  • Create an RCA playbook: a template with steps, queries, owners, and a prevention checklist.
  • Join-safety toolkit: macros or snippets to check multiplicity, COUNT vs COUNT(DISTINCT), and pre-aggregations.

Who this is for

  • BI Analysts who own dashboards and stakeholder trust.
  • Analytics Engineers and Data Analysts supporting reporting models.

Prerequisites

  • Comfort with SQL (joins, aggregations, window functions).
  • Basic understanding of your data pipeline layers.
  • Knowledge of business metric definitions (e.g., order, DAU, conversion).

Learning path

  • Start: Freshness checks → Row count continuity → Segment comparisons.
  • Intermediate: Join safety, dimensional consistency, reproducible RCA notes.
  • Advanced: Automated monitors, anomaly detection thresholds, SLA/alerting.

Next steps

  • Apply the workflow to one live dashboard this week.
  • Add at least two monitors (freshness and join multiplicity) to a critical model.
  • Document one RCA with cause, fix, and prevention steps.

Mini challenge

A marketing dashboard shows “Leads” stable, but “Qualified Leads” drop 30% starting Monday. No outages reported. In 15 minutes, outline your RCA steps, which segments you’ll test, and one query to validate the definition.

Quick Test

Take the quick test to check your understanding: 8 questions, 70% or higher to pass. Anyone can take it for free; progress is saved only for logged-in users.


