Why this matters
As a BI Analyst, you are the person people turn to when a metric looks wrong. Root Cause Analysis (RCA) takes you from “the number looks off” to a clear, verified answer plus a prevention plan. You’ll use RCA to:
- Explain sudden metric changes in dashboards (spikes, drops, flatlines).
- Uncover join duplication, stale pipelines, and broken business rules.
- Align definitions across teams to stop recurring issues.
- Protect trust in reports before leadership decisions are made.
Concept explained simply
RCA finds the underlying reason a data symptom exists. The symptom is what you see (e.g., revenue doubled). The root cause is the mechanism that produced it (e.g., a join change that doubled rows).
Core tools
- 5 Whys: ask “why?” repeatedly until you reach a fixable cause.
- Fishbone thinking: check categories like Data, Process, Tools, People, Timing, and Definitions.
- Change-first mindset: most issues follow a change (code, source, schedule, definition, upstream vendor).
- Compare-and-contrast: where does the issue appear vs. not appear? (time, segments, sources).
Mental model
Think of a “data factory” with 6S checkpoints:
- Signal: What exactly is wrong? Define the symptom precisely.
- Scope: When did it start, who/what is affected, and where does it not happen?
- Segment: Slice by time, source, country, product, environment.
- Source: Validate at the earliest reliable source (raw logs, source tables).
- SQL: Reproduce, isolate joins/filters/aggregations.
- Safeguard: Fix and prevent (tests, alerts, docs).
Step-by-step RCA workflow
- Verify the symptom: Recompute the metric from source data. Confirm data freshness (load times, max timestamps, row counts).
- Scope the impact: Identify first bad time, affected dashboards, and unaffected controls.
- Segment to isolate: Split by dimension (country, product, channel). Look for where it breaks/works.
- Trace upstream: Check model layers (staging → marts → dashboards). Compare counts and key metrics at each layer.
- Inspect changes: Review recent changes to code, definitions, pipeline schedules, and source systems.
- Prove the cause: Create a minimal, reproducible query demonstrating the break and the fix.
- Fix and prevent: Implement the correction, add tests and monitors, and document the decision.
Worked examples
Example 1: Revenue doubled overnight
Symptom: Revenue doubled on the dashboard.
Quick checks: Data is fresh; spike starts after a release.
Trace: In the mart, orders joined to order_items at the row level.
Root cause: Join created a row explosion; revenue summed per item instead of per order.
Fix: Aggregate order_items to one row per order_id before joining, or compute revenue with distinct-safe measures at order granularity.
Prevent: Add tests for join multiplicity and revenue continuity.
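The join-explosion scenario above can be reproduced in a few lines. This is a minimal sketch using Python’s built-in sqlite3 in place of the warehouse; the table names mirror the example, but the rows and revenue figures are invented.

```python
# Sketch of the row-explosion bug: joining orders to order_items at the
# item level repeats each order's revenue once per item.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, revenue REAL);
CREATE TABLE order_items (order_id INTEGER, sku TEXT);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
INSERT INTO order_items VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# Broken: order 1 has two items, so its revenue is summed twice.
broken = con.execute("""
    SELECT SUM(o.revenue)
    FROM orders o
    JOIN order_items i ON o.order_id = i.order_id
""").fetchone()[0]  # 250.0 instead of the true 150.0

# Fix: aggregate order_items to one row per order_id before joining.
fixed = con.execute("""
    SELECT SUM(o.revenue)
    FROM orders o
    JOIN (SELECT order_id, COUNT(*) AS n_items
          FROM order_items GROUP BY order_id) i
      ON o.order_id = i.order_id
""").fetchone()[0]  # 150.0

print(broken, fixed)
```

The same pre-aggregation pattern works in any warehouse SQL: collapse the many-side of the join to the grain of the measure before summing.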
Example 2: Daily active users (DAU) dropped 20% yesterday
Symptom: DAU down 20% for one day.
Quick checks: Max event_time is 16 hours old; ETL ran late.
Root cause: Pipeline delay caused partial day data.
Fix: Re-run the job or backfill the missing partition.
Prevent: Freshness alerts and dashboard banner when data is incomplete.
Example 3: Country shares shifted suddenly
Symptom: US share falls; “Unknown” rises.
Trace: Upstream geo mapping updated; some IPs now unclassified.
Root cause: Vendor mapping change increased “Unknown”.
Fix: Update dim_geo to the new mapping; add logic to reclassify or redistribute “Unknown” traffic.
Prevent: Monitor dimension cardinality and top values.
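The top-values monitor suggested under Prevent can be sketched as a simple share comparison. The country shares below are invented for illustration; in practice both distributions would come from a GROUP BY query.

```python
# Sketch of a dimension-shift monitor: flag dimension values whose share
# of traffic moved more than a tolerance versus a baseline period.
def shifted_values(baseline: dict, current: dict, tol: float = 0.05) -> list:
    """Return dimension values whose share changed by more than tol."""
    keys = set(baseline) | set(current)
    return sorted(k for k in keys
                  if abs(current.get(k, 0.0) - baseline.get(k, 0.0)) > tol)

baseline = {"US": 0.40, "DE": 0.20, "Unknown": 0.05}
current  = {"US": 0.28, "DE": 0.21, "Unknown": 0.16}
print(shifted_values(baseline, current))  # ['US', 'Unknown']
```

A shift of share from a named value into “Unknown”, as here, is the classic signature of an upstream mapping change.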
Example 4: Conversion rate flatlines at 0%
Symptom: Conversion 0% after deployment.
Trace: Filter changed from “purchase” to “Purchase” (case-sensitive).
Root cause: Case mismatch dropped all purchase events.
Fix: Normalize case in the filter logic (e.g., compare on LOWER(event_name)).
Prevent: Standardize enums and add tests for key event presence.
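The case-mismatch bug is easy to demonstrate. This sketch uses plain Python filters in place of the dashboard’s SQL WHERE clause; the event names are hypothetical.

```python
# Sketch of the case-sensitivity bug: an exact-match filter on the old
# lowercase value drops every "Purchase" event after the rename.
events = ["Purchase", "page_view", "Purchase", "signup"]

broken = [e for e in events if e == "purchase"]          # [] -> conversion 0%
fixed  = [e for e in events if e.lower() == "purchase"]  # both purchases kept

print(len(broken), len(fixed))  # 0 2
```

The SQL equivalent of the fix is filtering on LOWER(event_name) = 'purchase', paired with a test that key events are present every day.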
Diagnostic techniques and useful queries
Check data freshness
-- Replace table/column names for your stack
SELECT MAX(event_time) AS last_event_time, COUNT(*) AS rows_today
FROM fact_events
WHERE event_date = CURRENT_DATE;
Detect join duplication
-- Compare base vs joined row counts
SELECT COUNT(*) AS joined_rows, COUNT(DISTINCT o.order_id) AS distinct_orders
FROM orders o
LEFT JOIN order_items i ON o.order_id = i.order_id;
Find first bad day
SELECT dt, SUM(metric) AS m
FROM mart_daily
GROUP BY dt
ORDER BY dt;
-- Look for the first date where m deviates sharply
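That visual scan can be automated. Here is a minimal sketch that flags the first date whose value deviates sharply from the prior day; the daily series and the 50% threshold are invented for illustration.

```python
# Sketch of "find the first bad day": return the first date whose value
# changes by more than a threshold ratio versus the previous day.
def first_bad_day(series: list, threshold: float = 0.5):
    """series: [(date, value)] sorted by date."""
    for (prev_d, prev_v), (d, v) in zip(series, series[1:]):
        if prev_v and abs(v - prev_v) / prev_v > threshold:
            return d
    return None

daily = [("2024-05-01", 100), ("2024-05-02", 104),
         ("2024-05-03", 210), ("2024-05-04", 205)]
print(first_bad_day(daily))  # 2024-05-03 -- the day the metric roughly doubled
```

In SQL the same idea is a LAG() window function comparing each day to the previous one.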
Compare across segments
SELECT country, dt, SUM(metric) AS m
FROM mart_daily
WHERE dt BETWEEN CURRENT_DATE - 14 AND CURRENT_DATE -- date arithmetic varies by dialect
GROUP BY country, dt
ORDER BY country, dt;
Exercises
Do these to practice.
Exercise 1: Diagnose doubled order counts
Scenario: The “Sales Summary” dashboard shows order_count doubled week-over-week.
Available tables: orders(order_id, created_at, customer_id), order_items(order_id, sku, quantity), customers(customer_id, country).
- Write diagnostic SQL to compare order counts by day using: (a) COUNT(*), (b) COUNT(DISTINCT order_id).
- Test the join between orders and order_items. Check if row counts explode.
- State the most likely root cause and a safe fix.
Hints
- Start from orders only. Then add the join and compare counts.
- If counts double only after the join, it’s a multiplicity problem.
- Aggregate order_items to order_id before joining.
Expected outcome
- Evidence that COUNT(*) rises but COUNT(DISTINCT order_id) does not.
- Conclusion: join duplication inflated counts.
- Fix: aggregate first or use distinct-safe measures.
Exercise 2: Investigate a DAU drop
Scenario: DAU dropped by ~20% yesterday on the “Engagement” dashboard.
Available table: events(user_id, event_name, event_time, event_date).
- Check data freshness: max(event_time) for yesterday and today.
- Compare partition counts vs the prior weekday.
- Decide: delay vs. definition change vs. real user behavior.
Hints
- Look at MAX(event_time) and row counts per day.
- If yesterday has fewer rows and a late max timestamp, it’s likely a delay.
- Check recent code or source changes only if freshness looks normal.
Expected outcome
- Identify whether the issue is partial data from a late pipeline.
- Propose re-run/backfill and add freshness alerting.
Checklist: before declaring the root cause
- Reproduced the metric from source data.
- Confirmed data freshness and load status.
- Scoped the first bad time and unaffected controls.
- Segmented to isolate where it fails vs works.
- Compared row counts and distinct keys across model layers.
- Checked recent changes (code, schedule, definitions, vendors).
- Produced a minimal query that demonstrates the break and the fix.
Common mistakes and self-checks
- Stopping at the first plausible cause: Self-check: can you prove it with a minimal query?
- Ignoring freshness: Self-check: show last load time and row counts for the day.
- Confusing symptoms and causes: Self-check: ask “why” until you reach a changeable mechanism.
- Not validating at the source layer: Self-check: compare raw vs mart values.
- Fixing without preventing: Self-check: what test or monitor ensures it won’t recur?
Practical projects
- Build a “Data Health” dashboard: freshness, row counts, key dimension cardinality, and anomaly flags for top metrics.
- Create an RCA playbook: a template with steps, queries, owners, and a prevention checklist.
- Join-safety toolkit: macros or snippets to check multiplicity, COUNT vs COUNT(DISTINCT), and pre-aggregations.
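The join-safety toolkit idea can start as one small helper. This sketch checks whether a join key is unique on one side before you trust a 1:1 join; sqlite3 stands in for the warehouse and the table is hypothetical.

```python
# Sketch of a join-multiplicity check: the largest number of rows sharing
# one key value. A result of 1 means the key is unique and join-safe.
import sqlite3

def max_multiplicity(con, table: str, key: str) -> int:
    """Return the maximum row count per key value in `table`."""
    row = con.execute(
        f"SELECT MAX(c) FROM (SELECT COUNT(*) AS c FROM {table} GROUP BY {key})"
    ).fetchone()
    return row[0] or 0

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE order_items (order_id INTEGER, sku TEXT);
INSERT INTO order_items VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")
print(max_multiplicity(con, "order_items", "order_id"))  # 2 -> would explode a 1:1 join
```

Run a check like this in tests before any join that assumes one row per key.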
Who this is for
- BI Analysts who own dashboards and stakeholder trust.
- Analytics Engineers and Data Analysts supporting reporting models.
Prerequisites
- Comfort with SQL (joins, aggregations, window functions).
- Basic understanding of your data pipeline layers.
- Knowledge of business metric definitions (e.g., order, DAU, conversion).
Learning path
- Start: Freshness checks → Row count continuity → Segment comparisons.
- Intermediate: Join safety, dimensional consistency, reproducible RCA notes.
- Advanced: Automated monitors, anomaly detection thresholds, SLA/alerting.
Next steps
- Apply the workflow to one live dashboard this week.
- Add at least two monitors (freshness and join multiplicity) to a critical model.
- Document one RCA with cause, fix, and prevention steps.
Mini challenge
A marketing dashboard shows “Leads” stable but “Qualified Leads” down 30% starting Monday. No outages reported. In 15 minutes, outline your RCA steps, which segments you’ll test, and one query to validate the definition.