
Identifying The Root Cause Candidate

Learn how to identify root cause candidates for free, with explanations, exercises, and a quick test (for Business Analysts).

Published: December 20, 2025 | Updated: December 20, 2025

Why this matters

As a Business Analyst, you are often asked, "Why did this metric move?" or "What caused this incident?" Identifying strong root cause candidates quickly focuses the team on the highest-value checks and experiments, saves time in incident response, and improves decision quality.

  • Real tasks: narrowing causes for a conversion drop, diagnosing onboarding friction, explaining spikes in refunds, or tracing delays in operational processes.
  • Outcome: a clear shortlist of plausible causes with evidence to test next.

Concept explained simply

A root cause candidate is a plausible explanation for an observed problem, aligned with how the system actually works. It is not a symptom (what you see) or a solution (what you do). Your goal is to form a small, testable set of candidates that explain the effect and suggest what to check.

Mental model

  • Effect → Mechanism → Cause: start with the effect, hypothesize the mechanism, then propose a cause that could create that mechanism.
  • Tree of Whys: ask “Why?” repeatedly until you hit a changeable, specific factor that, if removed, stops the effect.
  • Coverage and timing: a good candidate matches who/where/when of the effect and aligns with system changes or events.

A crisp method (5 steps)

Step 1. Define the problem precisely
  • Metric and size: what changed and by how much?
  • Who/where/when: segments, platforms, geos, time window.
  • Boundaries: what did not change? (unchanged segments are powerful clues)
Step 2. Map the system
  • List key components, data flows, actors, and process steps touched by the metric.
  • Note recent releases, config changes, vendor updates, traffic mix shifts, policy changes.
Step 3. Generate candidates with structure
  • Use fishbone buckets: People, Process, Technology, Data, Policy/External.
  • For each bucket, ask: what changed that fits the mechanism?
Step 4. Prioritize quickly (filters)
  • Temporal match: did it change right before the effect?
  • Coverage match: does the candidate affect the segments that moved (and not others)?
  • Plausible mechanism: can you explain how it drives the metric?
  • Disconfirmers: is there evidence that contradicts it?
Step 5. Form testable hypotheses
  • Pattern: “If cause C is true, then we expect signals E1, E2 …”
  • List the quickest checks: logs, segment splits, rollbacks, A/B guardrails, sampling real user sessions.
Quick scoring idea

Score each candidate 0–2 on Temporal match, Coverage, Mechanism clarity, and Disconfirmers (reverse-scored, so strong contradicting evidence lowers the total). Prioritize the highest total.
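
If you track candidates in a notebook or spreadsheet, the scoring can be as simple as the sketch below. It is a minimal Python illustration, not a prescribed tool; the field names and the 0–2 rubric are assumptions you should adapt to your context.

  # Minimal sketch of the 0-2 scoring idea; names and rubric are illustrative.
  from dataclasses import dataclass

  @dataclass
  class Candidate:
      name: str
      temporal: int       # 0-2: did it change right before the effect?
      coverage: int       # 0-2: does it hit the segments that moved (and spare the rest)?
      mechanism: int      # 0-2: how clearly does it explain the metric move?
      disconfirmers: int  # 0-2: strength of evidence against it (reverse-scored below)

      def total(self) -> int:
          # Strong contradicting evidence lowers the total.
          return self.temporal + self.coverage + self.mechanism + (2 - self.disconfirmers)

  def rank(candidates):
      # Highest total first: check those candidates first.
      return sorted(candidates, key=lambda c: c.total(), reverse=True)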

Worked examples

Example 1: Checkout rate dropped from 3.1% to 2.2% after a release

  • Effect: 29% relative drop; mostly mobile; started 1 hour post-release; APAC hit harder; desktop flat.
  • Candidates:
    • Payment gateway timeout in the mobile SDK (Technology). Expected signals: higher timeout error rates and longer API p95 latency on mobile; APAC hit harder due to regional routing.
    • Address validation bug for certain postal formats (Data/Technology). Expected signals: spikes in validation errors in APAC; retries; form abandonment at address step.
    • Traffic mix shift to low-intent channel (External). Expected signals: referrer change; higher bounce early in funnel, not at payment step.
    • Promo code rule conflict (Process/Data). Expected signals: error on promo apply; removal of discount lines.
Reasoning

Mobile-heavy impact and APAC skew point to client-side or regional infrastructure. Validation formats and mobile SDK timeouts both fit timing and coverage. Traffic-mix would affect earlier funnel; desktop flat weakens that. Prioritize: 1) SDK timeout, 2) validation bug, 3) promo conflict.
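
Scored with the sketch above (the numbers are judgment calls for illustration, not measurements), the totals reproduce this ordering:

  # Illustrative scores for Example 1, reusing Candidate and rank from the earlier sketch.
  shortlist = rank([
      Candidate("Mobile SDK payment timeout", temporal=2, coverage=2, mechanism=2, disconfirmers=0),
      Candidate("Address validation bug (APAC formats)", temporal=2, coverage=1, mechanism=2, disconfirmers=0),
      Candidate("Promo code rule conflict", temporal=2, coverage=0, mechanism=1, disconfirmers=1),
      Candidate("Traffic mix shift to low-intent channel", temporal=1, coverage=0, mechanism=1, disconfirmers=2),
  ])
  for c in shortlist:
      print(c.total(), c.name)  # 8, 7, 4, 2 -> SDK timeout, validation bug, promo conflict, traffic mix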

Example 2: Surge in "can’t log in" tickets after MFA rollout

  • Effect: Tickets +180%; Android > iOS; evenings peak; started day of rollout.
  • Candidates:
    • SMS provider delay/blocks (External/Technology). Signals: SMS delivery rate drop; longer delivery latency; retries.
    • Rate limiting too strict (Technology/Policy). Signals: 429 errors on token endpoint; clustered on Android SDK versions.
    • Session cache invalidation (Technology). Signals: frequent forced logouts; token mismatch errors after password reset.
Quick checks
  • Compare code path errors pre/post rollout.
  • Segment by OS/SDK; check SMS vendor dashboard delivery/latency by country.
  • Sample session logs for 429/401 bursts.
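
If the auth logs can be exported to a flat file, the OS/SDK and burst checks might look like the sketch below; the file name, column names (timestamp, status, os, sdk_version), and the rollout date are assumptions, not known fields.

  # Rough sketch of the Example 2 checks; schema and dates are placeholders.
  import pandas as pd

  logs = pd.read_csv("auth_logs.csv", parse_dates=["timestamp"])
  post = logs[logs["timestamp"] >= "2025-12-01"]   # MFA rollout date (placeholder)
  errors = post[post["status"].isin([401, 429])]

  # Clusters by OS and SDK version point toward rate limiting on specific Android builds.
  print(errors.groupby(["os", "sdk_version"]).size().sort_values(ascending=False).head(10))

  # Hourly error counts: evening bursts would match the ticket pattern.
  print(errors.set_index("timestamp").resample("1h").size().sort_values(ascending=False).head())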

Example 3: 20% of warehouse orders ship late

  • Effect: Late shipments concentrated in Zone C; mornings unaffected; started after route optimization update.
  • Candidates:
    • Picker route change increased walking distance (Process). Signals: pick time per order up in Zone C; step counts up.
    • Label printer failure in Zone C (Technology). Signals: reprint rate spike; queue backlog times.
    • Carrier pickup time moved earlier (External/Policy). Signals: handoff deadline moved; late-day jobs pile up.
Reasoning

Zone C and post-update timing suggest process or local tech. Check pick-time distributions and printer errors first; then verify carrier schedule change.

Quick checks and heuristics

  • Start with the change log: most issues follow a change.
  • Slice the metric: if only Android moved, backend-only causes are less likely (see the sketch after this list).
  • Follow the user path: where do drop-offs cluster?
  • Look for “didn’t change” segments to rule out candidates.
  • Prefer candidates with a clear, testable mechanism over vague ones.
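
To make the "slice the metric" heuristic concrete, here is a minimal pandas sketch; the file name and columns (platform, period with values "before"/"after", converted as 0/1) are assumptions about your own event export.

  # Conversion rate by platform, before vs. after the change window (illustrative schema).
  import pandas as pd

  events = pd.read_csv("checkout_events.csv")  # hypothetical export
  by_segment = (
      events.groupby(["platform", "period"])["converted"]
            .mean()
            .unstack("period")
  )
  by_segment["delta"] = by_segment["after"] - by_segment["before"]
  print(by_segment.sort_values("delta"))
  # Segments with near-zero delta are the "didn't change" clues that rule candidates out.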

Common mistakes and self-check

  • Mistake: Jumping to solutions. Self-check: Did you list candidates and expected signals before proposing a fix?
  • Mistake: Confusing symptom with cause. Self-check: If we remove this, does the problem stop?
  • Mistake: Ignoring contradicting evidence. Self-check: What would disconfirm your top candidate?
  • Mistake: Overfitting to anecdotes. Self-check: Does the candidate explain all affected segments and exclude the unaffected ones?
  • Mistake: Too many candidates with no ranking. Self-check: Have you prioritized by timing, coverage, and mechanism?

Exercises

These are available to everyone. Sign in to save your progress.

Exercise 1: Trial-to-paid conversion drop after UI refresh

See details in the Exercises section below (Ex1). Produce 3–5 candidates, rank the top 3, and list 2–3 quick checks per candidate.

Exercise 2: Nightly ETL delay

See details in the Exercises section below (Ex2). Identify immediate checks, top candidates, and write a testable hypothesis for your #1.

  • Checklist before you submit:
    • Problem precisely defined (metric, who/where/when)
    • At least 3 candidates across different buckets
    • Ranking justified with timing and coverage
    • Each top candidate has measurable expected signals

Mini challenge

Your marketing dashboard shows a sudden drop in attributed revenue from Paid Social, but overall site revenue is flat. Draft two root cause candidates and one testable hypothesis for your top pick. Use: “If C is true, we expect E1/E2.”

One possible approach
  • Candidate A: UTM parsing broke for Facebook click IDs (Technology/Data). Expect: zero fbclid in logs; jump in "Direct" traffic; no change in checkout volume.
  • Candidate B: Channel mapping rule changed (Process). Expect: spike in “Unassigned” channel; mapping commit in last 24h.
  • Hypothesis: If UTM parsing broke, then fbclid presence drops >90% post-deploy and Paid Social revenue reappears as Direct.
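
One way to run the first check for Candidate A, assuming you can export post-deploy landing-page URLs to a text file (the file name is hypothetical):

  # Share of landing URLs carrying an fbclid; compare pre- vs. post-deploy exports.
  from urllib.parse import urlparse, parse_qs

  def has_fbclid(url: str) -> bool:
      return "fbclid" in parse_qs(urlparse(url).query)

  with open("landing_urls_post_deploy.txt") as f:
      urls = [line.strip() for line in f if line.strip()]

  share = sum(has_fbclid(u) for u in urls) / len(urls)
  print(f"fbclid present on {share:.1%} of post-deploy landing URLs")
  # A >90% drop versus the pre-deploy share supports the UTM/click-ID parsing hypothesis.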

Practical projects

  • Create a “Root Cause Playbook” for one product area: include top metrics, system map, common failure modes, and go-to checks.
  • Run a mock incident drill: pick a historical metric drop and timebox 60 minutes to identify, rank, and test candidates; document outcomes.
  • Build a candidate scoring sheet: columns for Timing, Coverage, Mechanism, Disconfirmers. Use it on two recent issues and compare speed/accuracy.

Who this is for

  • Business Analysts, Product Analysts, and Operations Analysts who explain metric movements and incidents.

Prerequisites

  • Basic familiarity with product/process metrics and segmentation.
  • Ability to read simple logs or dashboards.
  • Comfort describing systems at a high level.

Learning path

  • Before: Problem definition and metric literacy.
  • Now: Identifying strong root cause candidates.
  • Next: Testing hypotheses and validating fixes.

Next steps

  • Do the exercises and take the Quick Test below. Anyone can take it; sign in to save progress.
  • Apply the 5-step method to a live issue this week and log candidates, evidence, and results.

Practice Exercises

2 exercises to complete

Instructions

Scenario: Yesterday, a UI refresh went live. Trial-to-paid conversion fell from 21% to 16% (a 24% relative drop). The drop is concentrated on mobile web, strongest on iOS Safari; desktop is flat. It started within an hour of the release and is larger for users with promo codes. Your task:

  • Define the problem precisely (metric, who/where/when/size; note what didn’t change).
  • List 3–5 root cause candidates across different buckets (People, Process, Technology, Data, Policy/External).
  • Rank your top 3 candidates with brief rationale using timing, coverage, and mechanism.
  • For each of the top 3, list 2–3 quick checks or expected signals.
Expected Output
A short problem definition; a list of 3–5 plausible candidates; a ranked top-3 with rationale; 2–3 measurable checks per top candidate.

Identifying The Root Cause Candidate — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

