Why this matters
As a Business Analyst, you are often asked, "Why did this metric move?" or "What caused this incident?" Quickly identifying strong root cause candidates focuses the team on the highest-value checks and experiments, saves time during incident response, and improves decision quality.
- Real tasks: narrowing causes for a conversion drop, diagnosing onboarding friction, explaining spikes in refunds, or delays in operational processes.
- Outcome: a clear shortlist of plausible causes with evidence to test next.
Concept explained simply
A root cause candidate is a plausible explanation for an observed problem, aligned with how the system actually works. It is not a symptom (what you see) or a solution (what you do). Your goal is to form a small, testable set of candidates that explain the effect and suggest what to check.
Mental model
- Effect → Mechanism → Cause: start with the effect, hypothesize the mechanism, then propose a cause that could create that mechanism.
- Tree of Whys: ask “Why?” repeatedly until you hit a changeable, specific factor that, if removed, stops the effect.
- Coverage and timing: a good candidate matches who/where/when of the effect and aligns with system changes or events.
A crisp method (5 steps)
1) Define the problem precisely
- Metric and size: what changed and by how much?
- Who/where/when: segments, platforms, geos, time window.
- Boundaries: what did not change? (Unchanged segments are powerful clues.)
2) Map the system and recent changes
- List key components, data flows, actors, and process steps touched by the metric.
- Note recent releases, config changes, vendor updates, traffic mix shifts, and policy changes.
3) Generate candidates across buckets
- Use fishbone buckets: People, Process, Technology, Data, Policy/External.
- For each bucket, ask: what changed that fits the mechanism?
4) Evaluate fit
- Temporal match: did it change right before the effect?
- Coverage match: does the candidate affect the segments that moved (and not others)?
- Plausible mechanism: can you explain how it drives the metric?
- Disconfirmers: is there evidence that contradicts it?
5) Define expected signals and quick checks
- Pattern: “If cause C is true, then we expect signals E1, E2 …”
- List the quickest checks: logs, segment splits, rollbacks, A/B guardrails, sampling real user sessions. (One way to keep candidates in a consistent shape is sketched after this list.)
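If you want each candidate written down the same way across the team, a minimal sketch like the one below can help. The class, field names, and example values are illustrative assumptions, not part of any specific tool.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Candidate:
    """One root cause candidate, phrased as Effect -> Mechanism -> Cause."""
    cause: str                              # the specific, changeable factor
    bucket: str                             # People / Process / Technology / Data / Policy-External
    mechanism: str                          # how the cause drives the metric
    expected_signals: List[str] = field(default_factory=list)  # "If C is true, we expect E1, E2, ..."
    quick_checks: List[str] = field(default_factory=list)      # fastest evidence to gather

# Illustrative entry based on the checkout example further below
sdk_timeout = Candidate(
    cause="Mobile payment SDK times out on slow regional routes",
    bucket="Technology",
    mechanism="Timeouts abort payment submission, so mobile sessions never convert",
    expected_signals=["higher timeout error rate on mobile", "mobile payment API p95 latency up"],
    quick_checks=["split conversion by platform and region", "compare p95 latency pre/post release"],
)
```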
Quick scoring idea
Score each candidate 0–2 on Temporal match, Coverage, Mechanism clarity, and Disconfirmers (reverse-scored, so fewer contradictions earn more points). Prioritize the highest totals.
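A spreadsheet works fine for this. If you prefer code, here is a minimal sketch of the same scoring idea; the candidate names and scores are illustrative.

```python
def score(temporal: int, coverage: int, mechanism: int, disconfirmers: int) -> int:
    """Sum of four 0-2 scores; disconfirmers are reverse-scored (2 = nothing contradicts the candidate)."""
    for value in (temporal, coverage, mechanism, disconfirmers):
        if not 0 <= value <= 2:
            raise ValueError("each dimension is scored 0-2")
    return temporal + coverage + mechanism + disconfirmers

# Example scores for two candidates from the checkout scenario below (numbers are illustrative)
totals = {
    "Mobile SDK timeout": score(temporal=2, coverage=2, mechanism=2, disconfirmers=2),  # 8
    "Traffic mix shift":  score(temporal=1, coverage=0, mechanism=1, disconfirmers=1),  # 3
}
for name, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(name, total)  # check the highest-scoring candidate first
```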
Worked examples
Example 1: Checkout rate dropped from 3.1% to 2.2% after a release
- Effect: 29% relative drop; mostly mobile; started 1 hour post-release; APAC hit harder; desktop flat.
- Candidates:
- Payment gateway timeout in the mobile SDK (Technology). Expected signals: higher timeout error rates and higher p95 API latency on mobile; APAC hit harder because of regional routing.
- Address validation bug for certain postal formats (Data/Technology). Expected signals: spikes in validation errors in APAC; retries; form abandonment at address step.
- Traffic mix shift to low-intent channel (External). Expected signals: referrer change; higher bounce early in funnel, not at payment step.
- Promo code rule conflict (Process/Data). Expected signals: error on promo apply; removal of discount lines.
Reasoning
Mobile-heavy impact and the APAC skew point to a client-side or regional infrastructure issue. Validation formats and mobile SDK timeouts both fit the timing and the coverage. A traffic-mix shift would affect the funnel earlier, and flat desktop weakens that candidate. Prioritize: 1) SDK timeout, 2) validation bug, 3) promo conflict.
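One quick way to test timing and coverage here is a pre/post split of conversion by platform and region. The sketch below is hedged: sessions.csv, its columns (platform, region, converted, ts), and the release timestamp are assumptions for illustration.

```python
import pandas as pd

# Hypothetical export with one row per checkout session
sessions = pd.read_csv("sessions.csv", parse_dates=["ts"])
release_ts = pd.Timestamp("2024-05-01 12:00")  # illustrative deploy time

# Label each session as pre- or post-release, then compare conversion by segment
sessions["period"] = (sessions["ts"] >= release_ts).map({False: "pre", True: "post"})
conv = (
    sessions.groupby(["platform", "region", "period"])["converted"]
    .mean()
    .unstack("period")
)
conv["delta"] = conv["post"] - conv["pre"]
print(conv.sort_values("delta").head(10))  # biggest drops first: is mobile/APAC doing the falling?
```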
Example 2: Surge in "can’t log in" tickets after MFA rollout
- Effect: Tickets +180%; Android > iOS; evenings peak; started day of rollout.
- Candidates:
- SMS provider delay/blocks (External/Technology). Signals: SMS delivery rate drop; longer delivery latency; retries.
- Rate limiting too strict (Technology/Policy). Signals: 429 errors on token endpoint; clustered on Android SDK versions.
- Session cache invalidation (Technology). Signals: frequent forced logouts; token mismatch errors after password reset.
Quick checks
- Compare error rates on the affected login code paths pre/post rollout.
- Segment by OS/SDK; check SMS vendor dashboard delivery/latency by country.
- Sample session logs for 429/401 bursts (a sketch of these checks follows the list).
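A hedged sketch of the last two checks, assuming a hypothetical auth_logs.csv export of token-endpoint calls with ts, os, sdk_version, and status columns:

```python
import pandas as pd

logs = pd.read_csv("auth_logs.csv", parse_dates=["ts"])

# Keep only auth failures and rate-limit responses
errors = logs[logs["status"].isin([401, 429])]

# Hourly error counts per OS: do bursts cluster on Android in the evenings?
hourly = (
    errors.set_index("ts")
    .groupby("os")
    .resample("1h")
    .size()
    .rename("error_count")
)
print(hourly.sort_values(ascending=False).head(10))

# Are 429s concentrated on specific Android SDK versions?
print(errors.loc[errors["status"] == 429, "sdk_version"]
      .value_counts(normalize=True)
      .head())
```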
Example 3: 20% of warehouse orders ship late
- Effect: Late shipments concentrated in Zone C; mornings unaffected; started after route optimization update.
- Candidates:
- Picker route change increased walking distance (Process). Signals: pick time per order up in Zone C; step counts up.
- Label printer failure in Zone C (Technology). Signals: reprint rate spike; queue backlog times.
- Carrier pickup time moved earlier (External/Policy). Signals: handoff deadline moved; late-day jobs pile up.
Reasoning
Zone C and post-update timing suggest process or local tech. Check pick-time distributions and printer errors first; then verify carrier schedule change.
Quick checks and heuristics
- Start with the change log: most issues follow a change (a sketch of this check follows the list).
- Slice the metric: if only Android moved, backend-only causes are less likely.
- Follow the user path: where do drop-offs cluster?
- Look for “didn’t change” segments to rule out candidates.
- Prefer candidates with a clear, testable mechanism over vague ones.
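For the change-log heuristic, a minimal sketch that pulls every change landed in the 24 hours before the effect started; the file name, columns, and timestamps are assumptions for illustration.

```python
import pandas as pd

# Hypothetical change log: one row per release, config change, vendor update, or policy change
changes = pd.read_csv("changes.csv", parse_dates=["ts"])
effect_start = pd.Timestamp("2024-05-01 13:00")  # when the metric started moving
window = pd.Timedelta(hours=24)

# Changes in the window just before the effect are the first candidates to inspect
recent = changes[(changes["ts"] <= effect_start) & (changes["ts"] >= effect_start - window)]
print(recent.sort_values("ts", ascending=False))
```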
Common mistakes and self-check
- Mistake: Jumping to solutions. Self-check: Did you list candidates and expected signals before proposing a fix?
- Mistake: Confusing symptom with cause. Self-check: If we remove this, does the problem stop?
- Mistake: Ignoring disconfirming evidence. Self-check: What would disconfirm your top candidate?
- Mistake: Overfitting to anecdotes. Self-check: Does the candidate explain all affected segments and exclude the unaffected ones?
- Mistake: Too many candidates with no ranking. Self-check: Have you prioritized by timing, coverage, and mechanism?
Exercises
Exercise 1: Trial-to-paid conversion drop after UI refresh
See details in the Exercises section below (Ex1). Produce 3–5 candidates, rank the top 3, and list 2–3 quick checks per candidate.
Exercise 2: Nightly ETL delay
See details in the Exercises section below (Ex2). Identify immediate checks, top candidates, and write a testable hypothesis for your #1.
- Checklist before you submit:
- Problem precisely defined (metric, who/where/when)
- At least 3 candidates across different buckets
- Ranking justified with timing and coverage
- Each top candidate has measurable expected signals
Mini challenge
Your marketing dashboard shows a sudden drop in attributed revenue from Paid Social, but overall site revenue is flat. Draft two root cause candidates and one testable hypothesis for your top pick. Use: “If C is true, we expect E1/E2.”
One possible approach
- Candidate A: UTM parsing broke for Facebook click IDs (Technology/Data). Expect: zero fbclid in logs; jump in "Direct" traffic; no change in checkout volume.
- Candidate B: Channel mapping rule changed (Process). Expect: spike in “Unassigned” channel; mapping commit in last 24h.
- Hypothesis: If UTM parsing broke, then fbclid presence drops >90% post-deploy and Paid Social revenue reappears as Direct (a sketch of this check follows).
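A hedged sketch of that hypothesis test, assuming a hypothetical hits.csv analytics export with ts, landing_url, channel, and revenue columns; the deploy timestamp is illustrative.

```python
import pandas as pd

hits = pd.read_csv("hits.csv", parse_dates=["ts"])
deploy_ts = pd.Timestamp("2024-06-10 09:00")  # illustrative deploy time

hits["period"] = (hits["ts"] >= deploy_ts).map({False: "pre", True: "post"})
hits["has_fbclid"] = hits["landing_url"].str.contains("fbclid=", na=False)

# E1: does fbclid presence drop sharply after the deploy?
print(hits.groupby("period")["has_fbclid"].mean())

# E2: does "Direct" revenue gain roughly what Paid Social lost?
print(hits.groupby(["period", "channel"])["revenue"].sum()
      .unstack("period"))
```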
Practical projects
- Create a “Root Cause Playbook” for one product area: include top metrics, system map, common failure modes, and go-to checks.
- Run a mock incident drill: pick a historical metric drop and timebox 60 minutes to identify, rank, and test candidates; document outcomes.
- Build a candidate scoring sheet: columns for Timing, Coverage, Mechanism, Disconfirmers. Use it on two recent issues and compare speed/accuracy.
Who this is for
- Business Analysts, Product Analysts, and Operations Analysts who explain metric movements and incidents.
Prerequisites
- Basic familiarity with product/process metrics and segmentation.
- Ability to read simple logs or dashboards.
- Comfort describing systems at a high level.
Learning path
- Before: Problem definition and metric literacy.
- Now: Identifying strong root cause candidates.
- Next: Testing hypotheses and validating fixes.
Next steps
- Do the exercises and take the Quick Test below.
- Apply the 5-step method to a live issue this week and log candidates, evidence, and results.