Who this is for
Data Engineers, Analytics Engineers, and Platform Engineers who need to quickly find and fix the real cause of a data incident, then prevent it from coming back.
Prerequisites
- Basic SQL: filtering, aggregations, joins.
- Familiarity with your data pipelines (scheduler/orchestrator, logs, jobs, storage layout).
- Understanding of data quality dimensions: freshness, completeness, validity, uniqueness, accuracy, consistency.
Why this matters
Real tasks you will face:
- Yesterday's revenue on the dashboard dropped by 35% because a partition didn't arrive. You must find the cause and backfill.
- Duplicate orders in fact tables inflate metrics. You must trace where duplication was introduced and add a guard.
- An upstream team renamed a field and your job broke. You must pinpoint the change, patch transforms, and prevent silent schema drifts.
Root Cause Analysis (RCA) helps you stop the bleeding quickly, identify the true mechanism behind the failure, and implement durable fixes.
Concept explained simply
RCA is a structured way to answer three questions: What exactly is broken? Why did it happen? How do we prevent it from happening again?
Mental model
Imagine a leaking pipe with many joints. You see water on the floor (symptom). You wipe it (containment), but unless you find the cracked joint (root cause) and replace it (corrective + preventive action), the leak will recur.
Step-by-step RCA recipe
1. Define the problem. Describe impact, scope, time window, and symptoms. Example: "fact_orders shows +18% duplicates for 2026-01-07; dashboards overstate revenue by ~18%."
2. Contain the damage. Pause downstream loads/refreshes, quarantine bad partitions, add temporary filters to dashboards.
3. Collect evidence (a SQL sketch for the data comparisons follows this recipe):
- Quality signals: freshness/row-count/uniqueness/validity checks.
- Execution proofs: job run logs, task durations, retry counts.
- Data comparisons: diffs between the last-good and current runs, partition-level stats.
- Lineage/context: which upstream tables, topics, or files feed this dataset.
- Change surface: recent code/config/schema/infra changes (commits, deploys, schema registry).
4. Trace upstream. Find the first upstream node that looks wrong; continue until you reach the last-known-good handoff.
5. Form hypotheses. Write down candidate causes; test them with data slices, targeted queries, and log review. Prefer fast, disconfirming tests.
6. Name the root cause. Choose the specific mechanism that explains the failure chain, not just the symptom (e.g., "timezone misconfig in ingestion added a 24h offset" vs "late data").
7. Fix at three levels:
- Containment: stop making it worse.
- Corrective: repair the data (backfill, dedupe, reprocess).
- Preventive: add tests, contracts, alerts, retries, idempotency.
8. Verify. Re-run checks, compare metrics to baseline, and monitor for a few cycles.
9. Document. Capture the timeline, evidence, root cause, impact, and actions. Make it searchable.
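To make step 3 concrete, here is a minimal evidence-gathering sketch that compares partition-level stats between a suspect day and a last-known-good day. It assumes a fact_orders table partitioned by a dt date column with an order_id business key; adapt the names and dates to your warehouse.

```sql
-- Compare a suspect partition against a last-known-good one.
SELECT
  dt,
  COUNT(*)                            AS row_count,
  COUNT(*) - COUNT(order_id)          AS null_order_ids,          -- completeness signal
  COUNT(*) - COUNT(DISTINCT order_id) AS rows_beyond_distinct_ids -- duplicates and/or NULL keys
FROM fact_orders
WHERE dt IN (DATE '2026-01-06', DATE '2026-01-07')  -- last-good day vs suspect day
GROUP BY dt
ORDER BY dt;
```

A sharp divergence in any column points you at the dimension to investigate first (volume, completeness, or uniqueness).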
Methods: 5 Whys, Fishbone, Fault Tree (when to use each)
- 5 Whys: quickest for narrow issues. Stop when a "why" yields a systemic cause you can act on.
- Fishbone (Ishikawa): good for multi-factor problems (People, Process, Tools, Data, Environment).
- Fault Tree: top-down logic for complex failures; useful for platform-wide reliability analysis.
Signals and typical tools
- Freshness/latency monitors to detect late partitions or stuck streams.
- Row-count and null-rate profiles to catch completeness shifts.
- Uniqueness/dedup checks on business keys to prevent double-counting.
- Schema drift detectors for added/removed/renamed fields.
- Orchestrator and ingestion logs for failures, retries, and durations.
- Lineage views to map upstream/downstream blast radius.
- Version control history for code/config changes around incident time.
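Two of these signals are cheap to compute ad hoc while you investigate. The sketch below assumes Postgres-style date arithmetic, an events_raw table partitioned by a dt date column, and a fact_transactions table with an event_id business key; all names are illustrative.

```sql
-- Freshness: how many days since the newest partition landed?
SELECT CURRENT_DATE - MAX(dt) AS days_since_latest_partition
FROM events_raw;

-- Uniqueness: share of rows that do not map to a distinct non-null business key for one day.
SELECT
  1.0 - COUNT(DISTINCT event_id) * 1.0 / COUNT(*) AS duplicate_or_null_key_rate
FROM fact_transactions
WHERE dt = DATE '2026-01-06';
```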
Worked examples
Example 1 — Late partitions due to timezone
- Problem: fact_orders missing dt=2026-01-07; dashboard shows 0 revenue for that day.
- Evidence: Freshness check fails for dt=2026-01-07. Ingestion logs show partitions stamped in UTC-5 instead of UTC.
- Trace: Upstream file names changed from dt=YYYY-MM-DDZ to dt=YYYY-MM-DD-0500 after an agent update.
- Root cause: Ingestion agent timezone default changed; partition parser didn’t normalize to UTC.
- Fixes:
- Containment: Pause daily model; hide date from dashboard.
- Corrective: Reprocess raw files with UTC normalization; backfill partitions.
- Preventive: Add parser unit test + freshness alert on partition gaps; pin agent timezone config.
- Verify: Freshness green; revenue matches historical range.
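A fast disconfirming test for this kind of incident is to check where the "missing" day's events actually landed. The sketch assumes a raw_orders staging table that keeps the original event timestamp next to the landed dt partition; both names are illustrative.

```sql
-- If rows whose event timestamp falls on 2026-01-07 landed under a neighboring
-- partition, a timezone offset (not late data) explains the gap.
SELECT
  dt                     AS landed_partition,
  CAST(event_ts AS DATE) AS event_date_from_timestamp,
  COUNT(*)               AS n_rows
FROM raw_orders
WHERE dt BETWEEN DATE '2026-01-06' AND DATE '2026-01-08'
GROUP BY dt, CAST(event_ts AS DATE)
ORDER BY 1, 2;
```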
Example 2 — Duplicate transactions from null event_id
- Problem: +18% duplicate rows in fact_transactions on 2026-01-06.
- Evidence: Uniqueness check on event_id fails; many NULL event_id entries observed.
- Trace: Upstream producer rolled out optional event_id for retries; retry events missing id.
- Root cause: Dedup logic relied solely on event_id; when NULL, duplicates passed through.
- Fixes:
- Containment: Filter NULL event_id temporarily in BI layer.
- Corrective: Dedupe using composite key (user_id, amount, ts bucket) and write a cleanup backfill.
- Preventive: Use a null-safe dedup key, add a NOT NULL constraint or fallback id, and add a uniqueness test on the composite key.
- Verify: Duplicate rate returns to baseline (<0.1%).
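A possible shape for the corrective dedup, assuming fact_transactions has user_id, amount, and an event_ts timestamp; the composite key and the 1-minute bucket are illustrative choices, not a standard.

```sql
-- Keep one row per (user_id, amount, minute bucket), preferring rows that carry an event_id.
WITH ranked AS (
  SELECT
    t.*,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, amount, DATE_TRUNC('minute', event_ts)
      ORDER BY event_id NULLS LAST, event_ts
    ) AS rn
  FROM fact_transactions t
  WHERE dt = DATE '2026-01-06'
)
SELECT * FROM ranked WHERE rn = 1;
```

The same composite key can back the preventive uniqueness test.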
Example 3 — Upstream rename broke a transform
- Problem: users_dim build failed; error: column user_email not found.
- Evidence: Schema drift detected; upstream payload now uses email instead of user_email.
- Trace: Commit earlier that day changed field name; no contract gate in pipeline.
- Root cause: Missing schema-compatibility contract; transform assumed stable field name.
- Fixes:
- Containment: Stop downstream refreshes; communicate outage window.
- Corrective: Update transform to coalesce(new.email, old.user_email); re-run job.
- Preventive: Add schema checks and fail-fast gate; require deprecation window for field renames.
- Verify: Job succeeds; data validity checks pass.
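One way to express the corrective coalesce, assuming the upstream feed lands as a JSON payload column in a users_raw table and your warehouse supports Postgres-style JSON operators; drop the fallback once the old field is fully deprecated.

```sql
-- Read the new field name but fall back to the old one during the deprecation window.
SELECT
  payload ->> 'user_id'                                   AS user_id,
  COALESCE(payload ->> 'email', payload ->> 'user_email') AS user_email
FROM users_raw;
```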
Exercises — practice now
These mirror the exercises below. Do them in your own words before opening solutions.
- Exercise 1: Diagnose a spike in NULL emails in users_dim and propose containment, corrective, and preventive actions.
- Exercise 2: Investigate a stream ingestion delay and plan experiments to confirm the cause.
Checklist to keep you on track:
- Problem statement includes impact, scope, and time window
- Containment action identified
- At least two competing hypotheses listed
- A fast test to disprove one hypothesis
- Clear root cause named (mechanism, not symptom)
- Corrective and preventive actions specified
Common mistakes and self-checks
- Mistaking symptom for cause: If your "cause" could be said about many incidents (e.g., "bad data"), dig deeper. Ask one more why.
- Skipping containment: If dashboards keep updating with bad data, pause them first.
- One-hypothesis tunnel vision: List at least two plausible alternatives and try to disconfirm your favorite.
- No verification: After fixes, re-run checks and compare to a known-good baseline.
- Weak prevention: Add a test/alert/contract that would have caught this earlier next time.
Practical projects
- Build a mini RCA runbook: a template and example for your team. Include sections for problem, timeline, evidence, root cause, fixes, verification.
- Add three quality checks to a critical table (freshness, uniqueness, validity) and wire an alert. Simulate a failure and run an RCA on it.
- Create a schema drift guard: compare expected vs actual schema for a feed and fail fast with a clear error message.
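For the schema drift guard project, one minimal approach is to compare an expected column list against information_schema and fail fast on any gap. The expected_schema table and the users_raw table name below are hypothetical; adapt them to your catalog.

```sql
-- Columns we expect but the live table no longer has; fail the pipeline if any rows return.
SELECT e.column_name AS missing_column
FROM expected_schema e
LEFT JOIN information_schema.columns c
  ON c.table_name = e.table_name
 AND c.column_name = e.column_name
WHERE e.table_name = 'users_raw'
  AND c.column_name IS NULL;
```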
Learning path
- Before this: data quality dimensions, monitoring/alerting basics, pipeline orchestration.
- This subskill: structured RCA, containment/correction/prevention.
- Next: incident communication, SLAs/SLOs for data, backfilling strategies, data contracts.
Next steps
- Adopt the RCA recipe on your next incident. Time-box the initial investigation and write down your hypotheses.
- Productize at least one preventive control (a new test, contract, or alert).
- Share a short post-incident note with your team.
Mini challenge
Your completeness check drops by 10% for dt=2026-01-05 on events_raw. Within 30 minutes, produce: (1) a crisp problem statement, (2) two hypotheses, (3) one fast disconfirming test, (4) one preventive control you would add. Use the checklist above.
Quick Test
Take the quick test to check your understanding.