Root Cause Analysis Process

Learn the Root Cause Analysis Process for free with explanations, exercises, and a quick test (for Data Architects).

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

As a Data Architect, you are accountable for the trustworthiness of data platforms. When data quality incidents happen (late data, schema breaks, duplicates, wrong values), Root Cause Analysis (RCA) helps you quickly contain impact, find the true source of failure, and implement fixes that prevent recurrence. Typical real tasks include:

  • Triaging a freshness breach that delays executive dashboards.
  • Tracing a schema change from an upstream service that broke ingestion jobs.
  • Explaining an unexpected drop in metrics caused by deduplication logic.
  • Coordinating with platform, analytics, and source teams to implement preventative controls.

Who this is for

  • Data Architects and Leads responsible for platform reliability and data contracts.
  • Senior Data Engineers who own ingestion, transformation, or orchestration layers.
  • Analytics Engineers who debug downstream model issues.

Prerequisites

  • Basic understanding of data pipelines (ingestion, storage, transformation, serving).
  • Familiarity with data quality dimensions (freshness, completeness, uniqueness, validity, accuracy).
  • Ability to read logs/metrics and interpret lineage and orchestration runs.

Concept explained simply

Root Cause Analysis is a structured way to move from symptoms (what you observe) to the underlying cause (what truly created the symptom) and then prevent it from happening again. It is not about blame; it is about systems thinking.

Mental model: your data system as a water network

Imagine raw data as water entering pipes. Valves control flow (schedulers), filters clean water (transformations), tanks store it (warehouses/data lakes), and gauges measure pressure/quality (observability). A puddle on the floor (a bad dashboard value) is a symptom. RCA traces upstream along the pipes to find the leaky joint (root cause), patches it (fix), and adds a sensor or valve (preventive control) to stop future leaks.

Standard RCA workflow

1) Detect and triage

  • Confirm the alert/symptom and its severity.
  • Identify the blast radius: which datasets, reports, and stakeholders are affected.

2) Define the problem precisely

  • Write a one-sentence problem statement with measurable symptoms (what, where, since when).
  • Capture example rows/partitions and error messages.

3) Contain

  • Quarantine bad partitions or pause downstream refreshes if needed.
  • Communicate a clear status update and ETA; avoid speculative causes.

4) Build a timeline

  • List recent deployments, config changes, backfills, and incident start time.
  • Correlate orchestration runs, retries, and failures.

5) Map dependencies

  • Use lineage to trace upstream sources and downstream dependents.
  • Identify candidate layers: source system, ingestion, storage, transform, serving.

6) Generate hypotheses

  • Apply 5 Whys and/or a Fishbone (People, Process, Platform, Data) to list plausible causes.
  • Prioritize by likelihood and ease of test.

7) Test and isolate

  • Run targeted checks: schema diffs, partition counts, sample queries, replay subsets.
  • Stop when evidence isolates the single most plausible root cause.

8) Fix and verify

  • Apply the minimal safe fix, backfill as needed, and verify metrics/lineage return to normal.
  • Monitor closely for one full cycle (e.g., 24 hours).

9) Prevent and document

  • Add preventive controls (data contracts, tests, monitors, idempotent writes).
  • Publish a concise RCA note: problem, impact, timeline, root cause, fix, prevention, owners, dates.

Worked examples

Example 1: Freshness breach on a daily table

Symptom: table sales.daily is 18h late; dashboards stale since 07:00.

Timeline: New ingestion job version deployed at 23:55; first run 00:05 succeeded; upstream export retried at 00:40.

Investigation: S3 partition for dt=2026-01-17 exists but size=0 bytes; job marked success because it checks path exists, not size.

Root cause: Ingestion success criteria too weak; empty partitions pass.

Fix: Update job to validate min record count and non-zero byte size; rerun for dt=2026-01-17.

Prevention: Add completeness test on sales.daily; alert on zero-size partitions; update runbook.
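
The stronger success check from the fix can be very small. Below is a minimal sketch in Python, assuming the partition lands as objects under an S3 prefix such as s3://analytics-raw/sales_daily/dt=YYYY-MM-DD/ (bucket and prefix are hypothetical) and that boto3 credentials are already configured:

    import boto3

    def validate_partition(bucket: str, prefix: str, min_objects: int = 1, min_bytes: int = 1) -> None:
        """Fail loudly if the partition is missing, empty, or contains zero bytes."""
        s3 = boto3.client("s3")
        total_objects, total_bytes = 0, 0
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                total_objects += 1
                total_bytes += obj["Size"]
        if total_objects < min_objects or total_bytes < min_bytes:
            raise ValueError(
                f"Partition check failed for s3://{bucket}/{prefix}: "
                f"{total_objects} objects, {total_bytes} bytes"
            )

    # Gate the load on the partition actually containing data,
    # not just on the path existing (hypothetical bucket and prefix).
    validate_partition("analytics-raw", "sales_daily/dt=2026-01-17/")

A record-count check against the loaded table is a useful complement, since a non-empty file can still contain zero valid rows.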

Example 2: Schema drift breaks transformations

Symptom: dbt model customer_orders fails: column customer_id not found.

Timeline: API v2 released 09:10; loader started 09:15; first failure 09:16.

Investigation: Upstream renamed customer_id to cust_id and added nullable field vip_flag.

Root cause: Breaking schema change without contract; transformation expects old name.

Fix: Add rename mapping in ingestion (cust_id -> customer_id) and re-run.

Prevention: Data contract with breaking-change checks; CI test to fail on removed/renamed fields; a compatibility window with dual-write/aliasing.
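
The CI check from the prevention step can start as a simple schema diff that fails on removed or renamed columns. Here is a minimal sketch in Python, reusing the schemas from this example; how you capture them (from information_schema, a loader manifest, or a contract file) depends on your stack:

    def breaking_changes(old: dict, new: dict) -> list:
        """Return breaking changes between two {column: type} schema snapshots."""
        problems = []
        for column, old_type in old.items():
            if column not in new:
                problems.append(f"column removed or renamed: {column}")
            elif new[column] != old_type:
                problems.append(f"type changed for {column}: {old_type} -> {new[column]}")
        # Added nullable columns (vip_flag here) are treated as non-breaking.
        return problems

    yesterday = {"customer_id": "string", "order_total": "decimal(10,2)"}
    today = {"cust_id": "string", "order_total": "decimal(10,2)", "vip_flag": "boolean"}

    issues = breaking_changes(yesterday, today)
    if issues:
        raise SystemExit("Breaking schema change detected:\n" + "\n".join(issues))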

Example 3: Duplicate events after a backfill

Symptom: page_views count increased 2x; uniqueness test on (event_id) fails.

Timeline: Backfill run executed for last 7 days; dedup model skipped to save time.

Investigation: Upsert step used insert-only; no conflict key; dedup disabled.

Root cause: Non-idempotent backfill and missing merge key logic.

Fix: Re-run using MERGE on event_id and event_timestamp; enable dedup step.

Prevention: Standard backfill playbook; idempotent writes; guardrail to block insert-only replays on dedup-sensitive tables.
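
The re-run in the fix only works because the write is idempotent. Below is a minimal sketch of such a MERGE driven from Python; the exact MERGE syntax varies by warehouse, and the staging table and non-key columns (page_url, user_id) are hypothetical:

    MERGE_SQL = """
    MERGE INTO analytics.page_views AS target
    USING staging.page_views_backfill AS source
      ON  target.event_id = source.event_id
      AND target.event_timestamp = source.event_timestamp
    WHEN MATCHED THEN UPDATE SET
      page_url = source.page_url,
      user_id  = source.user_id
    WHEN NOT MATCHED THEN INSERT (event_id, event_timestamp, page_url, user_id)
      VALUES (source.event_id, source.event_timestamp, source.page_url, source.user_id)
    """

    def run_backfill(cursor) -> None:
        """Replay the staged window; safe to re-run because matching keys update in place."""
        cursor.execute(MERGE_SQL)

Re-running the statement against the same staging data does not create duplicates, which is exactly the property a backfill playbook should require.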

Tools and signals to inspect

  • Freshness and latency metrics (by table/partition)
  • Volume/completeness (row counts, zero-size partitions, null ratios)
  • Schema diffs (added/removed/renamed columns, type changes)
  • Quality tests (uniqueness, referential integrity, validity)
  • Orchestrator run logs, retries, durations
  • Lineage graph (upstream sources; recent changes)
  • Deployment/config changes, Git diffs, feature flags
  • Source system status and release notes
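
Many of these signals can be pulled with one small query per table. A minimal sketch for two of them (row counts and a null ratio per partition), assuming a DB-API style connection and a table partitioned by a dt column; the table and column names are placeholders:

    PROFILE_SQL = """
    SELECT
      dt,
      COUNT(*) AS row_count,
      AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) AS customer_id_null_ratio
    FROM sales.daily
    WHERE dt >= DATE '2026-01-15'
    GROUP BY dt
    ORDER BY dt
    """

    def profile_recent_partitions(cursor) -> None:
        """Print per-partition volume and null-ratio signals for a quick scan."""
        cursor.execute(PROFILE_SQL)
        for dt, row_count, null_ratio in cursor.fetchall():
            print(f"{dt}: rows={row_count}, customer_id null ratio={null_ratio:.1%}")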

Tip: fast isolation heuristic

If multiple downstream models fail simultaneously, suspect an upstream data or contract change. If only one model fails while its siblings succeed, suspect transformation logic or filters specific to that model.
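
As a toy illustration of this heuristic, suppose you export the latest run status of a failed model and its sibling models (those sharing the same upstream parent) from your orchestrator into a plain dictionary; the function below encodes only the decision logic, not any real orchestrator API:

    def suspect_layer(failed_model: str, sibling_statuses: dict) -> str:
        """First guess only: shared failures point upstream, a lone failure points at the model itself."""
        failing = [m for m, status in sibling_statuses.items() if status == "failed"]
        if len(failing) > 1:
            return "suspect an upstream data or contract change shared by these models"
        if failing == [failed_model]:
            return f"suspect transformation logic or filters specific to {failed_model}"
        return "inconclusive: re-check statuses and lineage"

    # Hypothetical statuses exported from the orchestrator:
    statuses = {"customer_orders": "failed", "customer_returns": "success", "customer_ltv": "success"}
    print(suspect_layer("customer_orders", statuses))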

Common mistakes and self-checks

  • Jumping to fixes without containment. Self-check: Have you paused or quarantined to limit impact?
  • Confusing symptom with cause. Self-check: Can your cause independently reproduce the symptom?
  • Incomplete timeline. Self-check: Did you list deployments, backfills, and config changes?
  • Not validating the fix. Self-check: Did metrics return to normal for a full cycle?
  • No prevention. Self-check: What control stops this from recurring?

Exercises

Complete the tasks below, then compare your answers with the expected output described for each exercise.

Exercise 1: Trace a freshness breach

Given: sales.daily freshness alert at 07:00; the orchestrator shows load_daily_sales succeeded at 01:05; the S3 dt=2026-01-17 partition exists but has 0 bytes; upstream export_orders started at 00:10, failed once, and retried at 00:40.

Write a short RCA note with:

  • Symptom
  • Timeline
  • Root cause hypothesis
  • Tests to confirm
  • Fix
  • Preventive actions

Exercise 2: Resolve schema drift

Yesterday schema: {"customer_id": "string", "order_total": "decimal(10,2)"}

Today schema: {"cust_id": "string", "order_total": "decimal(10,2)", "vip_flag": "boolean"}

Downstream error: column customer_id not found.

Propose:

  • Immediate rollback/containment plan
  • Medium-term fix
  • Preventive control

Checklist before you submit

  • I wrote a measurable problem statement.
  • I built a timeline of events.
  • I proposed at least one preventive control.

Practical projects

  • Set up a small pipeline with a synthetic upstream CSV feed. Introduce a breaking rename and perform a full RCA, including a prevention PR that adds a schema contract test.
  • Implement an idempotent backfill script for an events table using a MERGE on a composite key. Run a safe replay and document the RCA steps you would take if duplicates appear.
  • Create a runbook template for incidents: include fields for impact, blast radius, timeline, root cause, fix, prevention, owners, and communication notes.

Learning path

  1. Master data quality dimensions and monitoring signals.
  2. Learn RCA techniques (5 Whys, Fishbone) and practice on past incidents.
  3. Implement data contracts and CI tests to catch schema drift.
  4. Design idempotent ingestion and backfill patterns.
  5. Standardize runbooks and post-incident documentation.

Mini challenge

Your daily finance mart shows a sudden 12% dip only on the latest partition; upstream volumes and freshness look normal. What two checks do you run first, and why? Write your answer in two bullets focusing on isolating transform logic vs. source data change.

Next steps

  • Adopt the 9-step workflow in your team’s runbook.
  • Add at least two preventive controls (schema contract test, idempotent writes) this week.
  • Schedule a short retro after your next incident to improve detection and containment.

Quick Test

Take the quick test to check your understanding: 8 questions, pass with 70% or higher.

Practice Exercises

2 exercises to complete

Instructions

You received a freshness alert: sales.daily is 18h late. Orchestrator shows load_daily_sales succeeded at 01:05. The S3 partition for dt=2026-01-17 exists but size=0 bytes. The upstream export_orders job started 00:10, failed once, retried 00:40.

Write a concise RCA note covering:

  • Symptom
  • Timeline
  • Root cause hypothesis
  • Tests to confirm
  • Fix
  • Preventive actions

Expected Output

A clear RCA note that identifies the empty upstream partition as the likely cause, shows a timeline, proposes tests (row count > 0, sample rows), a fix (validate size and count, reprocess), and prevention (completeness tests, a data contract).
