Who this is for
This lesson is for Analytics Engineers and BI practitioners who own or support data models, dashboards, or pipelines and need a calm, consistent way to handle incidents and find true root causes.
Prerequisites
- Basic SQL (SELECT, JOINs, WHERE, GROUP BY)
- Familiarity with your data pipeline tool (e.g., scheduler, transformations, tests)
- Understanding of data lineage concepts (upstream/downstream)
Why this matters
Real tasks you will face:
- A CFO dashboard shows revenue down 40% overnight. Is it real, or a data issue?
- A model fails after an upstream schema change. How fast can you restore service and prevent repeats?
- A daily load is late on quarter-end. Who do you notify, and how do you prioritize?
Clear triage and root cause practices reduce downtime, prevent repeat incidents, and build trust in data.
Concept explained simply
Triage decides what to do first when something breaks. Root cause analysis (RCA) finds the real reason it broke so it does not happen again.
Mental model: The Fire Triangle
Think of incidents like a small fire:
- Detect: See smoke (alert or report)
- Contain: Stop spread (pause consumers, rollback, mute dashboards)
- Eliminate cause: Remove the heat source (fix the defect, add guardrails)
Standard triage steps
- Detect: A test fails, a freshness alert triggers, or a user reports bad data.
- Acknowledge & log: Create an incident note with timestamp, reporter, assets affected, current status.
- Assess impact & severity: Who is affected? Which dashboards/models? How many users? Is this a critical business period?
- Contain: Stop the bleeding. Options: pause downstream jobs, revert to the last good version, flag dashboards with a banner, or temporarily exclude bad partitions (a minimal sketch follows this list).
- Communicate: Assign an incident lead, set status update cadence (e.g., every 30–60 minutes for high severity).
- Find root cause: Use change diff, lineage walk, and slice-and-dice to isolate the exact breaking change.
- Fix & validate: Apply minimal safe fix, backfill if needed, validate with checks and stakeholder sign-off.
- Document & prevent: Write a brief RCA, add tests/monitors/runbook steps, schedule a retro if severity was high.
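As a concrete illustration of the containment step, here is a minimal sketch of excluding a suspect partition from a consumer-facing view. The names (analytics.orders, analytics.orders_published, load_date) and the specific date are hypothetical; the right mechanism depends on your warehouse and how you publish data.

```sql
-- Hypothetical containment: republish the consumer-facing view without the
-- suspect partition while the incident is investigated.
CREATE OR REPLACE VIEW analytics.orders_published AS
SELECT *
FROM analytics.orders
WHERE load_date <> DATE '2024-03-15';  -- temporarily exclude the bad partition
```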
Severity levels (quick matrix)
- S1 Critical: Executive/business-critical dashboards broken or key deliveries blocked; revenue/compliance risk now. Immediate response, 24/7.
- S2 High: Many users or critical teams affected; SLA risk within hours. Same day response.
- S3 Medium: Limited scope or easy workaround; respond during business hours.
- S4 Low: Cosmetic/minor discrepancies; plan into backlog.
Root cause methods (practical)
- Change diff: List what changed since last healthy state (code, configs, sources, schedules, volumes).
- Lineage walk: Trace upstream dependencies step-by-step. Check each node for freshness, schema, and volume anomalies.
- Slice-and-dice: Narrow by time window, segment, or partition to find where data first looks wrong.
- Reproduce minimally: Build the smallest SQL that still shows the bug (see the sketch after this list). This reduces noise.
- Five Whys: Ask why repeatedly until you reach a process/guardrail fix, not an individual mistake.
- Compare environments: Prod vs. staging or last successful run vs. current failed run.
- Null lineage: Track where nulls or zeros first appear across joins and transforms.
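To make slice-and-dice and minimal reproduction concrete, here is a sketch under assumed names (analytics.fct_revenue, order_date, region, revenue); adapt it to your own models and the suspected time window.

```sql
-- Hypothetical slice-and-dice: compare daily totals by segment around the
-- suspected break to see where the numbers first diverge.
SELECT
    order_date,
    region,
    COUNT(*)     AS row_count,
    SUM(revenue) AS total_revenue
FROM analytics.fct_revenue
WHERE order_date BETWEEN DATE '2024-03-10' AND DATE '2024-03-16'
GROUP BY order_date, region
ORDER BY order_date, region;
```

Once one day and one segment look wrong, keep stripping the query down (one partition, one join at a time) until you have the smallest query that still shows the bug.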
RCA one-pager template
- Summary: One-paragraph description + severity
- Timeline: First detection → containment → fix → validation
- Impact: Users, dashboards/models, time window, decisions affected
- Root cause: The specific defect and why it happened
- Remediation: Fix applied and validation steps
- Prevention: Tests/alerts/process changes and owners
- Follow-ups: Tickets with due dates
Worked examples
Example 1: Upstream rename breaks models
Signal: Daily build fails; error: column user_id not found.
Triage:
- Contain: Pause downstream dashboard refreshes; notify #data-status.
- Impact: All user-level metrics for the last 24h affected → S2 High.
Root cause:
- Change diff shows source_users.user_id renamed to id in the last deploy.
- Lineage walk confirms all joins expect user_id.
Fix:
- Update models to alias id as user_id; add a schema test to enforce required columns.
- Backfill the last day; validate counts and primary key uniqueness.
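A minimal sketch of the fix and the validation, assuming the staging model reads from source_users and the affected model is dim_users; column names beyond the key are illustrative.

```sql
-- Re-alias the renamed source column so downstream joins keep their contract.
SELECT
    id AS user_id,   -- source renamed user_id -> id; preserve the old column name
    email,
    created_at
FROM source_users;

-- Post-backfill validation: the key should be unique (query returns zero rows).
SELECT user_id, COUNT(*) AS dup_count
FROM dim_users
GROUP BY user_id
HAVING COUNT(*) > 1;
```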
Example 2: Sudden revenue drop on dashboard
Signal: Executive ping: revenue -35% overnight.
Triage:
- Contain: Add dashboard banner "Under investigation"; freeze decisions based on this metric.
- Impact: CFO dashboard + daily standup reports → S1 Critical (quarter-end).
Root cause:
- Slice-and-dice shows drop only in EU region.
- Lineage reveals EU partition load failed due to API quota.
Fix:
- Re-run EU ingestion after the quota resets; add an alert for when partition row counts deviate by more than 20% (sketched below).
- Validate EU totals and grand total; remove banner.
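One way to express that row-count alert as a scheduled query, assuming a daily-loaded analytics.fct_revenue table; the 20% threshold and the trailing 7-day baseline are illustrative choices.

```sql
-- Hypothetical volume monitor: flag region/day partitions whose row count
-- deviates more than 20% from the trailing 7-day average.
WITH daily AS (
    SELECT load_date, region, COUNT(*) AS row_count
    FROM analytics.fct_revenue
    GROUP BY load_date, region
),
baseline AS (
    SELECT
        load_date,
        region,
        row_count,
        AVG(row_count) OVER (
            PARTITION BY region
            ORDER BY load_date
            ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
        ) AS avg_prev_7d
    FROM daily
)
SELECT *
FROM baseline
WHERE avg_prev_7d IS NOT NULL
  AND ABS(row_count - avg_prev_7d) / avg_prev_7d > 0.20;
```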
Example 3: Null explosion after join
Signal: Data test fails: customer_id is null for 12% of rows in the joined orders model.
Triage:
- Contain: Exclude the affected partition from publishing; notify consumers of the partial data.
- Impact: Several marketing models depend on customer_id → S2 High.
Root cause:
- Minimal query shows FROM orders o LEFT JOIN customers c ON o.customer_id = c.id AND c.is_active = true, with customer_id taken from the customers side.
- A policy change marked many more customers inactive; because the is_active filter sits inside the join predicate, those orders lost their match and customer_id came back null.
Fix:
- Move the is_active filter out of the join predicate and into a downstream model, where the business logic belongs.
- Add a data test: join key coverage ≥ 99.9%.
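A sketch of the corrected join and the coverage test, with illustrative table and column names; the business filter moves out of the join predicate so it can be applied, and owned, downstream.

```sql
-- Before (defect): the business filter sits inside the join predicate, so
-- orders belonging to inactive customers lose their match and customer_id is null.
--   FROM orders o
--   LEFT JOIN customers c
--     ON o.customer_id = c.id AND c.is_active = true

-- After: join on the key only; apply business filters downstream.
SELECT
    o.order_id,
    c.id        AS customer_id,
    c.is_active
FROM orders o
LEFT JOIN customers c
  ON o.customer_id = c.id;

-- Join key coverage test: returns a row (and fails) when coverage drops below 99.9%.
SELECT
    COUNT(*)                     AS total_orders,
    COUNT(c.id)                  AS matched_orders,
    COUNT(c.id) * 1.0 / COUNT(*) AS coverage
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
HAVING COUNT(c.id) * 1.0 / COUNT(*) < 0.999;
```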
Communication during incidents
- Assign roles: Incident Lead (one person). Optional: Comms owner.
- Set cadence: S1 = every 30–60 minutes; S2 = every 2–4 hours; S3/S4 = daily or as needed.
- Message format: What happened, impact, next update time, current actions, ask for info if needed.
Incident note quick template
- When detected:
- Reporter:
- Affected assets:
- Severity:
- Current status:
- Next update by:
- Leads/Owners:
Common mistakes and self-check
- Jumping to a fix without scoping impact. Self-check: Can you list all affected dashboards/models and their users?
- Skipping containment. Self-check: Did you prevent more bad data from publishing?
- Stopping at proximate cause. Self-check: Does your RCA include a prevention step owned by someone?
- Overfitting to one tool alert. Self-check: Did you verify with lineage and minimal reproduction?
- Not writing a brief RCA. Self-check: Can a teammate understand the incident from your one-pager?
Practice: Exercises
Do these now. The quick test below will check your understanding.
- Exercise ex1 — Severity and triage plan
Scenario: A marketing dashboard shows 0 conversions for the last 12 hours. Source ingestion for events is late; the product analytics team pinged you. It is a mid-week campaign period.
Deliver: severity level, first 3 containment actions, and suspected root cause signals to check.
- Exercise ex2 — Minimal failing query
Scenario: Order totals doubled day-over-day but unique orders are stable. Investigate with a minimal SQL reproduction and list next checks.
Checklist before you say “fixed”
- Impact documented and communicated
- Containment applied and verified
- Root cause identified with evidence
- Fix implemented and validated with tests/metrics
- Backfill performed if needed
- RCA written with prevention steps and owners
Practical projects
- Project 1: Create an incident runbook for your top 5 critical dashboards. Include severity matrix, comms templates, and step-by-step checks.
- Project 2: Add three proactive tests or monitors (freshness, volume anomaly, key coverage) and simulate a failure to validate your alerting and triage flow.
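As a starting point for Project 2, a freshness monitor can be a single scheduled query; the table name, timestamp column, and 6-hour threshold below are assumptions to replace with your own (interval syntax varies by warehouse).

```sql
-- Hypothetical freshness monitor: returns a row (and alerts) when the newest
-- record is older than the expected 6-hour delivery window.
SELECT MAX(loaded_at) AS last_loaded_at
FROM analytics.fct_orders
HAVING MAX(loaded_at) < CURRENT_TIMESTAMP - INTERVAL '6' HOUR;
```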
Learning path
- Start: Incident basics (this lesson)
- Next: Data tests and monitors (freshness, schema, volume, distribution)
- Then: Reliable deployments and change management
- Finally: Post-incident reviews and continuous improvement
Mini challenge
You notice daily revenue is exactly double, only for one day. In one paragraph, state your triage plan (containment + impact) and top 3 root cause checks. Keep it to five sentences.
Next steps
- Adopt the one-pager RCA template for your next incident.
- Schedule a 30-minute drill: simulate a failed source and practice your communication cadence.
- Take the quick test below to confirm mastery.