Why this matters
As a Data Architect, you design systems that must stay trustworthy under pressure. When data pipelines fail, metrics drift, or sensitive data leaks, a clear incident playbook turns chaos into coordinated action. Your teams need fast decisions, consistent communication, and safe recovery paths.
- Real tasks you will face: declaring severity, coordinating on-call responders, deciding rollback vs. fix-forward, communicating ETAs to stakeholders, and running blameless post-incident reviews.
- Outcomes: lower time to detect (TTD), time to mitigate (TTM), and mean time to recovery (MTTR), plus fewer repeat incidents.
Concept explained simply
An incident management playbook is a predefined, step-by-step guide for handling specific classes of data issues. It tells people what to do, who owns which decisions, how to assess impact, and how to communicate.
Runbook vs. playbook:
- Runbook: precise, technical steps for a narrow task (e.g., restart a job, reprocess a partition).
- Playbook: the broader process around the incident (declare severity, assemble roles, choose runbook, communicate, and follow up).
Mental model
Think in three layers:
- Detection: reliable alerts and clear triggers.
- Coordination: roles, severity, timelines, and decisions.
- Recovery and learning: safe mitigation, verification, and improvement.
Use this flow: Trigger → Triage → Declare severity → Assign roles → Stabilize → Communicate → Mitigate → Verify → Close → Review.
Core components of a solid playbook
- Triggers and scope: what starts the playbook (alert names, dashboards, business complaints).
- Severity matrix: consistent rules to classify impact and urgency.
- Roles: Incident Commander (IC), Comms Lead, Scribe, Subject Matter Experts (SMEs), Approver/Owner.
- Golden signals: freshness, completeness, accuracy, schema, lineage impact.
- Decision trees: rollback vs. fix-forward; hotfix vs. feature freeze; notify who and when.
- Runbook references: named procedures the playbook links to (e.g., "Reprocess failed partition", "Backfill table", "Mask PII").
- Communication templates: internal updates, stakeholder notices, closure summaries.
- Metrics: TTD, TTM, MTTR, incidents by class, false-positive rate, reopen rate.
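These components are easiest to keep current when they live in version control next to your pipelines. A minimal Python sketch of one way to structure them; every table name, rotation, and wiki path here is an illustrative placeholder:

```python
# Minimal playbook skeleton kept in version control; all names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Playbook:
    name: str
    triggers: list[str]              # alert names or dashboards that start this playbook
    severity_rules: dict[str, str]   # severity -> impact definition
    roles: dict[str, str]            # role -> rotation or named owner
    runbooks: dict[str, str]         # procedure name -> wiki location
    comms_templates: dict[str, str]  # "initial" / "update" / "closure" -> template text
    metrics: list[str] = field(default_factory=lambda: ["TTD", "TTM", "MTTR", "reopen_rate"])

freshness_playbook = Playbook(
    name="Freshness breach",
    triggers=["freshness_alert:sales_daily"],
    severity_rules={"Sev2": "Key dashboards delayed past SLA"},
    roles={"IC": "data-eng on-call", "Comms Lead": "analytics manager"},
    runbooks={"Reprocess failed partition": "wiki/runbooks/reprocess-partition"},
    comms_templates={"initial": "Incident {id}: {summary} (Sev{sev}), next update {next_update}"},
)
```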
Severity matrix example (customize for your org)
- Sev1: Regulatory or PII exposure, or a business-critical outage; customer decisions blocked; MTTR target: < 2h.
- Sev2: Major data delay or accuracy issue affecting key dashboards or ML models; MTTR target: same day.
- Sev3: Degraded non-critical data or impact limited to a small user segment; MTTR target: 48h.
- Sev4: Cosmetic issues, documentation gaps, or low-risk data drift; no MTTR target, fix through normal planning.
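To make classification fast and consistent, a matrix like this can be encoded so any on-call responder reaches the same answer. A minimal sketch using the illustrative criteria above; the same-day Sev2 target is approximated as 12 hours:

```python
# Encode the severity matrix so classification is consistent across responders.
# Criteria and targets mirror the illustrative matrix above; adjust for your org.
SEVERITY_MATRIX = {
    "Sev1": {"criteria": {"pii_exposure", "regulatory", "business_critical_outage"}, "mttr_target_hours": 2},
    "Sev2": {"criteria": {"key_dashboard_delay", "model_accuracy_issue"}, "mttr_target_hours": 12},
    "Sev3": {"criteria": {"non_critical_degradation", "small_segment"}, "mttr_target_hours": 48},
    "Sev4": {"criteria": {"cosmetic", "low_risk_drift"}, "mttr_target_hours": None},  # normal planning
}

def classify(observed_impacts: set[str]) -> str:
    """Return the highest severity whose criteria overlap the observed impacts."""
    for sev in ("Sev1", "Sev2", "Sev3", "Sev4"):
        if observed_impacts & SEVERITY_MATRIX[sev]["criteria"]:
            return sev
    return "Sev4"  # default to lowest severity if nothing matches

print(classify({"key_dashboard_delay"}))  # -> Sev2
```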
Communication template (internal)
[Initial] Incident ID, Summary, Severity, Start time, Known impact, Affected assets, Next update time.
[Update] Changes since last update, Actions taken, Current hypothesis, ETA for next checkpoint.
[Closure] Root cause summary, Fix applied, Verification results, Follow-ups with owners and deadlines.
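Rendering these templates from structured fields keeps updates consistent under pressure. A minimal sketch of the initial announcement; all incident values below are made up for illustration:

```python
# Render the initial announcement from structured fields so every update has the same shape.
INITIAL_TEMPLATE = (
    "[Initial] Incident {incident_id}: {summary}\n"
    "Severity: {severity} | Start: {start_time}\n"
    "Known impact: {impact}\n"
    "Affected assets: {assets}\n"
    "Next update by: {next_update}"
)

def render_initial(**fields: str) -> str:
    return INITIAL_TEMPLATE.format(**fields)

print(render_initial(
    incident_id="INC-1042",
    summary="sales_daily missed 06:00 SLA",
    severity="Sev2",
    start_time="2024-05-14 05:35 UTC",
    impact="Executive sales dashboard stale",
    assets="sales_daily, exec_sales_dashboard",
    next_update="06:30 UTC",
))
```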
Worked examples
Example 1: Broken daily pipeline (late data)
- Trigger: Freshness alert on "sales_daily" exceeds SLA (06:00 cutoff missed).
- Triage: Confirm job logs show upstream extract failure. Check lineage for impacted dashboards.
- Severity: Sev2 (key dashboards delayed).
- Roles: IC (on-call data engineer), Comms Lead (analytics manager), SME (platform engineer).
- Decisions: Fix-forward if the upstream source is healthy; otherwise roll back to the last good partition and publish partial data with a disclaimer.
- Mitigation: Re-run the extract with a known-good window; if that fails, fall back to the last good snapshot.
- Verification: Row counts, primary KPI deltas within tolerance, dashboard render OK.
- Closure: Add retry with jitter, alert on upstream API quota, document SLA risk.
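The freshness trigger in this example boils down to comparing the latest load time against the SLA cutoff. A minimal sketch, assuming a 06:00 UTC cutoff and leaving the warehouse query to your own client:

```python
# Minimal freshness check of the kind that would trigger Example 1.
# The cutoff and timestamps are illustrative; fetch last_loaded_at from your warehouse.
from datetime import datetime, time, timezone

SLA_CUTOFF = time(6, 0)  # 06:00 UTC cutoff for sales_daily

def is_fresh(last_loaded_at: datetime, now: datetime) -> bool:
    """True if today's partition landed by the SLA cutoff (or the cutoff hasn't passed yet)."""
    cutoff_today = datetime.combine(now.date(), SLA_CUTOFF, tzinfo=timezone.utc)
    if now < cutoff_today:
        return True  # cutoff not reached yet, nothing to alert on
    return last_loaded_at.date() == now.date() and last_loaded_at <= cutoff_today

now = datetime(2024, 5, 14, 6, 5, tzinfo=timezone.utc)
last_loaded = datetime(2024, 5, 13, 5, 50, tzinfo=timezone.utc)  # yesterday's partition only
print("freshness OK" if is_fresh(last_loaded, now) else "freshness breach: open the playbook")
```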
Example 2: Data quality rule fails for PII exposure
- Trigger: DQ rule flags unmasked emails in "customer_events".
- Triage: Validate that the rule signal is real. Confirm whether the data landed in any consumer tables.
- Severity: Sev1 (possible PII exposure).
- Roles: IC (privacy-trained), Comms Lead (security liaison), SME (data platform), Approver (data owner).
- Decisions: Immediate containment vs. analysis; access freeze for affected tables.
- Mitigation: Quarantine partitions, apply masking runbook, revoke downstream access until verified.
- Verification: Sample masked records, audit downstream usage logs, confirm zero exposure.
- Closure: Root cause (ingestion transform skipped the masking step), add a pre-write masking check, require a schema contract for sensitive fields.
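The data-quality rule in this example can be as simple as scanning a sample of rows for raw email addresses. A minimal sketch; the sample rows and column names are placeholders for however you read from customer_events:

```python
# Sketch of a DQ check that flags unmasked emails, as in Example 2.
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def unmasked_email_rows(sample_rows: list[dict], columns: list[str]) -> list[dict]:
    """Return rows where any of the given columns still contains a raw email address."""
    flagged = []
    for row in sample_rows:
        if any(EMAIL_PATTERN.search(str(row.get(col, ""))) for col in columns):
            flagged.append(row)
    return flagged

sample = [
    {"event_id": 1, "contact": "a***@masked"},           # already masked
    {"event_id": 2, "contact": "jane.doe@example.com"},  # raw email -> Sev1 trigger
]
hits = unmasked_email_rows(sample, ["contact"])
if hits:
    print(f"PII exposure suspected in {len(hits)} sampled row(s): declare Sev1 and quarantine")
```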
Example 3: Metric drift anomaly on conversion rate
- Trigger: Anomaly detector flags a 30% drop in conversion metric.
- Triage: Correlate with traffic, pricing, and attribution; check model inputs freshness.
- Severity: Sev2 if decisioning depends on the metric today; Sev3 if it is monitoring-only.
- Roles: IC (analytics lead), SMEs (tracking, ingestion, ML).
- Decisions: Data issue vs. business event; hold automated decisions if uncertain.
- Mitigation: Backfill missing tracking events; disable model decisions temporarily if inputs stale.
- Verification: Recompute metric; confirm normal ranges.
- Closure: Add a source heartbeat alert; tighten anomaly thresholds and warm-up periods.
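The drift trigger in this example compares today's value against a trailing baseline. A minimal sketch, assuming a 30% drop threshold and a 7-day window; both should be tuned to your metric's normal variance:

```python
# Sketch of the drift check behind Example 3: today's conversion rate vs. a trailing baseline.
from statistics import mean

def drift_alert(history: list[float], today: float, max_drop: float = 0.30) -> bool:
    """True if today's value dropped more than max_drop relative to the trailing average."""
    baseline = mean(history)
    if baseline == 0:
        return False
    return (baseline - today) / baseline > max_drop

last_week = [0.041, 0.043, 0.040, 0.042, 0.044, 0.041, 0.042]
print(drift_alert(last_week, today=0.028))  # ~33% drop -> True, open the playbook
```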
Build your playbook in 60 minutes
- Define scope (10m): List your top 5 incident types, e.g., freshness breach, failed load, schema change, bad joins, PII exposure.
- Set severity rules (10m): Use impact and blast radius to classify Sev1–Sev4.
- Assign roles (5m): Name IC rotation, Comms Lead, Scribe, and SMEs.
- Draft decision trees (15m): Rollback vs. fix-forward; pause consumers vs. continue with caveats (see the sketch after this list).
- Prepare runbook references (10m): Reprocess partition, backfill table, mask sensitive columns, revert schema.
- Write comms templates (5m): Initial, update, closure.
- Define metrics (5m): TTD, TTM, MTTR, false-positive rate, reopen rate.
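For the decision-tree step, writing the rollback vs. fix-forward branch down as explicit rules keeps it from being re-argued mid-incident. A minimal sketch; the inputs are judgment calls the IC makes during triage, and the rules below are illustrative defaults:

```python
# One way to encode the rollback vs. fix-forward decision so it isn't re-argued at 3am.
def choose_mitigation(upstream_healthy: bool, fix_eta_minutes: int, sla_minutes_left: int,
                      last_good_snapshot_exists: bool) -> str:
    if upstream_healthy and fix_eta_minutes <= sla_minutes_left:
        return "fix-forward"                        # repair in place, SLA still reachable
    if last_good_snapshot_exists:
        return "rollback"                           # serve the last good partition with a disclaimer
    return "fix-forward with stakeholder notice"    # no safe rollback target; reset expectations early

print(choose_mitigation(upstream_healthy=False, fix_eta_minutes=90,
                        sla_minutes_left=30, last_good_snapshot_exists=True))  # -> rollback
```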
Exercises
Complete these to practice.
Exercise 1: Severity matrix and SLAs
Create a simple severity matrix and SLAs for your data platform.
- Include 4 severities (Sev1–Sev4) with impact definitions.
- Set MTTR targets and who must be notified for each severity.
- Define the trigger examples for each severity.
Expected output: a short list mapping severity → impact → MTTR → notifications → triggers.
Exercise 2: Draft a focused playbook
Scenario: A dbt model that builds the daily "orders_summary" table fails at 05:30; the stakeholder SLA is 06:00.
- Write the trigger, severity, roles, 15-minute timeline (T+0 to T+15), decision points, mitigation, verification, and closure notes.
Expected output: a 1-page playbook text suitable for your team wiki.
Self-check checklist:
- Severity rules are objective and impact-based.
- Roles are clear with backups.
- Decision tree covers rollback and fix-forward.
- Comms template has initial, update, and closure sections.
- Verification steps confirm data accuracy and completeness.
Common mistakes and self-check
- Vague severity criteria → Fix: tie to impact and SLA breach.
- No clear IC → Fix: define rotation and how to declare IC.
- Skipping verification → Fix: always include measurable checks.
- Over-notifying or under-notifying → Fix: map severity to audiences and cadence.
- One-time fixes without learning → Fix: require a short, blameless review with actions and owners.
Self-check prompts
- Can any on-call person declare severity in 2 minutes using your matrix?
- Does your playbook work at 3am with minimal context?
- Would a new hire know whom to ping and what to say?
- Are rollback and verification steps safe and reversible?
Practical projects
- Create a company-specific incident template: duplicate the sections above and add your data assets and contacts.
- Run a 30-minute tabletop drill: simulate a Sev2 freshness breach and practice comms updates every 10 minutes.
- Instrument metrics: capture TTD, TTM, MTTR from your alerting and ticketing notes.
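For the instrumentation project, the three core timings can be derived from timestamps you already record. A minimal sketch; the field names are illustrative and should map to whatever your alerting and ticketing exports provide:

```python
# Compute TTD, TTM, and per-incident recovery time from recorded timestamps.
from datetime import datetime

def _minutes(a: datetime, b: datetime) -> float:
    return (b - a).total_seconds() / 60

def incident_metrics(started: datetime, detected: datetime, mitigated: datetime,
                     resolved: datetime) -> dict[str, float]:
    return {
        "ttd_minutes": _minutes(started, detected),    # time to detect
        "ttm_minutes": _minutes(detected, mitigated),  # time to mitigate after detection
        "ttr_minutes": _minutes(started, resolved),    # recovery time; average across incidents for MTTR
    }

print(incident_metrics(
    started=datetime(2024, 5, 14, 5, 30),
    detected=datetime(2024, 5, 14, 5, 42),
    mitigated=datetime(2024, 5, 14, 6, 20),
    resolved=datetime(2024, 5, 14, 7, 5),
))
```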
Learning path
- Before this: Alerts and SLAs, Data lineage basics, Data quality rules.
- This lesson: Incident playbooks and execution.
- Next: Post-incident reviews, Continuous improvement of alerts, Operational readiness checks.
Who this is for
- Data Architects defining operational standards.
- Data Engineers and Analytics Engineers on-call.
- Platform Engineers supporting data infra.
Prerequisites
- Basic knowledge of your orchestration tool (e.g., Airflow) and transformation layer (e.g., dbt).
- Awareness of key datasets, business KPIs, and data SLAs.
Next steps
- Finish the exercises and take the quick test below.
- Schedule a tabletop drill this week.
- Add your playbook to the team wiki and review it quarterly.
Mini challenge
In 5 minutes, write a one-paragraph initial incident announcement for a Sev2 freshness breach affecting the executive dashboard. Include summary, impact, start time, and next update time.