Incident Management And On Call

Learn Incident Management And On Call for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As a Data Platform Engineer, you safeguard data pipelines and analytics reliability. When incidents hit—late dashboards, broken ETL, bad data—teams downstream lose trust. Strong incident management minimizes impact and restores confidence fast.

  • Real tasks you will do: triage failing jobs, mute noisy alerts, escalate data quality breaches, coordinate incident calls, communicate status to stakeholders, and lead post-incident reviews.
  • Target outcomes: shorter mean time to detect (MTTD) and mean time to resolve (MTTR), fewer repeat incidents, and predictable on-call rotations that reduce burnout.

Concept explained simply

An incident is any unplanned event that degrades the data platform or the trustworthiness of its data. Incident management is the repeatable playbook to detect, decide severity, act, communicate, and learn.

Mental model

Think of a fire drill for data. You have a designated commander, clear exits (runbooks), alarms (alerts and SLOs), and a debrief. The system is ready, even if the emergency is a surprise.

Key terms
  • SLO: Service Level Objective, e.g., ">= 99% of daily jobs finish by 7 AM" (sketched in code after this list).
  • SLA: Service Level Agreement; an external commitment, often to business users or customers.
  • Severity (Sev): Impact level, e.g., Sev1 = company-wide data outage.
  • Runbook: Step-by-step guide to handle a known failure.
  • Incident Commander (IC): Decision-maker and coordinator during an incident.
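
To make the SLO example above concrete, here is a minimal sketch (in Python) that computes daily SLO attainment from job-run records. The JobRun structure, its field names, and the 7 AM cutoff are illustrative assumptions, not any specific scheduler's API.

  from dataclasses import dataclass
  from datetime import datetime, time

  # Hypothetical job-run record; field names are illustrative.
  @dataclass
  class JobRun:
      job_name: str
      finished_at: datetime  # completion timestamp in warehouse-local time

  def slo_attainment(runs: list[JobRun], cutoff: time = time(7, 0)) -> float:
      """Fraction of daily jobs that finished by the cutoff (e.g., 7 AM)."""
      if not runs:
          return 1.0  # nothing was due, so nothing breached the SLO
      on_time = sum(1 for r in runs if r.finished_at.time() <= cutoff)
      return on_time / len(runs)

  # Example: page if attainment drops below the 99% objective.
  runs = [
      JobRun("daily_revenue", datetime(2026, 1, 11, 6, 42)),
      JobRun("daily_orders", datetime(2026, 1, 11, 7, 15)),  # late
  ]
  if slo_attainment(runs) < 0.99:
      print("SLO breach: page the on-call engineer")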

Step-by-step incident playbook

  1. Detect: Alert triggers based on SLO breach, anomaly, or user report.
  2. Triage: Confirm impact and scope. Assign an Incident Commander. Set severity.
  3. Stabilize: Contain damage (stop bad data from propagating, disable affected consumers, plan backfills).
  4. Communicate: Send initial status (what, impact, ETA), then regular updates.
  5. Resolve: Implement fix, validate, and restore normal operation.
  6. Review: Blameless post-incident review, root cause analysis, and action items with owners and due dates.
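
One lightweight way to keep these steps honest is to record a timestamp for each one on the incident itself, so the review step has a factual timeline to work from. The sketch below uses assumed field names, not a ticketing-tool schema.

  from dataclasses import dataclass, field
  from datetime import datetime
  from typing import Optional

  @dataclass
  class Incident:
      # Assumed fields; adapt to whatever incident tracker you use.
      title: str
      severity: int                              # 1 = most severe
      detected_at: datetime                      # step 1: detect
      commander: Optional[str] = None            # step 2: triage (IC named)
      stabilized_at: Optional[datetime] = None   # step 3: stabilize
      last_update_at: Optional[datetime] = None  # step 4: communicate
      resolved_at: Optional[datetime] = None     # step 5: resolve
      action_items: list[str] = field(default_factory=list)  # step 6: review

  # Example: open a Sev2 for a broken revenue pipeline.
  inc = Incident(title="Null revenue_amount in EMEA", severity=2,
                 detected_at=datetime(2026, 1, 11, 6, 30), commander="on-call IC")
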
Severity guide (example)
  • Sev1: Company-critical dashboards wrong/missing; financial or regulatory exposure.
  • Sev2: Team-critical pipelines delayed or degraded; noticeable business impact.
  • Sev3: Partial degradation; workaround exists; limited user scope.
  • Sev4: Minor; cosmetic or non-urgent bug.
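
A severity guide is easiest to follow when it maps directly to who gets paged and how often updates go out. The sketch below encodes one possible mapping as plain data; the pager targets and update cadences are assumptions to tune for your team.

  # Hypothetical severity-to-response mapping; adjust targets and cadences as needed.
  SEVERITY_RESPONSE = {
      1: {"page": ["primary", "secondary", "manager"], "update_every_minutes": 30},
      2: {"page": ["primary", "secondary"], "update_every_minutes": 60},
      3: {"page": ["primary"], "update_every_minutes": 240},
      4: {"page": [], "update_every_minutes": None},  # ticket only, no page
  }

  def response_for(severity: int) -> dict:
      """Look up who to page and the status-update cadence for a severity level."""
      # Default to the Sev3 response if an unknown level is passed.
      return SEVERITY_RESPONSE.get(severity, SEVERITY_RESPONSE[3])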

Worked examples

Example 1: Broken pipeline produces null revenue
  • Signal: Data quality check fails; nulls spike to 25% in revenue_amount.
  • Impact: Finance dashboards show zero revenue for EMEA; executives rely on it daily.
  • Actions:
    • IC sets Sev2, pauses downstream dashboards for EMEA to prevent decisions on bad data.
    • Identify recent schema change in source API (amount renamed).
    • Hotfix mapping; backfill last 24 hours.
  • Communication: Initial update in 10 minutes; hourly until resolved; final notice with fix and backfill time.
  • Review: Add schema change contract test; tighten alert to catch field rename earlier.
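
The detection in this example hinges on a null-rate check. A minimal version might look like the sketch below; the table and column names (fact_revenue, revenue_amount), the 5% threshold, and the run_query wrapper are assumptions, not a specific data quality tool.

  # Sketch of the null-rate check that would catch Example 1.
  NULL_RATE_SQL = """
  SELECT AVG(CASE WHEN revenue_amount IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
  FROM fact_revenue
  WHERE load_date = CURRENT_DATE
  """

  def check_revenue_nulls(run_query, threshold: float = 0.05) -> bool:
      """Return True if the null rate breaches the threshold (page on-call)."""
      null_rate = run_query(NULL_RATE_SQL)  # run_query is your warehouse client wrapper
      if null_rate > threshold:
          print(f"Data quality breach: {null_rate:.0%} nulls in revenue_amount")
          return True
      return False
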
Example 2: Warehouse overload delays SLAs
  • Signal: Job queue backlog; warehouse concurrency maxed; daily SLA at risk.
  • Impact: Marketing attribution tables are late; the 7 AM daily SLA is likely to be missed.
  • Actions:
    • IC sets Sev3; enable workload management to prioritize SLA-critical models.
    • Cancel non-critical ad-hoc queries; scale up compute for 2 hours.
    • After catch-up, backfill missed incremental partition.
  • Review: Reserve capacity for critical DAGs; add guardrails to auto-throttle heavy ad-hoc queries.
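
The stabilization step here is essentially a prioritization decision: keep SLA-critical models running and shed everything else. A minimal sketch, assuming a hypothetical list of running queries with workload tags:

  from dataclasses import dataclass

  @dataclass
  class RunningQuery:
      query_id: str
      tag: str               # e.g., "sla_critical" or "ad_hoc" -- tagging scheme is assumed
      runtime_minutes: float

  def queries_to_cancel(queries: list[RunningQuery], max_adhoc_minutes: float = 10.0) -> list[str]:
      """Pick non-critical, long-running ad-hoc queries to cancel during catch-up."""
      return [
          q.query_id
          for q in queries
          if q.tag == "ad_hoc" and q.runtime_minutes > max_adhoc_minutes
      ]

  # During the incident, feed this list to your warehouse's cancel/abort command.
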
Example 3: PII appears in analytics table
  • Signal: Automated classifier flags potential PII in a public analytics dataset.
  • Impact: Compliance risk; internal users can access sensitive fields.
  • Actions:
    • IC sets Sev1; quarantine table; revoke access; notify security and compliance.
    • Trace lineage to upstream raw ingestion; fix transformation to hash/remove PII.
    • Rebuild table; verify masking; restore access after sign-off.
  • Review: Add data contract on PII fields; enforce scans pre-publish; improve lineage alerts.
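
Quarantining a table usually means revoking access first and fixing the data second. The statements below are a generic ANSI-style sketch; the exact syntax, role names, and table names depend on your warehouse and are assumptions here.

  # Generic sketch of the quarantine step; adapt syntax and names to your warehouse.
  QUARANTINE_STATEMENTS = [
      "REVOKE SELECT ON analytics.public_dataset FROM ROLE analyst_role;",
      "ALTER TABLE analytics.public_dataset RENAME TO public_dataset_quarantined;",
  ]

  def quarantine_table(run_statement) -> None:
      """Cut off access before investigating; restore only after compliance sign-off."""
      for stmt in QUARANTINE_STATEMENTS:
          run_statement(stmt)  # run_statement is your warehouse client wrapper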

On-call setup

  • Coverage: 24/7 for critical SLOs, business hours for others; define handoff windows.
  • Rotation: Primary and secondary on-call; weekly rotations reduce context switching.
  • Escalation: If the primary does not acknowledge within 5 minutes, page the secondary; after 15 minutes, page the manager or IC (see the policy sketch after this list).
  • Noise control: Group related alerts; add auto-muting for flapping signals; use rate limits.
  • Runbooks: One per top incident type; include decision trees and rollback steps.
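
The acknowledgement and escalation timers above translate naturally into a small policy definition that your paging tool (or a cron-driven script) can enforce. The structure below is an illustrative sketch, not any vendor's configuration format.

  from dataclasses import dataclass

  @dataclass
  class EscalationStep:
      target: str            # who gets paged
      after_minutes: int     # minutes without acknowledgement before this step fires

  # Illustrative policy matching the timers above: primary first,
  # secondary at 5 minutes, manager/IC at 15 minutes.
  ESCALATION_POLICY = [
      EscalationStep(target="primary_oncall", after_minutes=0),
      EscalationStep(target="secondary_oncall", after_minutes=5),
      EscalationStep(target="manager_ic", after_minutes=15),
  ]

  def next_target(minutes_unacknowledged: int) -> str:
      """Return who should be paged given how long the alert has gone unacknowledged."""
      # Assumes a non-negative elapsed time; the first step always fires at 0 minutes.
      eligible = [s for s in ESCALATION_POLICY if s.after_minutes <= minutes_unacknowledged]
      return eligible[-1].target
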
Handoff checklist
  • Open incidents status and next update time
  • Known degradations and temporary mitigations
  • Upcoming risky changes or backfills
  • Pager routes tested and quiet hours confirmed
  • Runbook changes since last rotation

Communication templates

Initial incident update (template)
Subject: [SevX] Incident: Short name (Start time)
Status: Investigating
Impact: Who is affected and how
Scope: Systems / tables
Next update: Time window
IC: Name
Actions so far: 1-2 bullets
Ask: If you are affected, do X
Resolution update (template)
Subject: [Resolved] Incident: Short name
Impact: Summary
Root cause: Brief root cause
Fix: What we changed
Data corrections: Backfill/window
Prevention: Top 1-2 changes
Contact: Channel
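
If you send these updates from chat or email tooling, it helps to render them from structured fields so nothing is forgotten under pressure. Here is a minimal sketch of the initial-update template above; the function and parameter names are assumptions.

  def initial_update(sev: int, name: str, start: str, impact: str, scope: str,
                     next_update: str, ic: str, actions: list[str], ask: str) -> str:
      """Render the initial incident update from structured fields."""
      action_lines = "\n".join(f"  - {a}" for a in actions)
      return (
          f"Subject: [Sev{sev}] Incident: {name} ({start})\n"
          f"Status: Investigating\n"
          f"Impact: {impact}\n"
          f"Scope: {scope}\n"
          f"Next update: {next_update}\n"
          f"IC: {ic}\n"
          f"Actions so far:\n{action_lines}\n"
          f"Ask: {ask}"
      )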

Common mistakes and self-check

  • Mistake: Fixing silently. Self-check: Did you send an initial update within 15 minutes?
  • Mistake: No clear commander. Self-check: Is an IC named in the incident record?
  • Mistake: Over-alerting. Self-check: Are >30% of alerts closed without action?
  • Mistake: Skipping follow-ups. Self-check: Do post-incident action items have owners and due dates?
  • Mistake: Blame culture. Self-check: Are reviews focused on systems and signals, not individuals?
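
The over-alerting self-check above is easy to turn into a number. A minimal sketch, assuming each alert record carries a flag for whether it led to action:

  def no_action_ratio(alerts: list[dict]) -> float:
      """Share of alerts closed without any action taken (assumes an 'actioned' flag)."""
      if not alerts:
          return 0.0
      closed_without_action = sum(1 for a in alerts if not a.get("actioned", False))
      return closed_without_action / len(alerts)

  # Rule of thumb from the self-check: above 0.30, start pruning or tuning alerts.
  alerts = [{"actioned": True}, {"actioned": False}, {"actioned": False}]
  print(f"{no_action_ratio(alerts):.0%} of alerts closed without action")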

Who this is for

  • Data Platform Engineers responsible for reliability, data quality, and SLAs.
  • Analytics Engineers and SREs supporting data infra.
  • Team leads defining on-call rotations and response standards.

Prerequisites

  • Basic knowledge of your data stack (orchestration, warehouse, monitoring).
  • Familiarity with alerting concepts and SLOs.
  • Access to team communication channels and incident tracker.

Learning path

  1. Define severities, SLOs, and alert triggers for top 5 pipelines.
  2. Create rotation and escalation policy; test a mock page.
  3. Write 3 runbooks and rehearse an incident drill.
  4. Measure MTTD/MTTR; prune noisy alerts monthly.
  5. Adopt blameless post-incident reviews with tracked action items.
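
Step 4 of the learning path only needs timestamps per incident. A minimal aggregation sketch, assuming each incident record holds impact-start, detection, and resolution times:

  from datetime import datetime
  from statistics import mean

  # Each tuple is (impact_started, detected, resolved); the layout is an assumption.
  incidents = [
      (datetime(2026, 1, 3, 5, 0), datetime(2026, 1, 3, 5, 20), datetime(2026, 1, 3, 7, 10)),
      (datetime(2026, 1, 9, 2, 0), datetime(2026, 1, 9, 2, 5), datetime(2026, 1, 9, 3, 0)),
  ]

  mttd_minutes = mean((det - start).total_seconds() / 60 for start, det, _ in incidents)
  # Here MTTR is measured from detection to resolution; some teams measure from impact start.
  mttr_minutes = mean((res - det).total_seconds() / 60 for _, det, res in incidents)

  print(f"MTTD: {mttd_minutes:.0f} min, MTTR: {mttr_minutes:.0f} min")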

Practical projects

  • Build a runbook library: three high-frequency failures with rollback steps and validation queries.
  • Incident drill: Simulate a Sev2 data quality breach; time each step and collect feedback.
  • Alert hygiene: Reduce alert volume by 30% without increasing MTTR.

Exercises

These mirror the exercises below. Do them here, then compare with the solutions.

Exercise 1: Draft an escalation policy and schedule

Company: 6-person data team, critical daily SLA at 7 AM local time, users across two time zones. Draft primary/secondary rotation, acknowledgement timing, and escalation steps. Include communication cadence.

Exercise 2: Write a runbook outline for a failed DAG due to schema change

Include detection, triage steps, rollback, hotfix/backfill, validation queries, and prevention items.

  • Checklist before you peek at solutions:
    • Defined severity levels and when to page
    • Named IC role and escalation timers
    • Explicit comms cadence and channels
    • Rollback and validation steps included

Mini challenge

In one paragraph, describe how you would reduce MTTR by 20% in the next quarter without adding headcount. Mention alert tuning, runbooks, and drills.

Next steps

  • Implement your escalation policy in your paging tool and run a mock incident.
  • Publish runbooks in a shared doc and schedule quarterly drills.
  • Add MTTR/MTTD metrics to your team dashboard and review monthly.

Practice Exercises

2 exercises to complete

Instructions

Scenario: A 6-person data team supports critical 7 AM SLAs in two time zones (UTC-5 and UTC+1). Create:

  • Primary and secondary weekly rotation.
  • Acknowledgement and escalation timers.
  • Severity mapping and who gets paged.
  • Communication cadence and status message template.
Expected Output
A concise policy document specifying rotation coverage windows, Sev1–Sev3 actions, ack and escalation timers, and a sample initial update.

Incident Management And On Call — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

