Scenario: The Airflow DAG orders_daily fails at task load_to_warehouse, and all retries are exhausted. You must stabilize the pipeline, restore the missing data, and communicate status.
Write a one-page runbook that includes these sections:
- Trigger & impact
- Severity mapping & escalation rules
- Diagnosis steps (top 3 checks)
- Mitigations (safe actions)
- Rollback & backfill steps (with parameters)
- Validation queries
- Communication template
- Owners & last updated
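For the rollback and backfill section, the parameters usually worth stating are the DAG id, the task id, and the date range. The following is a minimal sketch that builds the two commands as argument lists. It assumes the Airflow 2.x CLI (`airflow tasks clear` and `airflow dags backfill`; later major versions changed the backfill command). The helper names and the example dates are hypothetical.

```python
from datetime import date


def clear_task_cmd(dag_id: str, task_id: str, start: date, end: date) -> list[str]:
    """Build an `airflow tasks clear` command (Airflow 2.x CLI assumed)."""
    return [
        "airflow", "tasks", "clear",
        "--start-date", start.isoformat(),
        "--end-date", end.isoformat(),
        "--task-regex", f"^{task_id}$",  # anchor the regex so only this task matches
        "--yes",                          # skip the interactive confirmation prompt
        dag_id,
    ]


def backfill_cmd(dag_id: str, start: date, end: date) -> list[str]:
    """Build an `airflow dags backfill` command (Airflow 2.x CLI assumed)."""
    return [
        "airflow", "dags", "backfill",
        "--start-date", start.isoformat(),
        "--end-date", end.isoformat(),
        dag_id,
    ]


# Hypothetical date range; substitute the failed run dates from the incident.
clear_cmd = clear_task_cmd("orders_daily", "load_to_warehouse",
                           date(2024, 1, 1), date(2024, 1, 3))
fill_cmd = backfill_cmd("orders_daily", date(2024, 1, 1), date(2024, 1, 3))
print(" ".join(clear_cmd))
print(" ".join(fill_cmd))
```

Building the commands as lists rather than shell strings keeps the parameters explicit in the runbook and avoids quoting mistakes if they are later passed to `subprocess.run`.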
Template to fill:
Runbook: Airflow task failure (orders_daily.load_to_warehouse)
Trigger:
Impact:
Severity:
Escalate when:
Diagnosis:
Mitigations:
Rollback & Backfill:
Validation:
Communication:
Owners:
Last updated:
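For the Validation field, one possible shape is a row-count check plus a duplicate-key check on the reloaded partition. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse. The table name `orders`, the columns `order_id` and `order_date`, and the sample rows are all hypothetical, and the real queries would use the warehouse's own schema and SQL dialect.

```python
import sqlite3

# Hypothetical schema: orders(order_id, order_date).
ROW_COUNT_SQL = "SELECT COUNT(*) FROM orders WHERE order_date = ?"
DUPLICATE_SQL = """
    SELECT COUNT(*) FROM (
        SELECT order_id FROM orders
        WHERE order_date = ?
        GROUP BY order_id
        HAVING COUNT(*) > 1
    )
"""


def validate_partition(conn: sqlite3.Connection, run_date: str) -> dict:
    """Return the row count and the number of duplicated order ids for one date."""
    rows = conn.execute(ROW_COUNT_SQL, (run_date,)).fetchone()[0]
    dupes = conn.execute(DUPLICATE_SQL, (run_date,)).fetchone()[0]
    # A partition passes when it has data and no order id appears twice.
    return {"rows": rows, "duplicate_ids": dupes, "ok": rows > 0 and dupes == 0}


# Demo data: order_id 2 is loaded twice, as a double-run backfill might do.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-01"), (2, "2024-01-01")],
)
result = validate_partition(conn, "2024-01-01")
print(result)
```

The duplicate check matters after a backfill because rerunning a non-idempotent load can insert the same rows twice even when the row count looks plausible.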