Why this matters
When a dashboard goes blank, a metric spikes unexpectedly, or data stops refreshing, the business needs fast, repeatable recovery. A clear runbook turns panic into a routine: what to check, who to notify, and how to fix or mitigate. As a BI Analyst, you are often first to triage symptoms, confirm impact, and guide others through resolution.
- Protect revenue and decisions by restoring trusted data quickly.
- Reduce time-to-diagnosis with a known set of checks.
- Improve team coordination with predefined communication and roles.
- Create reusable knowledge for future incidents.
Concept explained simply
A runbook is a short, action-focused guide that tells you exactly what to do when a specific data or reporting issue occurs. Think of it as a checklist you follow when an alert fires or someone reports a problem.
Mental model
Use the cookbook model: trigger, ingredients, recipe, plating.
- Trigger: the symptom that starts the runbook (e.g., stale data alert).
- Ingredients: systems, credentials, queries, contacts.
- Recipe: triage steps, diagnostics, fix/rollback, communications.
- Plating: validation, incident closure, and post-incident notes.
Standard runbook template
Copy this template for any data incident
Title: [Short, specific]
Owner: [Team/Person]
Last updated: [YYYY-MM-DD]
Severity levels: [S1 - Critical, S2 - High, S3 - Medium, S4 - Low]

1) Trigger & symptoms
- When to use: [Alert name or symptom]
- Typical signals: [e.g., dashboard error, freshness > X hours]
- Impact: [Teams/processes affected]

2) Preconditions & tools
- Access needed: [BI tool, warehouse, orchestration]
- Dashboards/queries: [paths or names]
- Contacts: [on-call analyst/engineer, product, support]

3) First 5 minutes (triage)
- Acknowledge incident and set severity
- Confirm symptom and timestamp
- Pause risky actions (e.g., deployments)

4) Diagnostics (10–30 minutes)
- Freshness checks: SELECT max(updated_at) ...
- Volume/anomaly checks: SELECT count(*) ...
- Pipeline status: [check last job run, logs]
- Permissions/connectivity: [BI-to-DB connectivity]

5) Fix or workaround
- Known fixes: [restart job, clear cache, re-run model]
- Rollback: [how, when to use]
- Workaround: [temporary data source or disable broken tile]

6) Communication
- Status channel: [channel/name]
- Update cadence: [e.g., every 30 minutes until resolved]
- Templates: [initial, update, resolved]

7) Validation & close
- Data checks: [freshness, counts, key metrics]
- Stakeholder sign-off: [names/roles]
- Post-incident: [root cause notes, actions, date]
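The diagnostic queries in step 4 should be copy-paste ready. Here is a minimal sketch in Postgres-style SQL, assuming a hypothetical warehouse table analytics.orders with an updated_at timestamp; swap in your own table and column names.

-- Freshness: how long since the table last received new data
SELECT now() - max(updated_at) AS freshness_lag
FROM analytics.orders;

-- Volume: how many rows arrived today
SELECT count(*) AS rows_today
FROM analytics.orders
WHERE updated_at::date = current_date;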
Worked examples
Example 1: Dashboard showing stale data
Symptom: The KPI dashboard shows its last refresh was 12 hours ago (the freshness threshold is 2 hours).
- First 5 minutes: Confirm the freshness lag on the source table with SELECT now() - max(updated_at) (see the query sketch after this list). Check the orchestration tool for a failed job.
- Diagnostics: If job failed, read error message. Check upstream source availability and credentials.
- Fix: Re-run failed job; if upstream is down, activate a fallback snapshot and place a banner on the dashboard.
- Communication: Notify channel with severity S2, ETA if known, and next update time.
- Validation: Ensure freshness under 2 hours and KPIs match expected ranges.
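A sketch of the freshness check against the 2-hour threshold, again assuming the hypothetical analytics.orders table; the boolean column makes the result easy to paste into the incident channel.

-- Compare the freshness lag to the 2-hour threshold (Postgres-style syntax)
SELECT
  max(updated_at)                              AS last_update,
  now() - max(updated_at)                      AS freshness_lag,
  now() - max(updated_at) > interval '2 hours' AS is_stale
FROM analytics.orders;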
Mini runbook snippet
Trigger: Freshness alert > 2h
Checks: max(updated_at), pipeline run logs
Fix: Rerun job, clear BI cache, revalidate last 7 days' trend
Comms: Update every 30m until resolved, then final summary
Example 2: Pipeline failure causes row count drop
Symptom: The daily orders table has 70% fewer rows than a typical weekday.
- First 5 minutes: Compare the row count to the 4-week same-weekday average (see the query sketch after this list).
- Diagnostics: Check deduping logic, late-arriving data, or failed upstream stage.
- Fix: Reprocess the affected partition; if the dedupe is too aggressive, adjust the key or temporarily relax it until a fix is deployed.
- Communication: Share the scope (date partitions affected) and expected recovery time.
- Validation: Recount rows, reconcile with source counts.
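One way to script the baseline comparison, as a sketch: compare today's row count to the average of the same weekday over the previous 4 weeks (hypothetical analytics.orders table, Postgres-style date arithmetic).

-- Row count today vs. the 4-week same-weekday average; flag if below 80% of baseline
WITH daily AS (
  SELECT updated_at::date AS load_date, count(*) AS row_count
  FROM analytics.orders
  WHERE updated_at::date IN (current_date, current_date - 7, current_date - 14,
                             current_date - 21, current_date - 28)
  GROUP BY 1
)
SELECT
  max(row_count) FILTER (WHERE load_date = current_date)  AS rows_today,
  avg(row_count) FILTER (WHERE load_date <  current_date) AS baseline_avg,
  max(row_count) FILTER (WHERE load_date = current_date)
    < 0.8 * avg(row_count) FILTER (WHERE load_date < current_date) AS below_threshold
FROM daily;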
Mini runbook snippet
Trigger: Orders row count < 80% of baseline
Checks: Baseline window, upstream stage status
Fix: Reprocess date, adjust dedupe key, backfill
Comms: Affected dates and business impact (dashboards relying on orders)
Example 3: Metric spike likely due to duplication
Symptom: Revenue spikes 30% in one day with no known event.
- First 5 minutes: Verify dimension and time filters match usual dashboard defaults.
- Diagnostics: Check for duplicates in the fact table, double loads, timezone boundary effects, or a schema change in joins (see the query sketch after this list).
- Fix: Remove duplicate loads, correct join keys, re-run model; temporarily pin dashboards to previous correct partition.
- Communication: State that duplication is suspected and ask the business to confirm whether any campaigns could explain the spike.
- Validation: Compare recalculated revenue to previous baselines and finance extracts.
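A sketch of the duplicate check for the diagnostics step, assuming a hypothetical fact_revenue table with a natural key of order_id plus revenue_date; adapt the key columns and date filter to your model. A similar count of rows before and after the dimension join is a quick way to spot a cardinality problem.

-- Duplicates by natural key + date: any row with dup_count > 1 suggests a double load
SELECT order_id, revenue_date, count(*) AS dup_count
FROM fact_revenue
WHERE revenue_date = current_date - 1   -- the day of the spike
GROUP BY order_id, revenue_date
HAVING count(*) > 1
ORDER BY dup_count DESC
LIMIT 100;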
Mini runbook snippet
Trigger: Revenue day-over-day change > 20% without a known event
Checks: Duplicates by natural key + date; join cardinality
Fix: Purge double-loaded partition, re-run
Comms: Flag as provisional; confirm after validation
How to write a good runbook in 30 minutes
- Pick one frequent incident (e.g., stale dashboard).
- Define the trigger and severity mapping in one sentence each.
- List the 3 fastest triage checks you always run.
- Write the most common fix and a safe rollback or workaround.
- Paste communication templates and set an update cadence.
- Add validation steps and who signs off.
- Save, date it, and share the location with your team.
Quality checklist (open and tick)
- Trigger is specific and observable
- Owner and last updated date present
- First 5-minute triage steps are short and ordered
- Diagnostics include freshness, volume, and logs
- Clear fix, rollback, and workaround
- Communication channel, cadence, and templates
- Validation queries and sign-off
Communication templates
Initial incident message
[INCIDENT][S2] Stale KPI dashboard since 09:00 UTC
Impact: Exec KPI dashboard shows data older than 6h
Started: ~09:10 UTC; Trigger: freshness alert > 2h
Actions: Triage started; checking pipeline status
Next update: 30 minutes or sooner if resolved
Owner: BI On-call
Update message
[UPDATE][S2] KPI dashboard
Findings: Upstream load failed due to credential expiry
ETA: Credentials refreshed; backfill running (~25m)
Next update: 20 minutes
Resolved message
[RESOLVED][S2] KPI dashboard freshness
Fix: Credentials rotated; backfill completed
Validation: Freshness < 2h; KPIs match last week's pattern
Follow-up: Post-incident notes and actions by EOD
Common mistakes and how to self-check
- Vague triggers: Replace generic phrasing with measurable signals (e.g., freshness > 2h).
- Skipping the first 5-minute plan: Write the 3 fastest checks; avoid deep dives too early.
- No communication cadence: Define who, where, and how often to update.
- One-off fixes only: Always include workaround and rollback.
- Missing validation: List the exact queries and thresholds to confirm resolution.
- Stale runbooks: Add last updated date and a quarterly review reminder.
Self-check prompts
- Can a new teammate run this without asking me questions?
- Would this reduce time-to-diagnosis by at least 50%?
- Are the commands and queries copy-paste ready?
Exercises
Do these to practice documenting and using runbooks.
Exercise 1: Draft a stale-data runbook
Create a runbook for the case where a key dashboard shows data older than your freshness threshold.
Exercise 2: Incident communication pack
Write initial, update, and resolved messages plus an update cadence for a data pipeline outage.
Completion checklist
- Both exercises completed
- Owner and last updated date present
- Communication cadence clearly defined
- Validation queries included
Mini challenge
Scenario: A sales conversion metric drops to near zero post-deployment. In 5 bullet points, outline severity, first 3 checks, who you notify, and your rollback plan. Then write a 3-line initial incident message.
Who this is for
- BI Analysts and Analytics Engineers who support dashboards and metrics.
- Data-savvy PMs who coordinate incident response.
- Anyone on data on-call rotation.
Prerequisites
- Basic SQL (counts, max timestamps, simple joins).
- Familiarity with your BI tool and data pipeline orchestrator.
- Access to logs and job histories in your data platform.
Learning path
- Adopt the standard template and fill it for one common incident.
- Create communication templates and agree on a channel and cadence.
- Pilot the runbook in a mock incident and refine based on timing.
- Document 2 more incidents (pipeline failure, metric spike).
- Schedule quarterly runbook reviews.
Practical projects
- Runbook library: Publish 3 incident runbooks in a shared location with owners and review dates.
- Drill: Run a 30-minute tabletop exercise using your runbook and record timings.
- Alert-to-action: Map each alert to its runbook trigger and verify no gaps.
Next steps
- Finish the exercises and take the quick test to check your understanding.
- Share the template with your team and decide on review cadence.
- Add your runbook locations to the on-call handoff notes.