Why this matters
Dashboards tell you what happened; alerts tell you when to act. As a BI Analyst, you will translate business risk into monitoring rules so stakeholders are notified quickly and calmly when metrics drift, pipelines fail, or SLAs are threatened.
- Catch revenue dips or conversion issues within hours, not days.
- Surface data freshness or completeness problems before executives see stale numbers.
- Prevent alert fatigue with well-tuned thresholds and clear severity levels.
Concept explained simply
An alert is a focused notification triggered by a rule. A monitor is the ongoing check behind that alert. Dashboards visualize; monitors watch; alerts nudge.
Mental model
Think of a thermostat: you set a temperature (the threshold), it checks periodically (the monitor), and it only kicks in when conditions warrant (the alert). Hysteresis and cooldowns keep it from flipping on and off; sensor quality (your data quality) determines how much you can trust it.
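A minimal sketch of that thermostat logic in Python (thresholds and names are illustrative): the alert only changes state when the metric crosses separate trigger and clear levels, so small wobbles around one number don't cause flapping.

```python
# Minimal hysteresis sketch: fire below a trigger level, clear only above a
# higher recovery level, so small wobbles around one threshold don't flip state.
def update_alert_state(value, is_firing, trigger=85.0, clear=95.0):
    """Return the new alert state for a metric expressed as % of baseline."""
    if not is_firing and value < trigger:   # crossed down through the trigger
        return True
    if is_firing and value >= clear:        # recovered past the clear level
        return False
    return is_firing                        # otherwise hold the current state

# Example: values wobbling near 85 fire once, and only clear at 96.
state = False
for v in [90, 86, 84, 86, 84, 92, 96]:
    state = update_alert_state(v, state)
    print(v, "FIRING" if state else "ok")
```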
Common alerting and monitoring use cases
- Threshold breach: metric below/above a fixed or baseline value.
- Change detection: sudden drop/spike vs recent trend.
- Anomaly detection: metric falls outside the expected band (rolling mean/median ± k*std, or a robust IQR band).
- Windowed checks: hourly/daily sums and moving averages to smooth noise.
- Funnel health: step conversion or abandonment spikes.
- Data freshness/completeness: late loads, row count gaps, null surges, schema changes.
- SLA/latency: pipeline or dashboard refresh takes too long.
- Budget/pacing: spend or usage deviates from plan.
Worked examples
Example 1: Revenue drop alert (threshold + baseline)
- Metric: Hourly revenue (UTC), with a volume guard of at least 50 orders per hour.
- Baseline: Median of same weekday and hour across the past 4 weeks.
- Rule: Trigger if revenue < 85% of baseline for 3 consecutive hours.
- Severity:
  - High: < 70% of baseline for 2+ hours (page the on-call team + finance)
  - Medium: 70–85% of baseline for 3+ hours (notify growth + ops)
- Cooldown: 3 hours after firing unless condition worsens.
- Message example: "Revenue $48k vs baseline $70k (−31%, last 3h). Min-orders guard passed. Likely causes: homepage A/B test or payments. Runbook: Revenue-Drop."
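A pandas sketch of this rule, assuming an `hourly` DataFrame with a UTC DatetimeIndex and `revenue` and `orders` columns (column names, the helper name, and the lookback default are illustrative, not a specific tool's API):

```python
import pandas as pd

def revenue_drop_alerts(hourly: pd.DataFrame, weeks: int = 4) -> pd.DataFrame:
    """Flag hours where revenue stays below 85% of baseline for 3 hours in a row.

    Assumes `hourly` has a UTC DatetimeIndex and columns: revenue, orders.
    """
    df = hourly.copy()
    df["weekday"] = df.index.dayofweek
    df["hour"] = df.index.hour

    # Seasonal baseline: median of the same weekday+hour over the past `weeks`
    # weeks, excluding the current hour itself.
    def seasonal_baseline(s: pd.Series) -> pd.Series:
        return s.shift(1).rolling(weeks, min_periods=2).median()

    df["baseline"] = (
        df.groupby(["weekday", "hour"])["revenue"].transform(seasonal_baseline)
    )

    # Volume guard: only evaluate hours with >= 50 orders and a known baseline.
    guarded = (df["orders"] >= 50) & df["baseline"].notna()
    below = guarded & (df["revenue"] < 0.85 * df["baseline"])

    # Fire only after 3 consecutive breaching hours.
    streak = below.groupby((~below).cumsum()).cumsum()
    df["fire"] = streak >= 3
    return df[["revenue", "baseline", "fire"]]
```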
Example 2: Conversion rate anomaly (rolling band)
- Metric: Checkout conversion rate (orders / sessions), per device.
- Smoothing: Compute CR on a 6-hour rolling window; ignore windows with sessions < 1,000.
- Band: Rolling median ± 3*rolling MAD (robust to outliers).
- Rule: Alert if CR is below lower band for 2 consecutive windows on both mobile and desktop (2-of-2 gate).
- Reason: Reduces false positives from a single noisy segment.
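A pandas sketch of the band and the 2-of-2 gate, assuming hourly rows per device with `device`, `orders`, and `sessions` columns and a DatetimeIndex; the 24-hour band lookback is an added assumption, not part of the spec above.

```python
import pandas as pd

def cr_anomaly_alerts(df: pd.DataFrame, window: int = 6, k: float = 3.0) -> pd.DataFrame:
    """Flag hours where the rolling conversion rate drops below median - k*MAD
    for 2 consecutive windows on every device (2-of-2 gate).
    """
    pieces = []
    for device, g in df.groupby("device"):
        g = g.sort_index().copy()
        # 6-hour rolling conversion rate with a volume guard on sessions.
        roll_orders = g["orders"].rolling(window).sum()
        roll_sessions = g["sessions"].rolling(window).sum()
        cr = (roll_orders / roll_sessions).where(roll_sessions >= 1_000)

        # Robust band over the previous day of windows (lookback is an assumption).
        med = cr.rolling(24, min_periods=12).median()
        mad = (cr - med).abs().rolling(24, min_periods=12).median()
        g["cr"] = cr
        g["lower_band"] = med - k * mad
        below = g["cr"] < g["lower_band"]
        # Two consecutive breaching windows for this device.
        g["device_fire"] = below & below.shift(1, fill_value=False)
        pieces.append(g)

    result = pd.concat(pieces)
    # 2-of-2 gate: alert only when both devices fire in the same hour.
    fires_per_hour = result.groupby(result.index)["device_fire"].sum()
    result["alert"] = result.index.map(fires_per_hour) >= 2
    return result
```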
Example 3: Pipeline latency SLA
- Metric: Data freshness = now − max(event_time) for core tables.
- SLA: Freshness ≤ 90 minutes from 06:00–23:00 local time.
- Rule:
  - Warn at 75 minutes; escalate at 120 minutes.
  - Suppress during planned maintenance windows.
- Action: Auto-create incident note in dashboard header: "Data delayed; numbers may be incomplete."
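A minimal Python sketch of the freshness check (the maintenance-window list and function name are placeholders, and the 06:00–23:00 business-hours scope is left out for brevity):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical planned maintenance windows as (start, end) pairs in UTC.
MAINTENANCE_WINDOWS = [
    (datetime(2024, 5, 12, 2, 0, tzinfo=timezone.utc),
     datetime(2024, 5, 12, 4, 0, tzinfo=timezone.utc)),
]

def freshness_status(max_event_time: datetime, now: datetime | None = None) -> str:
    """Classify table freshness: ok, warn (>75 min), or critical (>120 min).

    Returns "suppressed" inside a planned maintenance window.
    """
    now = now or datetime.now(timezone.utc)
    if any(start <= now <= end for start, end in MAINTENANCE_WINDOWS):
        return "suppressed"
    lag = now - max_event_time
    if lag > timedelta(minutes=120):
        return "critical"   # escalate, annotate the dashboard header
    if lag > timedelta(minutes=75):
        return "warn"
    return "ok"

# Example: a table whose newest event is 80 minutes old triggers a warning.
print(freshness_status(datetime.now(timezone.utc) - timedelta(minutes=80)))
```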
Designing effective alert rules
- Define the business impact: what changes if this fires?
- Pick a trustworthy metric and unit (sum, rate, ratio) and add volume guards.
- Choose a baseline:
  - Static target: e.g., revenue ≥ $100k/day (simple, but brittle).
  - Seasonal baseline: same weekday/hour over the last 4–8 weeks.
  - Robust trend: rolling median or EWMA to smooth short-term shocks.
- Set windowing to reduce noise (e.g., 3-hour rolling sum).
- Define severity levels and escalation paths.
- Prevent spam: cooldowns, deduplication, and quiet hours.
- Explain the rule in plain language in the alert text.
Threshold recipes you can reuse
- Percent-of-baseline: fire if metric < X% of baseline for N windows.
- Change-rate: fire if |current − baseline| / baseline > Y% and volume ≥ min_events.
- Robust band: median ± k*MAD with k between 3 and 5 for ratios.
- 2-of-3 gate: fire only when at least 2 of 3 segments (devices, regions, or channels) confirm the condition.
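These recipes translate into small reusable helpers. A Python sketch with illustrative names and defaults:

```python
from statistics import median

def pct_of_baseline_breach(values, baselines, x_pct=85, n_windows=3):
    """Percent-of-baseline: True if the last n windows are all < x% of baseline."""
    recent = list(zip(values, baselines))[-n_windows:]
    return len(recent) == n_windows and all(v < (x_pct / 100) * b for v, b in recent)

def change_rate_breach(current, baseline, volume, y_pct=20, min_events=500):
    """Change-rate: relative change beyond y% with a minimum-volume guard."""
    if volume < min_events or baseline == 0:
        return False
    return abs(current - baseline) / baseline > y_pct / 100

def robust_band(history, k=3.0):
    """Robust band: (lower, upper) = median ± k * MAD over the history."""
    m = median(history)
    mad = median(abs(v - m) for v in history)
    return m - k * mad, m + k * mad

def two_of_three_gate(segment_flags):
    """2-of-3 gate: fire only when at least 2 of 3 segments agree."""
    return sum(bool(f) for f in segment_flags) >= 2

# Example: the robust band for a stable conversion-rate series.
lower, upper = robust_band([0.041, 0.043, 0.040, 0.042, 0.044, 0.039])
print(round(lower, 4), round(upper, 4))
```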
Prevent alert fatigue
- Use windows and minimum volume guards.
- Add cooldowns and deduplicate repeated alerts with the same fingerprint.
- Set different thresholds for segments with different variability.
- Mute during planned campaigns or maintenance.
- Prioritize business impact over minor fluctuations.
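A minimal sketch of the cooldown-plus-dedup idea in Python; the in-memory dictionary stands in for wherever your alerting tool actually stores alert state:

```python
import hashlib
import time

# Last-fired time per alert fingerprint (illustrative only; a real setup would
# persist this in the alerting tool or a small state table).
_last_fired: dict[str, float] = {}

def should_notify(rule: str, segment: str, severity: str,
                  cooldown_seconds: int = 3 * 3600) -> bool:
    """Suppress repeats: the same rule+segment+severity notifies once per cooldown."""
    fingerprint = hashlib.sha256(f"{rule}|{segment}|{severity}".encode()).hexdigest()
    now = time.time()
    last = _last_fired.get(fingerprint)
    if last is not None and now - last < cooldown_seconds:
        return False          # deduplicated: still inside the cooldown window
    _last_fired[fingerprint] = now
    return True

# Example: the second identical alert inside the cooldown is suppressed.
print(should_notify("revenue_drop", "web", "medium"))  # True
print(should_notify("revenue_drop", "web", "medium"))  # False
```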
Pre-flight checklist (before enabling any alert)
- Clear owner and on-call rotation for this metric.
- Plain-language trigger condition and example values.
- Runbook name with first steps to diagnose.
- Baseline logic documented and visible on the dashboard.
- Test alert fired in a sandbox and reviewed.
Monitor data quality
- Freshness: max(event_time) is within SLA.
- Completeness: expected rows vs actual; gap thresholds.
- Validity: null rate, out-of-range, negative values.
- Uniqueness: primary key dupes.
- Schema drift: added/removed columns.
Sample data-quality rules
- orders.freshness_minutes ≤ 90 (warn at 75, critical at 120)
- orders null_rate(payment_status) ≤ 0.5%
- events daily_rowcount ≥ 95% of median(last 14 same weekdays)
- users duplicate_ids = 0
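A pandas sketch of these four rules as one report; the table and column names (`event_time`, `payment_status`, `user_id`) are assumptions about your schema, not a given:

```python
import pandas as pd

def data_quality_report(orders: pd.DataFrame, users: pd.DataFrame,
                        daily_event_counts: pd.Series) -> dict:
    """Evaluate the sample rules above.

    Assumes: orders has tz-aware UTC event_time and payment_status columns,
    users has user_id, daily_event_counts is indexed by date.
    """
    now = pd.Timestamp.now(tz="UTC")
    checks = {}

    # Freshness: minutes since the newest order event (<= 90 overall;
    # warn-at-75 / critical-at-120 would layer on top of this boolean).
    freshness_min = (now - orders["event_time"].max()).total_seconds() / 60
    checks["orders_freshness_ok"] = freshness_min <= 90

    # Validity: null rate of payment_status must stay under 0.5%.
    null_rate = orders["payment_status"].isna().mean()
    checks["orders_payment_status_nulls_ok"] = null_rate <= 0.005

    # Completeness: today's row count vs the median of recent same weekdays.
    same_weekday = daily_event_counts[
        daily_event_counts.index.dayofweek == daily_event_counts.index[-1].dayofweek
    ]
    baseline = same_weekday.iloc[:-1].tail(14).median()
    checks["events_rowcount_ok"] = daily_event_counts.iloc[-1] >= 0.95 * baseline

    # Uniqueness: no duplicate user IDs.
    checks["users_unique_ids_ok"] = not users["user_id"].duplicated().any()
    return checks
```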
Dashboard design patterns for monitoring
- Status banner at top: "Systems normal" or short issue summary.
- Metric cards with threshold coloring and small context note: baseline, window, last checked time.
- Sparklines with shaded expected band; recent points outside band highlighted.
- Tooltips or footnotes that explain how thresholds are calculated.
- Incident annotation strip showing when alerts fired.
Exercises
Exercise 1: Design an alerting and monitoring spec for a revenue dashboard
You own an e-commerce revenue dashboard with hourly updates. Build an alerting spec that catches real revenue issues while avoiding noise.
- Choose metrics: revenue, orders, conversion rate. Add guards for volume.
- Select baselines for each (e.g., same weekday/hour median of last 4–8 weeks).
- Define rules, windows, severities, cooldowns, and planned mute windows.
- Add data-quality monitors for freshness, completeness, and nulls.
- Write the alert message template in plain language.
Deliverable checklist:
- Trigger conditions for each metric
- Baseline math and window size
- Severity mapping and routing
- Cooldown and dedup policy
- Runbook first steps
Common mistakes and self-check
- Using static thresholds on seasonal metrics. Fix: same weekday/hour baselines.
- Alerting on ratios without volume guards. Fix: require min sessions/orders.
- No cooldowns. Fix: add time-based suppression after firing.
- Too many owners or none. Fix: assign a single accountable team.
- Hidden logic. Fix: display the rule and baseline on the dashboard.
Self-check
- Can you explain your rule to a non-analyst in one sentence?
- Does a 1-hour data delay falsely trigger anything?
- Would this alert have caught the last real incident you had?
Practical projects
- Implement a status banner with data freshness, last load time, and row count indicators.
- Create an anomaly band chart for a key KPI using rolling median and MAD.
- Build a weekly alert review dashboard: fired counts, false positives, mean time to acknowledge, top noisy rules.
Learning path
- Before this: KPI selection, baselining, and seasonality understanding.
- This module: Alerting and monitoring use cases on dashboards.
- Next: Runbooks and incident workflows; annotation of incidents on dashboards; scheduled health reviews.
Who this is for
- BI Analysts building production dashboards for business stakeholders.
- Analysts responsible for KPI health and on-call triage during business hours.
Prerequisites
- Comfort with KPI definitions and basic statistics (median, MAD, standard deviation).
- Understanding of your data refresh pipeline and typical delays.
- Access to your BI tool’s scheduling/refresh settings.
Mini challenge
Your marketing team launched a big campaign that doubled traffic for 24 hours. Your conversion rate alert fired repeatedly. Adjust the rule to avoid false positives during planned traffic spikes while still catching real conversion issues. Write the revised rule and a 1-sentence justification.
Next steps
- Pick one KPI and implement a single well-tuned alert this week.
- Add a dashboard status banner with freshness and last incident note.
- Schedule a monthly alert review to prune noisy rules.