Why this matters
Automated checks and alerts keep dashboards trustworthy without manual babysitting. As a BI Analyst, you will:
- Catch broken pipelines early (freshness/volume checks) before stakeholders notice.
- Detect data drift (spikes in nulls or outliers) that can mislead reports.
- Escalate issues with clear, actionable alerts and reduce alert fatigue.
- Log results for audits and continuous improvement.
Concept explained simply
An automated check is a small test that runs on a schedule to verify data assumptions (like “no null emails” or “orders arrive hourly”). An alert is a message sent when a check breaks a threshold.
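For example, the “no null emails” assumption becomes one small query (analytics.customers appears again in the worked examples below; adjust names to your schema):
-- Count rows violating the assumption; alert if violating_rows > 0.
SELECT COUNT(*) AS violating_rows
FROM analytics.customers
WHERE customer_email IS NULL;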
Mental model
Think of a building with smoke detectors:
- Detector = Check logic (e.g., null rate > 1%).
- Polling = Schedule (e.g., hourly).
- Alarm = Alert sent to the right people with context.
- Logbook = Monitoring table that stores every run.
Core building blocks
1) Check types you’ll use often
- Freshness: last data timestamp within X time.
- Volume: row count within expected range.
- Null/duplicate rate: percent of bad records below threshold.
- Range/constraint: values within valid bounds, referential integrity holds (see the sketch after this list).
- Drift: distribution change versus baseline (simple: mean/median shift).
2) Thresholds
- Static: fixed limits (e.g., nulls < 1%).
- Dynamic: based on history, e.g., a moving average, median absolute deviation (MAD), or seasonal baselines.
3) Scheduling
- Align with data arrival (e.g., 15 minutes after ETL).
- Use maintenance windows to mute during known outages.
4) Alert routing
- Severity levels (info, warning, critical).
- Channels (email/ChatOps/on-call). Send only what matters to each audience.
5) Logging
- Write every run to a monitoring table: check_name, status, metrics, runtime, error_message.
- Track false positives to tune thresholds.
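A minimal sketch of the referential-integrity check from item 1, assuming an analytics.order_items table whose order_id should always match a row in analytics.orders (order_items is an illustrative name; substitute your own):
-- Orphaned foreign keys: order_items rows with no parent order.
SELECT COUNT(*) AS orphan_rows
FROM analytics.order_items oi
LEFT JOIN analytics.orders o ON o.order_id = oi.order_id
WHERE o.order_id IS NULL;
-- Alert if orphan_rows > 0.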
Worked examples
Example 1 — Freshness check (orders)
Goal: Confirm the latest ingested record is not older than 2 hours.
-- Postgres-style SQL
SELECT
MAX(ingested_at) AS last_ingest,
NOW() AS checked_at,
EXTRACT(EPOCH FROM NOW() - MAX(ingested_at))/3600.0 AS hours_delayed,
CASE WHEN MAX(ingested_at) < NOW() - INTERVAL '2 hours' THEN TRUE ELSE FALSE END AS is_stale
FROM analytics.orders;
Alert if is_stale = TRUE. Include hours_delayed and affected dashboards in the alert message.
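If your scheduler alerts whenever a query returns rows (an assumption about your tooling), a variation of the same check returns a row only on breach:
-- Returns a row only when stale; Postgres allows HAVING without GROUP BY,
-- treating the whole table as one group.
SELECT
  MAX(ingested_at) AS last_ingest,
  EXTRACT(EPOCH FROM NOW() - MAX(ingested_at))/3600.0 AS hours_delayed
FROM analytics.orders
HAVING MAX(ingested_at) < NOW() - INTERVAL '2 hours';
-- Caveat: an empty table returns no row here, so pair this with a volume check.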
Example 2 — Null rate check (customer_email)
Goal: Keep null emails below 1% for yesterday’s data.
WITH stats AS (
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN customer_email IS NULL OR customer_email = '' THEN 1 ELSE 0 END) AS nulls
FROM analytics.customers
WHERE created_at >= CURRENT_DATE - INTERVAL '1 day'
AND created_at < CURRENT_DATE
)
SELECT
total_rows,
nulls,
ROUND(100.0 * nulls / NULLIF(total_rows, 0), 2) AS null_rate_pct,
CASE WHEN 100.0 * nulls / NULLIF(total_rows, 0) > 1.0 THEN TRUE ELSE FALSE END AS breach
FROM stats;
Alert if breach = TRUE. Guard against divide-by-zero with NULLIF.
Example 3 — Volume anomaly (moving baseline)
Goal: Flag today’s order count if it deviates from the last 7 days by more than 3 standard deviations.
WITH daily AS (
SELECT
DATE(created_at) AS d,
COUNT(*) AS cnt
FROM analytics.orders
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1
),
baseline AS (
SELECT AVG(cnt) AS mean_cnt, STDDEV_SAMP(cnt) AS sd
FROM daily
WHERE d < CURRENT_DATE
),
today AS (
SELECT cnt FROM daily WHERE d = CURRENT_DATE
)
SELECT
  t.cnt AS today_cnt,
  b.mean_cnt,
  b.sd,
  CASE WHEN b.sd IS NULL THEN FALSE
       WHEN t.cnt < b.mean_cnt - 3*b.sd OR t.cnt > b.mean_cnt + 3*b.sd THEN TRUE
       ELSE FALSE END AS is_anomaly
FROM today t
CROSS JOIN baseline b;
If your data is sparse, use median/MAD or a ±20% band instead of standard deviation.
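A sketch of that median/MAD variant, using Postgres' PERCENTILE_CONT for the median (the 3x multiplier is a starting point to tune, not a standard):
WITH daily AS (
  SELECT DATE(created_at) AS d, COUNT(*) AS cnt
  FROM analytics.orders
  WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
  GROUP BY 1
),
med AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cnt) AS median_cnt
  FROM daily
  WHERE d < CURRENT_DATE
),
mad AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ABS(daily.cnt - med.median_cnt)) AS mad_cnt
  FROM daily CROSS JOIN med
  WHERE daily.d < CURRENT_DATE
)
SELECT
  t.cnt AS today_cnt,
  m.median_cnt,
  a.mad_cnt,
  -- If mad_cnt is 0 (very stable counts), any change flags; add a small floor if needed.
  ABS(t.cnt - m.median_cnt) > 3 * a.mad_cnt AS is_anomaly
FROM (SELECT cnt FROM daily WHERE d = CURRENT_DATE) t
CROSS JOIN med m
CROSS JOIN mad a;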
Design alerts people read
- Keep messages short and actionable.
- Include: what broke, severity, metric value vs. threshold, scope (table/partition), suspected impact, and owner.
- Point to your monitoring dashboard or runbook by name (a mention is enough; don't include URLs here).
Alert template
[CRITICAL] Freshness breach — analytics.orders
Observed: hours_delayed=3.4 (threshold <= 2)
Scope: table=analytics.orders, partition=today
Impact: Order dashboards may be stale
Action: Run job 'orders_ingest' then re-run check 'orders_freshness'
Owner: Data Platform On-call
Setup steps (quick start)
- Pick 3 checks: freshness (core tables), volume (daily), null rate (key columns).
- Create a monitoring table with columns: check_name, run_at, status, metric, threshold, details, error_message (see the sketch after these steps).
- Write SQL for each check; insert a row into the monitoring table with the result and status.
- Schedule checks right after ETL completes; add maintenance windows if needed.
- Route alerts by severity: warnings to channel, critical to on-call.
- Dry-run for a week; adjust thresholds to cut false positives.
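A minimal sketch of the monitoring table from the first step and one logged run (analytics.check_runs is an assumed name; types are Postgres-style defaults):
-- Monitoring table matching the column list above.
CREATE TABLE IF NOT EXISTS analytics.check_runs (
  check_name    TEXT        NOT NULL,
  run_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  status        TEXT        NOT NULL,  -- e.g., 'pass', 'fail', 'error'
  metric        NUMERIC,               -- observed value (e.g., hours_delayed)
  threshold     NUMERIC,               -- limit the metric was compared to
  details       TEXT,                  -- scope/context for the alert message
  error_message TEXT                   -- set when the check itself failed
);

-- Log one run of the freshness check from Example 1.
INSERT INTO analytics.check_runs (check_name, status, metric, threshold, details)
SELECT
  'orders_freshness',
  CASE WHEN MAX(ingested_at) < NOW() - INTERVAL '2 hours' THEN 'fail' ELSE 'pass' END,
  EXTRACT(EPOCH FROM NOW() - MAX(ingested_at))/3600.0,
  2.0,
  'table=analytics.orders'
FROM analytics.orders;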
Exercises
These mirror the tasks in the Exercises panel below. Do them in SQL against your BI warehouse.
Exercise 1 — Duplicate rate (orders)
Write a query that measures duplicate order_id values in the last 24 hours. Return:
- duplicate_count
- duplicate_rate_pct (0–100)
Alert if duplicate_rate_pct > 0.10 (0.10%).
- Checklist:
- Uses a 24-hour time filter.
- Handles empty sets safely.
- One-row numeric output.
Exercise 2 — Dynamic volume threshold
Create a query that flags today as an anomaly if today's row count deviates by more than 20% from the average of the previous 7 days.
- Checklist:
- Excludes today from baseline.
- Computes band = mean ± 20%.
- Outputs is_anomaly boolean.
Common mistakes and self-check
- Missing time windows: Checks run before data lands. Self-check: Is schedule offset from ETL finish?
- Hard thresholds everywhere: Leads to alert fatigue. Self-check: Do at least volume checks use a baseline?
- No logs: Only notifications, no history. Self-check: Can you show a 30-day trend per check?
- Unhandled errors: Query failure means no alert. Self-check: Do you alert on check errors as critical?
- Vague alerts: No metric values or actions. Self-check: Does your message include metric vs threshold and next steps?
Who this is for and prerequisites
- Who: BI Analysts, Analytics Engineers, Data-savvy PMs who own dashboards.
- Prerequisites: Basic SQL (GROUP BY, CASE), understanding of your ETL schedule, access to the warehouse.
Practical projects
- Project 1: Monitoring table + 3 core checks (freshness, volume, null rate) for your top two tables.
- Project 2: Weekly alert digest summarizing all breaches, top noisy checks, and tuning suggestions.
- Project 3: Severity-based routing with maintenance windows, plus a simple runbook per check.
Learning path
- Start: Static thresholds on key checks.
- Next: Dynamic baselines (moving average, median/MAD).
- Then: Partition-level checks (per region/day) and referential integrity checks.
- Advanced: Seasonality-aware baselines and incident postmortems.
Mini challenge
Pick one dashboard your stakeholders use daily. Implement two checks that protect it (freshness and volume), add a clear alert message, and run them for 7 days. Tune thresholds to reduce false positives by at least 50%.
Next steps
- Do the exercises below, then take the Quick Test at the end.
- Note: The test is available to everyone; only logged-in users will see saved progress.