Why this matters
As an Analytics Engineer, you are responsible for keeping decision-critical data trustworthy. Quality dashboards show the health of pipelines; alerts notify the right people when something is off. Together, they reduce downtime, prevent bad decisions, and speed up recovery when incidents happen.
- Real tasks you will do: summarize test results, track freshness and volume, define alert thresholds, route alerts to the right channel, and write runbooks so anyone can fix common issues quickly.
- Outcomes: fewer surprises, faster incident response, clear visibility into SLAs, and confidence in core metrics.
Concept explained simply
A quality dashboard is a control panel for your data. It shows key signals: is data fresh, complete, valid, and consistent? Alerts are automated nudges that fire when a signal crosses a boundary (threshold or anomaly), so humans can act.
Mental model
Think of it like operating a kitchen:
- Prep lists (dashboards) show what is ready and what is missing.
- Kitchen timers (alerts) ring before things burn, not after.
- Recipes (runbooks) tell anyone how to fix a problem the same reliable way.
Common quality signals to track
- Freshness: time since last successful load.
- Volume: rows loaded vs expected.
- Completeness: required fields not null.
- Validity: values match business rules (e.g., amount >= 0).
- Uniqueness: no duplicate primary keys.
- Consistency: joins align across systems; schema unchanged.
- Test coverage: % models with tests; failures by severity.
- Alert response: time to acknowledge and resolve incidents.
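Several of these signals can come from a single query per table. A minimal sketch, assuming a hypothetical orders table with loaded_at and order_id columns (Postgres-style SQL; adjust functions for your warehouse):
-- Freshness, volume, and uniqueness signals for a hypothetical orders table.
select
    extract(epoch from (now() - max(loaded_at))) / 60.0 as freshness_lag_minutes,                -- minutes since last load
    sum(case when loaded_at >= now() - interval '24 hours' then 1 else 0 end) as rows_last_24h,  -- volume
    count(*) - count(distinct order_id) as duplicate_order_ids                                   -- primary-key uniqueness
from orders;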
What to measure on your dashboard
- Overview: % green models, open incidents, past 7-day failure trend.
- Freshness by domain: critical tables first; SLA vs actual lag.
- Volume and null rates: sudden drops/spikes and field-level completeness.
- Test failures: by severity (warn/error), by owner/domain.
- Schema changes: new/removed columns, breaking changes flagged.
- SLO tracking: uptime of critical data, breach counters.
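The “Freshness by domain” tile is typically a join between actual lag and a small metadata table of SLAs and owners. A possible sketch, assuming hypothetical data_asset_metadata and load_audit tables (names are illustrative):
-- Hypothetical tables: data_asset_metadata(table_name, domain, owner, freshness_sla_minutes)
-- and load_audit(table_name, loaded_at).
with latest_loads as (
    select table_name, max(loaded_at) as last_loaded_at
    from load_audit
    group by table_name
)
select
    m.domain,
    m.table_name,
    m.owner,
    m.freshness_sla_minutes as sla_minutes,
    extract(epoch from (now() - l.last_loaded_at)) / 60.0 as actual_lag_minutes,
    case
        when l.last_loaded_at is null then 'UNKNOWN'
        when extract(epoch from (now() - l.last_loaded_at)) / 60.0 > m.freshness_sla_minutes then 'BREACH'
        else 'OK'
    end as sla_status
from data_asset_metadata m
left join latest_loads l on l.table_name = m.table_name
order by sla_status, m.domain;  -- 'BREACH' sorts before 'OK', so problem tables surface first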
Useful visuals
- Traffic-light cards: Freshness, Volume, Completeness.
- Sparkline trends: failures over the past 14 days.
- Waterfall: incident age and backlog.
- Bar chart: failures by data domain/owner.
Alert design (rules, routing, runbooks)
- Choose a metric and condition. Example: “orders table freshness lag > 60 minutes for 2 consecutive checks.”
- Pick a severity: Warn (heads-up) or Error (action required), and route it to the on-call channel or owner group.
- Write a message that includes: table/model, environment, failing metric, last good run, suspected cause link (or hint), and the first runbook step.
- Escalate if unresolved for X minutes, then notify the broader team or create an incident ticket.
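“For 2 consecutive checks” is easiest to evaluate from a log of check results. A sketch, assuming a hypothetical freshness_checks table that a scheduled job appends to; the alerting tool then fires whenever this query returns rows, with routing and escalation configured there:
-- Hypothetical freshness_checks(table_name, checked_at, lag_minutes), appended by a scheduled job.
with ranked as (
    select
        table_name,
        checked_at,
        lag_minutes,
        lag(lag_minutes) over (partition by table_name order by checked_at) as previous_lag_minutes,
        row_number() over (partition by table_name order by checked_at desc) as recency_rank
    from freshness_checks
)
select table_name, checked_at, lag_minutes, previous_lag_minutes
from ranked
where recency_rank = 1            -- latest check per table
  and lag_minutes > 60            -- current check breaches the threshold
  and previous_lag_minutes > 60;  -- and so did the one before it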
Alert hygiene checklist
- Clear title and one-line reason.
- Actionable: who does what next?
- Deduplicated: avoid noisy repeats.
- Suppressed during planned maintenance.
- Logs and job IDs included.
Worked examples
Example 1: Daily sales freshness
Goal: Ensure the sales_mart table is updated by 07:15 every day.
- Dashboard: freshness card shows “Lag: 12m, SLA: 15m, Status: OK”.
- Alert: if lag > 15m at 07:16, send Warning to #sales-data; if still > 15m at 07:30, escalate Error to on-call.
- Runbook: check last pipeline run ID, re-run step 3 (staging loads), verify warehouse credits and source API status.
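A sketch of the query behind that freshness card, assuming a hypothetical sales_mart table with a loaded_at column; the 07:30 cutoff mirrors the escalation rule above:
-- Hypothetical: sales_mart with a loaded_at column, 15-minute SLA, escalation at 07:30.
with lag_calc as (
    select extract(epoch from (now() - max(loaded_at))) / 60.0 as lag_minutes
    from sales_mart
)
select
    round(lag_minutes)::int as lag_minutes,
    15 as sla_minutes,
    case
        when lag_minutes <= 15 then 'OK'             -- within SLA
        when now()::time < time '07:30' then 'WARN'  -- first breach: heads-up to #sales-data
        else 'ERROR'                                 -- still late at 07:30: page on-call
    end as status
from lag_calc;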
Example 2: dbt test summary rollup
Goal: Reduce noise and highlight critical issues.
- Dashboard: tile “Critical errors: 2; Warnings: 7; Coverage: 86% models tested.”
- Alert: fire only if critical errors > 0 in core domain models; batch non-core warnings as a digest every 2 hours.
- Runbook: if PK uniqueness fails, quarantine duplicates with a hotfix model; backfill downstream marts after fix.
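One way to feed that tile, assuming hypothetical models and test_results tables refreshed from the latest dbt run’s artifacts (the column names are illustrative, not dbt’s own schema):
-- Hypothetical tables: models(model_name, domain, owner) and test_results(model_name, severity, status).
select
    sum(case when t.status = 'fail' and t.severity = 'error' then 1 else 0 end) as critical_errors,
    sum(case when t.status = 'fail' and t.severity = 'error' and m.domain = 'core' then 1 else 0 end) as core_critical_errors,  -- alert condition
    sum(case when t.status = 'fail' and t.severity = 'warn' then 1 else 0 end) as warnings,
    round(100.0 * count(distinct t.model_name) / nullif(count(distinct m.model_name), 0), 1) as coverage_pct  -- % of models with at least one test
from models m
left join test_results t on t.model_name = m.model_name;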
Example 3: Volume anomaly for web_events
Goal: Catch silent drops without brittle thresholds.
- Dashboard: 7-day sparkline of hourly events; anomaly band highlighted.
- Alert: trigger if the actual count is more than 40% below the 7-day median for 3 consecutive hours.
- Runbook: confirm tracking script deployment, compare by device type and region to isolate source, toggle fallback ingestion.
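A sketch of that anomaly rule, assuming a hypothetical web_events table with an event_at timestamp (Postgres-style SQL):
-- Alerts when each of the last 3 complete hours is more than 40% below the median hourly
-- count of the preceding 7 days.
with hourly as (
    select date_trunc('hour', event_at) as event_hour, count(*) as events
    from web_events
    where event_at >= now() - interval '8 days'
    group by 1
),
baseline as (
    select percentile_cont(0.5) within group (order by events) as median_hourly_events
    from hourly
    where event_hour >= date_trunc('hour', now()) - interval '7 days'
      and event_hour <  date_trunc('hour', now()) - interval '3 hours'
),
recent as (
    select count(*) as low_hours
    from hourly
    cross join baseline
    where event_hour >= date_trunc('hour', now()) - interval '3 hours'
      and event_hour <  date_trunc('hour', now())  -- last 3 complete hours
      and events < 0.6 * median_hourly_events      -- more than 40% below the median
)
select case when low_hours = 3 then 'ALERT' else 'OK' end as volume_status
from recent;
-- Note: an hour with zero events produces no row in hourly at all, so pair this with a freshness check.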
Example 4: SLA burn-down for finance close
Goal: Guarantee end-of-month data readiness.
- Dashboard: checklist of must-pass tests for finance models with due times.
- Alert: page the finance data owner if any must-pass test is still failing at T-2 hours (two hours before close).
- Runbook: revert to last green snapshot and reprocess only impacted partitions.
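Both the checklist tile and the paging condition can come from one query, assuming hypothetical finance_required_tests and test_results tables:
-- Hypothetical tables: finance_required_tests(test_name, model_name, due_at, owner)
-- and test_results(test_name, status). Rows returned here are the tests to page about.
select
    r.test_name,
    r.model_name,
    r.owner,
    r.due_at,
    coalesce(t.status, 'not run') as status
from finance_required_tests r
left join test_results t on t.test_name = r.test_name
where coalesce(t.status, 'fail') <> 'pass'     -- still failing or never run
  and now() >= r.due_at - interval '2 hours';  -- inside the T-2 hour window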
Who this is for and prerequisites
- Who: Analytics Engineers, BI Developers, Data Engineers responsible for model reliability and stakeholder-facing dashboards.
- Prerequisites: basic SQL, familiarity with your orchestrator (e.g., scheduled jobs), understanding of data tests (freshness, not_null, unique), and access to monitoring metrics.
Learning path
- Map critical data assets: list tables, owners, SLAs, and required tests.
- Define alertable metrics: freshness, volume, null rate, test severity.
- Build the dashboard: start with 3–5 tiles; add trends and ownership.
- Add alerts + runbooks: choose channels, dedupe, and practice a drill.
Mini tasks while learning
- Write one freshness rule with escalation.
- Create a “top failing models” tile.
- Add owner and runbook link fields to your metadata table.
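For the last mini task, a sketch assuming a hypothetical data_asset_metadata table; the owner and URL values are placeholders:
-- Add ownership and a runbook pointer so every tile and alert can show who to ping and where the fix steps live.
alter table data_asset_metadata add column owner varchar(100);
alter table data_asset_metadata add column runbook_url varchar(500);
update data_asset_metadata
set owner = 'sales-data-team',                                    -- placeholder owner
    runbook_url = 'https://wiki.example.com/runbooks/sales_mart'  -- placeholder link
where table_name = 'sales_mart';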
Exercises
These mirror the interactive tasks below. Do them to prepare for the quick test.
Exercise 1 — Design an alert rule
Write a concise alert configuration for orders table freshness with two-stage escalation and a clear, actionable message.
- Metric: freshness_lag_minutes
- Primary threshold: > 60 for 2 consecutive checks
- Escalation: if still failing after 180 minutes
- Include: severity, routing, message template, and a runbook pointer
# Draft your rule here as YAML-like text
Exercise 2 — SQL for completeness and volume
Write a SQL query that, for the last 24 hours of orders, outputs:
- row_count
- required_not_null_rate for order_id
- valid_amount_rate where amount >= 0
Expected: a single row with three numeric fields.
-- Write your SQL here
Self-check checklist
- Is the alert message actionable (who/what/next)?
- Does the SQL filter to last 24 hours and cast divisions safely?
- Did you avoid alert loops (noisy repeats)?
Common mistakes and how to self-check
- No owners: Every tile and alert should list an owner. Self-check: can a newcomer tell who to ping?
- Flat thresholds only: Add trend-based or consecutive-checks logic to reduce noise.
- Missing runbooks: Alerts without steps create panic. Add the first three steps inline.
- Over-wide dashboards: Start small; prioritize critical metrics above the fold.
- No suppression windows: Mute during planned maintenance to avoid alert fatigue.
Practical projects
- Build a “Core Data Health” dashboard with 5 tiles: freshness for 3 critical tables, a volume anomaly tile, and a critical test failures tile.
- Implement two alert rules with escalation and a shared runbook template.
- Create a daily digest job summarizing non-critical warnings to reduce noise.
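The digest job in the last project can be a scheduled query whose output is posted to a channel, reusing the hypothetical models and test_results tables sketched in Example 2:
-- Summarize open non-critical warnings by domain and owner; run every 2 hours and post as a digest.
select
    m.domain,
    m.owner,
    count(*) as open_warnings,
    string_agg(distinct t.model_name, ', ') as affected_models
from test_results t
join models m on m.model_name = t.model_name
where t.status = 'fail'
  and t.severity = 'warn'
  and m.domain <> 'core'
group by m.domain, m.owner
order by open_warnings desc;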
Mini challenge
Pick one critical table. Add a freshness card to your dashboard, create a two-stage alert with a 60-minute threshold, and write a 5-step runbook. Time-box to 60 minutes.
Tip if you get stuck
Start with the last successful load timestamp and compute lag in minutes. If you don’t have a metric table, materialize one with a simple SELECT now() - max(loaded_at).
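A minimal version of that metric, materialized as a view over a hypothetical orders table:
-- Compute lag from the last successful load timestamp; add a union per table you track.
create or replace view freshness_metrics as
select
    'orders' as table_name,
    max(loaded_at) as last_loaded_at,
    extract(epoch from (now() - max(loaded_at))) / 60.0 as lag_minutes
from orders;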
Next steps
- Expand from freshness/volume to field-level validity tests for key dimensions and facts.
- Add incident metrics: time-to-acknowledge and time-to-resolve.
- Schedule a monthly “alert hygiene” review: remove dead alerts, tune thresholds, and update runbooks.
Quick Test note: Everyone can take the test for free. Only logged-in users have their progress saved.