Alerting And Dashboards

Learn Alerting And Dashboards for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

In production, great models fail quietly unless you watch them. Alerting and dashboards let you catch data drift, latency spikes, and quality regressions before customers or stakeholders do. Typical real-world tasks include:

  • Being on-call for the prediction API and getting paged only for real incidents, not noise.
  • Detecting data schema changes that break feature pipelines.
  • Spotting model drift early and triggering retraining or rollback.
  • Proving reliability with SLOs and error budgets to product and SRE teams.

Concept explained simply

Alerts are your fire alarms: they tell you when something needs immediate attention. Dashboards are your cockpit: they help you understand what is happening and why.

Mental model

  • Three layers to watch: Data → Model → System. Each needs a few clear health indicators.
  • Define SLIs (what you measure) and SLOs (targets). Alert when SLOs are at risk.
  • Reduce noise with aggregation windows and severities: Info (observe), Warning (soon bad), Critical (wake someone).

Key metrics and alert design

System health

  • Availability: success rate of requests (e.g., HTTP 2xx/total).
  • Latency: p50/p95/p99 request time.
  • Error rate: 5xx, timeouts, model load failures.
  • Throughput: QPS/RPS, queue depth.
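
A minimal sketch of how these system SLIs could be computed from one aggregation window of request logs. The record fields and the 60-second window are assumptions for illustration, not a required schema:

    import numpy as np

    def system_health(requests, window_seconds=60):
        """Availability, latency percentiles, error rate, and throughput for one
        window. Each request is assumed to look like {"status": 200, "latency_ms": 87.5};
        the window is assumed non-empty."""
        statuses = np.array([r["status"] for r in requests])
        latencies = np.array([r["latency_ms"] for r in requests])
        p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
        return {
            "availability": float(np.mean((statuses >= 200) & (statuses < 300))),
            "error_rate": float(np.mean(statuses >= 500)),
            "latency_p50_ms": float(p50),
            "latency_p95_ms": float(p95),
            "latency_p99_ms": float(p99),
            "throughput_rps": len(requests) / window_seconds,
        }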

Data health

  • Missing rate per feature; type/shape mismatches.
  • Freshness/lag: time since last feature update.
  • Drift: PSI or KL divergence versus a recent baseline.
  • Training–serving skew: distribution/mean differences between training and live.
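
Drift metrics are straightforward once the baseline and live windows are binned the same way. A minimal PSI sketch (the bin count and the epsilon for empty bins are arbitrary choices; a KL divergence can be computed from the same binned proportions):

    import numpy as np

    def psi(baseline, live, bins=10, eps=1e-6):
        """Population Stability Index for one numeric feature: how far the live
        distribution has moved from the baseline distribution."""
        edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
        live = np.clip(live, edges[0], edges[-1])        # keep out-of-range values in the end bins
        expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
        actual = np.histogram(live, bins=edges)[0] / len(live)
        expected = np.clip(expected, eps, None)          # avoid log(0) on empty bins
        actual = np.clip(actual, eps, None)
        return float(np.sum((actual - expected) * np.log(actual / expected)))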

Model quality

  • When labels are delayed: proxy metrics such as average confidence, fraction near decision boundary, score distribution shift by segment.
  • When labels arrive: rolling AUC/PR/MAE vs. baseline.
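
When labels do arrive, a rolling window of labelled traffic can be scored against the offline baseline. A sketch using scikit-learn; the drop tolerances mirror the delayed-labels worked example later in this topic and are not universal defaults:

    from sklearn.metrics import roc_auc_score

    def rolling_auc_status(y_true, y_score, baseline_auc=0.88,
                           warn_drop=0.02, crit_drop=0.04):
        """Compare AUC on recently labelled predictions with an offline baseline
        and return an illustrative severity."""
        auc = roc_auc_score(y_true, y_score)
        drop = baseline_auc - auc
        if drop >= crit_drop:
            return "critical", auc
        if drop >= warn_drop:
            return "warning", auc
        return "ok", auc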

Designing good alerts

  • Trigger types: threshold (>=, <=), rate of change, anomaly vs. baseline.
  • Noise control: require a minimum duration (e.g., 5 of the last 10 minutes), group similar alerts, deduplicate, and set cool-downs.
  • Severity: Critical (page immediately), Warning (ticket/Slack), Info (dashboard only).
  • SLOs and error budgets: track budget burn rate; alert on fast burn rather than every breach.
  • Runbooks: each alert should say what to check and likely fixes.
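
A minimal sketch of grouping, deduplication, and cool-downs as per-group state; the (service, version, rule) grouping key and the 30-minute window are illustrative choices:

    from datetime import datetime, timedelta, timezone

    class AlertDeduper:
        """Suppress repeat notifications for the same alert group while a
        cool-down window is active."""

        def __init__(self, cooldown=timedelta(minutes=30)):
            self.cooldown = cooldown
            self.last_sent = {}                     # group key -> time of last notification

        def should_notify(self, service, version, rule, now=None):
            now = now or datetime.now(timezone.utc)
            key = (service, version, rule)          # grouping key
            last = self.last_sent.get(key)
            if last and now - last < self.cooldown:
                return False                        # duplicate within cool-down: suppress
            self.last_sent[key] = now
            return True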

Dashboard design that works under pressure

  • Top row: SLO summary cards (uptime, latency p95, error rate, data freshness).
  • Row 2 (System): Requests, p50/p95/p99 latency, error rate with breakdowns.
  • Row 3 (Data): Missing rates, freshness, top drifted features (PSI), schema changes.
  • Row 4 (Model): Score histogram, mean score by segment, fraction of predictions near the decision boundary, quality metric trend when labels arrive.
  • Row 5 (Rollouts): Canary vs. baseline comparisons, version adoption, rollback button status (if applicable).

Make it scannable with sparklines, clear units, and time range presets (last 1h, 6h, 24h, 7d). Prefer rates and percentiles over raw counts.

Worked examples

Example 1: Latency SLO

Goal: p95 latency <= 300 ms over 30 minutes.

  • Warning alert: p95 > 310 ms for 5 of the last 10 minutes.
  • Critical alert: p95 > 500 ms for 3 consecutive minutes or error rate > 2% for 5 minutes.
  • Dashboard: side-by-side charts for p50/p95/p99 latency, request rate, and error rate with version split.
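
A sketch of how the duration requirements above ("5 of the last 10 minutes", "3 consecutive minutes") could be evaluated over a per-minute p95 series; the sample values are made up to show the logic:

    def breaches_m_of_n(per_minute_p95, threshold_ms=310.0, m=5, n=10):
        """Warning condition: at least m of the last n one-minute p95 samples
        exceed the threshold."""
        window = per_minute_p95[-n:]
        return sum(v > threshold_ms for v in window) >= m

    def breaches_consecutive(per_minute_p95, threshold_ms=500.0, k=3):
        """Critical condition: the last k one-minute p95 samples all exceed the
        threshold."""
        window = per_minute_p95[-k:]
        return len(window) == k and all(v > threshold_ms for v in window)

    p95_series = [280, 295, 320, 330, 315, 290, 340, 325, 300, 335]   # ms, last 10 minutes
    print(breaches_m_of_n(p95_series))       # True  -> warning (6 of 10 above 310 ms)
    print(breaches_consecutive(p95_series))  # False -> no critical page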

Example 2: Data drift

Compute PSI for top features every 10 minutes vs. the last 7 days as baseline.

  • Info alert: any feature PSI ∈ [0.2, 0.3) for 30 minutes.
  • Warning alert: any feature PSI >= 0.3 for 30 minutes; include feature name and segment.
  • Critical alert: PSI >= 0.4 across at least 3 high-importance features for 15 minutes.
  • Dashboard: table of features with PSI, missing rate, freshness; click-to-expand histogram overlays.
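
A sketch mapping per-feature PSI readings to these severities; the input shapes and the high-importance feature list are assumptions, and the duration requirements ("for 30 minutes", "for 15 minutes") are left to the alert evaluator:

    def drift_severity(psi_by_feature, high_importance):
        """Map current PSI readings to the Info/Warning/Critical rules above."""
        critical = [f for f, v in psi_by_feature.items()
                    if v >= 0.4 and f in high_importance]
        if len(critical) >= 3:
            return "critical", critical
        warning = [f for f, v in psi_by_feature.items() if v >= 0.3]
        if warning:
            return "warning", warning
        info = [f for f, v in psi_by_feature.items() if 0.2 <= v < 0.3]
        if info:
            return "info", info
        return "ok", []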

Example 3: Delayed labels quality guardrail

Labels arrive daily; offline AUC baseline is 0.88 (7-day rolling).

  • Warning: AUC drop >= 0.02 below baseline for 2 days.
  • Critical: AUC drop >= 0.04 for 2 days or sharp segment-specific drop (e.g., high-value cohort).
  • Proxy guardrail: fraction of predictions within ±0.05 of decision threshold increases by >= 50% over 6 hours (model uncertainty spike).
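
A sketch of the proxy guardrail: compare the near-threshold fraction now against six hours ago. The 0.5 decision threshold is an assumption for a binary classifier:

    def near_threshold_fraction(scores, threshold=0.5, band=0.05):
        """Fraction of prediction scores within +/- band of the decision threshold."""
        return sum(abs(s - threshold) <= band for s in scores) / max(len(scores), 1)

    def uncertainty_spike(scores_now, scores_6h_ago, rel_increase=0.5):
        """True if the near-threshold fraction grew by >= 50% over six hours."""
        before = near_threshold_fraction(scores_6h_ago)
        after = near_threshold_fraction(scores_now)
        return before > 0 and (after - before) / before >= rel_increase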

Example 4: Alert fatigue reduction

  • Group alerts by service and version; dedupe identical alerts for 30 minutes.
  • Silence alerts during planned maintenance windows.
  • Use multi-window burn-rate alerts for SLOs (e.g., page when 2% of the error budget burns in 1 hour; open a ticket when 10% burns in 3 days) to catch both fast and slow burns.
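
A sketch of the burn-rate math, assuming a 99.9% availability SLO over a 30-day window; the 14.4 and 1.0 burn-rate thresholds correspond to the 2%-in-1-hour and 10%-in-3-days examples above:

    def burn_rate(window_error_rate, slo=0.999):
        """Error-budget burn rate: 1.0 means errors arrive exactly as fast as the
        budget allows; 14.4 empties a 30-day budget in about two days."""
        return window_error_rate / (1.0 - slo)

    def burn_alerts(error_rate_1h, error_rate_3d, slo=0.999):
        """Multi-window burn-rate alerts: a fast burn pages, a slow burn opens a ticket."""
        alerts = []
        if burn_rate(error_rate_1h, slo) >= 14.4:   # ~2% of the 30-day budget per hour
            alerts.append(("critical", "fast burn: page on-call"))
        if burn_rate(error_rate_3d, slo) >= 1.0:    # ~10% of the budget per 3 days
            alerts.append(("warning", "slow burn: open a ticket"))
        return alerts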

Who this is for and prerequisites

Who this is for

  • Machine Learning Engineers shipping and operating models.
  • Data Scientists owning experiments that go live.
  • SREs supporting ML services.

Prerequisites

  • Basic statistics (distributions, percentiles, divergence).
  • Familiarity with model lifecycle and serving.
  • Comfort with metrics concepts (rates, windows, baselines).

Learning path

  1. List your Data → Model → System metrics with clear definitions and units.
  2. Set SLOs for latency, availability, data freshness, and a model quality proxy.
  3. Draft alert rules with severities and anti-noise settings (durations, grouping).
  4. Design a one-page dashboard with the five-row layout above.
  5. Backfill 7–30 days of metrics to create baselines.
  6. Run synthetic incidents (traffic spike, schema change) and verify alerts fire once, with clear messages.
  7. Iterate monthly using post-incident reviews.

Exercises

Do the exercise below, then take the Quick Test at the end.

Exercise ex1 — Draft alerting and dashboard plan for a churn model API

Scenario: You serve a binary churn model via a REST endpoint. Labels arrive after 3–7 days. Traffic: ~200 RPS peak.

  1. Define 4–6 SLIs and SLOs (cover system, data, model proxy).
  2. Write 3 alert rules (Warning/Critical) with thresholds and durations.
  3. Specify anti-noise controls (grouping, dedupe, cool-down).
  4. Sketch a dashboard: rows, charts, key breakdowns.
  • Deliverable: A short spec anyone on-call can use.

Exercise checklist

  • Each SLI includes unit and window (e.g., p95 latency, 10m).
  • Each alert has severity, threshold, and minimum duration.
  • Data drift plan names the top features and the metric (e.g., PSI).
  • Dashboard top row has SLO cards; each chart has labels and segments.

Common mistakes and self-check

Mistakes

  • Alerting on raw counts instead of rates/percentiles, causing false alarms during traffic changes.
  • One global threshold for all segments; missing cohort-specific regressions.
  • No baseline for drift, so alerts fire during normal seasonal swings.
  • Only monitoring model metrics; ignoring data pipeline freshness and schema.
  • No deduplication or duration windows; paging storms.
  • Unclear alert messages without runbook steps.

Self-check

  • Can a new on-call person identify the likely cause from one alert message?
  • Can you simulate a schema change and get exactly one clear alert?
  • Does the dashboard answer: Is it system, data, or model?

Practical projects

  • Build a dummy scoring service that logs latency, errors, and prediction scores; compute p95, error rate, and score histogram.
  • Implement a drift job that calculates PSI hourly for five features against a rolling 7-day baseline; emit metrics.
  • Create a one-page dashboard with five rows (SLO, System, Data, Model, Rollout) and add a version comparison panel.
  • Run a game day: inject latency, increase missing values, and shift a feature; capture which alerts fired and update rules.

Next steps

  • Add burn-rate SLO alerts for reliability.
  • Integrate canary/rollout panels and automated rollback conditions.
  • Expand cohort dashboards for high-value segments.
  • Schedule monthly reviews to tune thresholds and runbooks.

Mini challenge

Your demand is highly seasonal (weekends spike). Design a drift and latency alerting scheme that avoids false positives. What baseline windows and thresholds will you choose? How will you handle holidays and product launches?

Take the quick test

Anyone can take the test below for free.

Practice Exercises

1 exercise to complete

Instructions

Scenario: You operate a churn prediction REST API (binary classification). Labels arrive after 3–7 days. Peak traffic is ~200 RPS.

  1. Define 4–6 SLIs and SLOs with units and windows (cover system, data, model proxy).
  2. Create 3 alert rules (Warning/Critical) with thresholds and minimum durations.
  3. Specify anti-noise controls: grouping keys, dedup interval, cool-down/silence windows.
  4. Sketch a dashboard layout: rows and key charts. Include at least one segment breakdown.

Write your plan as a short spec that an on-call engineer can follow.

Expected Output
A concise plan listing SLIs/SLOs, alert rules with thresholds and durations, anti-noise settings, and a clear one-page dashboard layout.

Alerting And Dashboards — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

