Dashboards And Alerts

Learn Dashboards And Alerts for free, with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Why this matters

As a Platform Engineer, you translate system signals into insight and action. Clear dashboards help teams notice changes early. Reliable alerts wake the right person only when immediate action is needed. Together, they reduce outage time, speed up incident response, and build trust in your platform.

  • Typical tasks: define service health metrics, build Grafana/Kibana dashboards, design SLO-based alerts, route incidents by severity, and maintain runbooks.
  • Outcomes: fewer false pages, faster triage, and consistent operational standards across teams.

Who this is for

  • Platform/DevOps engineers standardizing observability.
  • Backend engineers on-call for services.
  • SREs aligning alerts with SLOs.

Prerequisites

  • Comfort with time-series metrics, logs, and traces basics.
  • Familiarity with Prometheus-style queries or similar (rates, histograms).
  • Basic understanding of HTTP status codes, latency percentiles, and error budgets.

Learning path

  1. Define SLIs and SLOs that reflect user experience.
  2. Draft core dashboard panels (traffic, errors, latency, saturation).
  3. Add context panels (deploys, feature flags, dependencies).
  4. Design two-stage alerts (warning vs critical) with time windows.
  5. Implement routing, grouping, and noise controls (dedup, inhibition, silences).
  6. Attach runbooks and test in a game day.

Concept explained simply

Dashboards answer “What is happening?” Alerts answer “Do I need to act now?”

  • Dashboards: curated panels that visualize key signals over time. Good dashboards reduce guesswork during incidents.
  • Alerts: rules that evaluate signals and notify humans or systems when predefined risk thresholds are crossed.

Mental model

Think of observability as a smoke alarm system for a building:

  • Sensors (metrics/logs) provide raw signals.
  • Panels (dashboards) show the current and historical smoke levels in different rooms.
  • Alarms (alerts) only ring when thresholds are crossed long enough to be meaningful, with details on where to go and what to check first.

Checklist: A good dashboard
  • Shows the four golden signals: traffic, errors, latency, saturation.
  • Highlights changes (deploys, config toggles).
  • Uses rates/ratios, not raw counters (see the comparison after this list).
  • Surfaces percentiles (p95/p99), not only averages.
  • Explains units and time windows in panel titles.
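
For the "rates/ratios, not raw counters" item, compare the two queries below (the same metric used in the worked examples later on):

# Raw counter: a monotonically increasing total; the panel just climbs forever
http_requests_total{job="api"}

# Per-second request rate over a 5m window: the shape you actually want to graph
rate(http_requests_total{job="api"}[5m])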

Checklist: A good alert
  • Maps to user impact or clear risk.
  • Uses multi-minute windows to avoid flapping.
  • Has severity levels (warning vs critical).
  • Includes owner/team and a short runbook.
  • Routes to the right on-call and deduplicates by service.

Worked examples

Example 1: Traffic and error rate dashboard panels

Goal: See request volume and error rate (%) for the API service.

Request rate (RPS):

sum(rate(http_requests_total{job="api"}[5m]))

Error rate (% of 5xx):

100 * sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="api"}[5m]))

Panel tips:

  • Use 5m rate for stability; annotate deploys.
  • Set thresholds: warning at > 2% for 15m; critical at > 5% for 5m.
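
To wire those thresholds straight into alert rules, a minimal sketch (same metric as above; alert names and label values are illustrative, so adjust them to your service):

alert: APIErrorRateWarning
expr: |
  100 * sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="api"}[5m]))
  > 2
for: 15m
labels:
  severity: warning
  team: api
  service: api
annotations:
  summary: "API 5xx error rate above 2% for 15 minutes"
# Critical variant: same expression, but > 5, for: 5m, and severity: critical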

Example 2: Latency p95 and heatmap

Goal: Show high-percentile latency and distribution.

p95 latency (seconds):

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m])))

Heatmap: plot rate(http_request_duration_seconds_bucket{job="api"}[5m]) over le buckets. Add a reference line for SLO target (for example p95 < 300ms).
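
To tie the panel to the SLO target, one possible warning rule on p95 (0.3 s = 300 ms; the threshold, duration, and names here are illustrative):

alert: APILatencyP95High
expr: |
  histogram_quantile(0.95,
    sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m])))
  > 0.3
for: 10m
labels:
  severity: warning
  team: api
  service: api
annotations:
  summary: "API p95 latency above 300ms (5m window) for 10m"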

Example 3: SLO burn-rate alert (multi-window)

Goal: Page only when the error budget is burning too fast.

  • SLO: 99.9% over 30 days ⇒ error budget = 0.1%.
  • Burn rate 14.4 means the budget is being spent 14.4× faster than the SLO allows: roughly one day's worth of budget gone in about 1.67 hours.
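
Quick check on the arithmetic: one day's share of the budget burns in 24 h / 14.4 ≈ 1.67 h, the whole 30-day budget burns in 30 days / 14.4 ≈ 2.1 days, and 14.4 / 720 h ≈ 2% of the monthly budget disappears every hour.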

PromQL pattern (er_short and er_long are shorthand for readability, not real PromQL variables):

# Error rate windows
er_short = sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api"}[5m]))
er_long  = sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
          / sum(rate(http_requests_total{job="api"}[1h]))

# Page when both windows exceed burn threshold
critical = (er_short > 14.4 * 0.001) and (er_long > 14.4 * 0.001)
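
In a real rule file, the two windows are usually captured as recording rules so the alert expression stays short (the rule names below are illustrative):

groups:
  - name: api-slo
    rules:
      # Short (5m) and long (1h) 5xx error-rate ratios
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
      - record: job:http_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="api"}[1h]))

The alert below inlines the full expressions instead, so it works even if you skip the recording rules.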

Turn this into an alert with labels:

alert: APIErrorBudgetBurnRateHigh
expr: |
  ( sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="api"}[5m])) > 14.4 * 0.001 )
  and
  ( sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="api"}[1h])) > 14.4 * 0.001 )
for: 5m
labels:
  severity: critical
  team: api
  service: api
annotations:
  summary: "API error budget is burning too fast"
  runbook: "Check recent deploys, rollback if needed; inspect p95 and dependencies."

Optional: CPU saturation alert
alert: NodeCpuSaturation
expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
for: 15m
labels:
  severity: warning
  team: platform
annotations:
  summary: "High CPU for 15m (likely saturation)"
  runbook: "Inspect top processes; consider scaling or throttling noisy workloads."

Designing alerts that do not wake you unnecessarily

  • Alert on symptoms, not just causes. Prefer user-facing error rate or high latency over internal counters alone.
  • Use time windows to reduce flapping (for example 5m, 15m). Avoid single-sample triggers.
  • Separate severities: warning (action soon) vs critical (page now).
  • Group and deduplicate by service and team so one incident creates one page.
  • Inhibit related alerts when a parent alert is firing (for example, inhibit pod alerts when the node is down).
  • Respect silences and maintenance windows during planned work.
  • Every alert should have owner/team, service, severity, and a short runbook.

Minimal alert label set to standardize (used by the routing sketch below)
  • team: owning team
  • service: logical service/component
  • severity: info|warning|critical
  • environment: prod|staging
  • runbook: short internal reference
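
Putting the grouping, routing, and inhibition advice above together, a minimal Alertmanager sketch (receiver names and alert names are placeholders for your own):

route:
  receiver: default
  group_by: [team, service, severity]
  group_wait: 30s        # hold briefly so related alerts arrive as one notification
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
        - environment="prod"
      receiver: pagerduty-oncall
    - matchers:
        - severity="warning"
      receiver: team-slack

inhibit_rules:
  # Suppress pod/container alerts while the node-level alert fires
  # (assumes both alerts carry a matching instance label)
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - alertname=~"Pod.*|Container.*"
    equal: [instance]

receivers:
  - name: default
  - name: pagerduty-oncall
  - name: team-slack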

Exercises

Do these in your lab environment or using sample data. Aim for clarity over perfection.

Exercise 1 — Build a minimal API reliability dashboard and two alerts

Tasks:

  1. Create three panels for the API service: RPS, error rate %, and p95 latency.
  2. Add a warning alert when error rate > 2% for 15 minutes; critical when > 5% for 5 minutes.
  3. Include panel titles with units and windows (for example, “API p95 latency (5m window)”).

Hints
  • Use rate over 5m for stability.
  • Error percentage = errors / total * 100.
  • p95 from histograms uses histogram_quantile over rate of buckets.

Exercise 2 — Design routing and noise controls

Tasks:

  1. Choose grouping labels for notifications so related alerts merge.
  2. Write an inhibition rule to suppress pod/container alerts when a node alert is active.
  3. Propose a silence policy template for planned maintenance.
  4. Add a short runbook annotation for your critical alert.

Hints
  • Group by team, service, severity; avoid high-cardinality labels like pod name.
  • Inhibition: target alerts are suppressed when a higher-level alert with matching labels fires.
  • Silence template: who, what, where, when, why.
  • Checklist to complete: panels show units; alerts have owner/team; warning vs critical thresholds; routing and inhibition documented; runbooks attached.

Common mistakes and self-check

  • Mistake: Alerting on raw counters. Fix: use rate() over a window.
  • Mistake: Averages only. Fix: include p95/p99 percentiles.
  • Mistake: Too many high-cardinality labels in grouping. Fix: group by service/team/severity.
  • Mistake: No runbook. Fix: add a 3–5 step starter playbook to each alert.
  • Mistake: Single-window alerts. Fix: add a short and long window for burn-rate.

Self-check questions
  • Can someone new to the team read your dashboard titles and understand units and windows?
  • If three pods fail in one node, do you get one page or many?
  • Does each critical alert include the first two diagnostic steps?

Practical projects

  • Project 1: Standard Service Dashboard. Build a reusable Grafana folder with golden-signal panels and annotations for deploys.
  • Project 2: SLO Pack. Define SLIs/SLOs for one critical service and implement dual-window burn-rate alerts.
  • Project 3: Noise Audit. Export alert history for 30 days, identify top noisy alerts, and reduce volume by 30% with grouping/inhibition.

Mini challenge

Pick one production service and remove one alert that never led to action. Replace it with a symptom-based alert tied to user impact. Document why it is better.

Next steps

  • Roll out standard labels and runbook templates across alerts.
  • Hold a monthly review of top pages: fix root causes or downgrade severity.
  • Run a 1-hour game day using your dashboard and alerts; refine based on findings.

Practice Exercises

2 exercises to complete

Instructions

  1. Create three dashboard panels: request rate (RPS), error rate (%), and p95 latency for the API service using 5m windows.
  2. Add a warning alert when error rate > 2% for 15m; add a critical alert when error rate > 5% for 5m.
  3. Ensure panel titles include units and time windows.

Expected Output
Three working panels showing stable signals and two alert rules with correct thresholds and durations.

Dashboards And Alerts — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
