Why this matters
As a Backend Engineer, you ship services that must stay healthy. Metrics and dashboards are how you see reality: capacity trends, latency spikes, error bursts, and release impact. You will use them to:
- Investigate incidents in minutes, not hours.
- Track SLOs and error budgets to guide releases.
- Plan capacity and scale before users feel pain.
- Prove impact of optimizations with before/after charts.
Concept explained simply
Metrics are numeric measurements over time. A dashboard is a curated set of graphs answering a specific question (e.g., “Is the API healthy?”). You typically collect counters (ever-increasing totals), gauges (current values), and histograms (distributions like latency). You query them to show rates, percentiles, and ratios.
SLIs are measurements that reflect user experience (e.g., request success rate, latency). SLOs are targets for SLIs (e.g., 99.9% success monthly). Error budgets are the allowed failure (e.g., 0.1%). Dashboards help you track these and act fast.
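To make the error budget concrete, here is a minimal sketch (Python, with purely illustrative traffic numbers, not from any real service) of turning an SLO target into an allowed failure count for the month:

# Illustrative numbers only: turn an SLO into an error budget.
slo = 0.999                     # 99.9% monthly success target
monthly_requests = 10_000_000   # assumed traffic for the month

error_budget_ratio = 1 - slo                       # 0.001, i.e. 0.1%
allowed_failures = monthly_requests * error_budget_ratio

print(f"Error budget: {error_budget_ratio:.3%} ~ {allowed_failures:,.0f} failed requests")
# -> Error budget: 0.100% ~ 10,000 failed requests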
Mental model
Think of your service as a car:
- The speedometer is your request rate.
- Warning lights are your error rates.
- Engine temperature is latency percentiles.
- Fuel gauge is resource saturation (CPU, memory, queue lag).
Your mission: design a clean dashboard so anyone can drive safely at a glance.
Core building blocks
- Metric types (a minimal instrumentation sketch follows this list):
  - Counter: increases only (e.g., total_requests). Use rate() over time windows.
  - Gauge: current value (e.g., in_flight_requests, memory_used_bytes).
  - Histogram: distribution via buckets (e.g., request_duration_seconds_bucket).
- Labels/tags: key-value dimensions (method, status, region). Keep cardinality low; avoid user IDs or UUIDs.
- Units: use base units and suffixes (seconds, bytes, ratio). Name metrics clearly (e.g., api_request_duration_seconds).
- Golden signals: Traffic, Errors, Latency, Saturation (see also the RED and USE patterns).
- Percentiles: p50 (typical), p90 (slow), p99 (tail). Tail drives user pain and incident pages.
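Here is that minimal instrumentation sketch, using Python's prometheus_client as one common option; the metric names, labels, and buckets are illustrative, not a prescribed standard:

# Minimal instrumentation sketch (Python prometheus_client); names are illustrative.
from prometheus_client import Counter, Gauge, Histogram

# Counter: only ever increases; query with rate() over a window.
REQUESTS = Counter(
    "api_requests_total", "Total HTTP requests",
    ["method", "status"],          # low-cardinality labels only
)

# Gauge: current value; can go up and down.
IN_FLIGHT = Gauge("api_in_flight_requests", "Requests currently being handled")

# Histogram: latency distribution in base units (seconds).
LATENCY = Histogram(
    "api_request_duration_seconds", "Request duration in seconds",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5),
)

REQUESTS.labels(method="GET", status="200").inc()
IN_FLIGHT.set(3)
LATENCY.observe(0.042)             # one request that took 42 ms

Note the base units (seconds) and the low-cardinality labels, matching the naming and cardinality advice above.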
Quick reference: when to use what
- Total counts over time: Counter + rate()/irate().
- Current level: Gauge.
- Latency: Histogram + histogram_quantile() for p50/p90/p99.
- Ratios: divide two rates (e.g., error_rate / request_rate).
Worked examples
Example 1: API health overview
Goal: from the above-the-fold row, decide in 10 seconds whether the API is healthy.
- Request rate (RPS):
  sum(rate(api_requests_total[5m]))
- Error rate (%):
  (sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))) * 100
- Latency percentiles (p50/p90/p99):
  histogram_quantile(0.50, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
  histogram_quantile(0.90, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
  histogram_quantile(0.99, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
Interpretation: RPS stable, errors < 1%, p99 below your SLO threshold → healthy.
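If you want to sanity-check these numbers outside the dashboard, a small script can run the same queries against a Prometheus-compatible HTTP API. This is a sketch only: it assumes a server at localhost:9090, the metric names above, at least one series per query result, and illustrative thresholds (1% errors, 0.5 s p99).

# Sketch: evaluate the three overview queries against a Prometheus-style API.
import requests

PROM = "http://localhost:9090/api/v1/query"

QUERIES = {
    "rps": 'sum(rate(api_requests_total[5m]))',
    "error_pct": '(sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))) * 100',
    "p99_seconds": 'histogram_quantile(0.99, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))',
}

def scalar(query: str) -> float:
    resp = requests.get(PROM, params={"query": query}, timeout=5).json()
    return float(resp["data"]["result"][0]["value"][1])  # first series' value

values = {name: scalar(q) for name, q in QUERIES.items()}
healthy = values["error_pct"] < 1.0 and values["p99_seconds"] < 0.5  # illustrative thresholds
print(values, "healthy" if healthy else "investigate")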
Example 2: Queue worker saturation
Goal: See if workers keep up with incoming jobs.
- Ingress rate:
  sum(rate(queue_jobs_enqueued_total[5m]))
- Processing rate:
  sum(rate(queue_jobs_processed_total[5m]))
- Backlog depth (gauge):
  sum(queue_jobs_ready)
- Backlog drain time (rough; clamp_min guards against a near-zero processing rate):
  sum(queue_jobs_ready) / clamp_min(sum(rate(queue_jobs_processed_total[5m])), 1)
Interpretation: If drain time grows and processing rate < ingress, add workers or optimize.
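The drain-time math is simple enough to sanity-check by hand; a tiny sketch with illustrative numbers:

# Illustrative drain-time math for the queue example above.
backlog_jobs = 12_000          # current value of the queue_jobs_ready gauge
ingress_rate = 45.0            # jobs/second enqueued (5m rate)
processing_rate = 30.0         # jobs/second processed (5m rate)

drain_seconds = backlog_jobs / max(processing_rate, 1)   # guard against divide-by-zero
print(f"Drain time ~ {drain_seconds / 60:.0f} min")       # -> Drain time ~ 7 min

if processing_rate < ingress_rate:
    print("Backlog is growing: add workers or optimize the consumer.")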
Example 3: SLO tracking and burn rate
SLO: 99.9% success monthly → error budget 0.1%.
- Instant error ratio (5m):
  sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))
- Burn rate (fast 5m window and slow 1h window, relative to the 0.1% budget):
  (sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))) / 0.001
  (sum(rate(api_requests_total{status=~"5.."}[1h])) / sum(rate(api_requests_total[1h]))) / 0.001
Interpretation: Burn rate >= 14 over 5m usually merits a page; sustained lower burn on slow window suggests a ticket. Calibrate to your policy.
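A quick sketch of the underlying arithmetic, with an illustrative error ratio, shows why a high burn rate is urgent:

# Illustrative burn-rate math for a 99.9% monthly SLO (error budget = 0.1%).
error_budget = 0.001
observed_error_ratio_5m = 0.015     # 1.5% errors over the fast (5m) window

burn_rate = observed_error_ratio_5m / error_budget    # 15x budget consumption
print(f"Burn rate: {burn_rate:.1f}x")

# At a sustained 15x burn, a 30-day budget lasts roughly 30 / 15 = 2 days.
days_to_exhaust = 30 / burn_rate
print(f"Budget exhausted in about {days_to_exhaust:.1f} days at this rate")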
Designing effective dashboards
- First row = Golden signals (RPS, errors %, p50/p99 latency, saturation).
- Group by question: “Is it healthy?” then “Where is it broken?” then “Why is it broken?”
- Use consistent colors and units; add axis units and thresholds.
- Avoid dense label explosions; prefer top-N breakouts (method, endpoint) with others aggregated.
- Show release annotations and incident notes to correlate changes with spikes.
- Pick time windows intentionally: 15m for immediate shape, 6–24h for trend context.
Step-by-step: build a minimal SRE dashboard
- Add Golden signals: RPS, error %, p50/p99 latency on one row.
- Add saturation: CPU %, memory, connection pool, queue lag.
- Add breakdowns: Errors by status, latency by endpoint (top 5 by RPS).
- Add SLO row: SLI, budget remaining, burn rate (fast/slow windows).
- Add diagnostics: DB latency, external dependency errors, cache hit rate.
Checklist: ready for on-call?
- First row answers "Is it healthy?" in about 10 seconds (RPS, error %, p50/p99 latency, saturation).
- Error % and latency thresholds match your SLO.
- Saturation panels cover CPU, memory, connection pools, and queue lag.
- SLO row shows the SLI, remaining budget, and fast/slow burn rates.
- Release annotations are enabled so spikes can be correlated with changes.
Exercises
Do these in your metrics system, or work through the queries conceptually. Compare with the expected outputs; hints and expected query shapes are provided.
Exercise 1 — API latency and errors dashboard row
Task: Create a dashboard row for an HTTP API with:
- RPS (5m rate of total requests).
- Error percentage (5xx over total).
- Latency p50, p90, p99 from a latency histogram.
Hints
- Use rate() on counters and histogram_quantile() on histogram buckets.
- Aggregate buckets with sum(...) by (le) before taking quantiles.
- Keep units in seconds and show thresholds consistent with your SLO.
Expected output shape
sum(rate(api_requests_total[5m]))
((sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))) * 100)
histogram_quantile(0.50, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
Exercise 2 — Queue backlog drain time
Task: Show if your workers can drain the backlog within 15 minutes.
- Chart ingress rate and processing rate (5m).
- Compute drain time = backlog / max(processing_rate, 1).
- Add a threshold at 900 seconds (15m).
Hints
- Backlog is usually a gauge (ready jobs).
- Use consistent label filters so ingress and processed rates cover the same jobs.
- If drain time increases while processing rate is flat, add workers or optimize.
Expected output shape
sum(rate(queue_jobs_enqueued_total[5m]))
sum(rate(queue_jobs_processed_total[5m]))
sum(queue_jobs_ready) / clamp_min(sum(rate(queue_jobs_processed_total[5m])), 1)
Common mistakes and self-check
- High-cardinality labels: user_id, request_id in labels can explode memory. Self-check: top 10 label value counts; remove IDs.
- Wrong metric type: using a gauge for totals breaks rate math. Self-check: total-only metrics should never decrease.
- Mixed units: ms vs s causes confusion. Self-check: verify metric names end with _seconds, _bytes, or _ratio.
- Percentiles without traffic: p99 on low traffic is noisy. Self-check: pair percentiles with RPS.
- Dashboard overload: 40 panels on one page hides the signal. Self-check: first row answers “healthy?” in 10 seconds.
Practical projects
- Instrument a sample HTTP service with request counter, error counter, latency histogram; build the Golden signals row (a starter sketch follows this list).
- Create a worker dashboard showing ingress, processing, backlog, drain time, and top failure causes.
- Define an SLO for your most important endpoint and add an SLO row with budget burn indicators.
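For the first project, here is a starter sketch using Flask and prometheus_client; the route, metric names, and port are illustrative, not a required setup:

# Starter sketch: a Flask service instrumented with prometheus_client.
import time
from flask import Flask, Response, g, request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

REQUESTS = Counter("api_requests_total", "Total HTTP requests", ["method", "status"])
LATENCY = Histogram("api_request_duration_seconds", "Request duration in seconds")

@app.before_request
def start_timer():
    g.start_time = time.perf_counter()

@app.after_request
def record_metrics(response):
    # Count every request by method and status, and observe its duration.
    REQUESTS.labels(method=request.method, status=str(response.status_code)).inc()
    LATENCY.observe(time.perf_counter() - g.start_time)
    return response

@app.route("/work")
def work():
    return {"ok": True}

@app.route("/metrics")
def metrics():
    # Scrape endpoint for your Prometheus-style collector.
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == "__main__":
    app.run(port=8000)

Point your scraper at /metrics and, assuming the same metric names, the Example 1 queries above should work unchanged.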
Who this is for
- Backend Engineers building and operating services.
- Platform/SRE engineers standardizing observability.
- Developers on-call who need fast, reliable diagnostics.
Prerequisites
- Basic service monitoring knowledge (counters, gauges).
- Comfort reading time-series graphs.
- Familiarity with one metrics stack (e.g., Prometheus-style concepts).
Learning path
- Before: service instrumentation basics.
- This subskill: selecting metrics, writing queries, building dashboards for decisions.
- Next: alerting on SLOs and burn rates, logs and traces for deep dives.
Next steps
- Standardize metric names and units across services.
- Create a shared dashboard template for new services.
- Review dashboards during post-incident to fill blind spots.
Mini challenge
Pick one critical endpoint. Add p50/p90/p99 latency, error %, and RPS panels. Set thresholds matching your SLO. Share with your team and ask: can we tell it’s healthy in 10 seconds?