Why this matters
As a Backend Engineer, you ship services that must stay healthy. Metrics and dashboards are how you see reality: capacity trends, latency spikes, error bursts, and release impact. You will use them to:
- Investigate incidents in minutes, not hours.
- Track SLOs and error budgets to guide releases.
- Plan capacity and scale before users feel pain.
- Prove impact of optimizations with before/after charts.
Concept explained simply
Metrics are numeric measurements over time. A dashboard is a curated set of graphs answering a specific question (e.g., “Is the API healthy?”). You typically collect counters (ever-increasing totals), gauges (current values), and histograms (distributions like latency). You query them to show rates, percentiles, and ratios.
SLIs are measurements that reflect user experience (e.g., request success rate, latency). SLOs are targets for SLIs (e.g., 99.9% success monthly). Error budgets are the allowed failure (e.g., 0.1%). Dashboards help you track these and act fast.
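To make the error budget concrete, here is a minimal sketch (Python, with purely illustrative traffic numbers, not from any real service) of turning an SLO target into an allowed failure count for the month:

# Illustrative numbers only: turn an SLO into an error budget.
slo = 0.999                     # 99.9% monthly success target
monthly_requests = 10_000_000   # assumed traffic for the month

error_budget_ratio = 1 - slo                       # 0.001, i.e. 0.1%
allowed_failures = monthly_requests * error_budget_ratio

print(f"Error budget: {error_budget_ratio:.3%} ~ {allowed_failures:,.0f} failed requests")
# -> Error budget: 0.100% ~ 10,000 failed requests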
Mental model
Think of your service as a car:
- The speedometer is your request rate.
- Warning lights are your error rates.
- Engine temperature is latency percentiles.
- Fuel gauge is resource saturation (CPU, memory, queue lag).
Your mission: design a clean dashboard so anyone can drive safely at a glance.
Core building blocks
- Metric types (a minimal instrumentation sketch follows this list):
  - Counter: increases only (e.g., total_requests). Use rate() over time windows.
  - Gauge: current value (e.g., in_flight_requests, memory_used_bytes).
  - Histogram: distribution via buckets (e.g., request_duration_seconds_bucket).
- Labels/tags: key-value dimensions (method, status, region). Keep cardinality low; avoid user IDs or UUIDs.
- Units: use base units and suffixes (seconds, bytes, ratio). Name metrics clearly (e.g., api_request_duration_seconds).
- Golden signals: Traffic, Errors, Latency, Saturation (see also the RED and USE patterns).
- Percentiles: p50 (typical), p90 (slow), p99 (tail). Tail drives user pain and incident pages.
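Here is that minimal instrumentation sketch, using Python's prometheus_client as one common option; the metric names, labels, and buckets are illustrative, not a prescribed standard:

# Minimal instrumentation sketch (Python prometheus_client); names are illustrative.
from prometheus_client import Counter, Gauge, Histogram

# Counter: only ever increases; query with rate() over a window.
REQUESTS = Counter(
    "api_requests_total", "Total HTTP requests",
    ["method", "status"],          # low-cardinality labels only
)

# Gauge: current value; can go up and down.
IN_FLIGHT = Gauge("api_in_flight_requests", "Requests currently being handled")

# Histogram: latency distribution in base units (seconds).
LATENCY = Histogram(
    "api_request_duration_seconds", "Request duration in seconds",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5),
)

REQUESTS.labels(method="GET", status="200").inc()
IN_FLIGHT.set(3)
LATENCY.observe(0.042)             # one request that took 42 ms

Note the base units (seconds) and the low-cardinality labels, matching the naming and cardinality advice above.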
Quick reference: when to use what
- Total counts over time: Counter + rate()/irate().
- Current level: Gauge.
- Latency: Histogram + histogram_quantile() for p50/p90/p99.
- Ratios: divide two rates (e.g., error_rate / request_rate).
Worked examples
Example 1: API health overview
Goal: from the above-the-fold row, decide in 10 seconds whether the API is healthy.
- Request rate (RPS):
  sum(rate(api_requests_total[5m]))
- Error rate (%):
  (sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))) * 100
- Latency percentiles (p50/p90/p99):
  histogram_quantile(0.50, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
  histogram_quantile(0.90, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
  histogram_quantile(0.99, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
Interpretation: RPS stable, errors < 1%, p99 below your SLO threshold → healthy.
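If you want to sanity-check these numbers outside the dashboard, a small script can run the same queries against a Prometheus-compatible HTTP API. This is a sketch only: it assumes a server at localhost:9090, the metric names above, at least one series per query result, and illustrative thresholds (1% errors, 0.5 s p99).

# Sketch: evaluate the three overview queries against a Prometheus-style API.
import requests

PROM = "http://localhost:9090/api/v1/query"

QUERIES = {
    "rps": 'sum(rate(api_requests_total[5m]))',
    "error_pct": '(sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))) * 100',
    "p99_seconds": 'histogram_quantile(0.99, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))',
}

def scalar(query: str) -> float:
    resp = requests.get(PROM, params={"query": query}, timeout=5).json()
    return float(resp["data"]["result"][0]["value"][1])  # first series' value

values = {name: scalar(q) for name, q in QUERIES.items()}
healthy = values["error_pct"] < 1.0 and values["p99_seconds"] < 0.5  # illustrative thresholds
print(values, "healthy" if healthy else "investigate")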
Example 2: Queue worker saturation
Goal: See if workers keep up with incoming jobs.
- Ingress rate:
  sum(rate(queue_jobs_enqueued_total[5m]))
- Processing rate:
  sum(rate(queue_jobs_processed_total[5m]))
- Backlog depth (gauge):
  sum(queue_jobs_ready)
- Backlog drain time (rough; clamp_min guards against a near-zero processing rate):
  sum(queue_jobs_ready) / clamp_min(sum(rate(queue_jobs_processed_total[5m])), 1)
Interpretation: If drain time grows and processing rate < ingress, add workers or optimize.
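The drain-time math is simple enough to sanity-check by hand; a tiny sketch with illustrative numbers:

# Illustrative drain-time math for the queue example above.
backlog_jobs = 12_000          # current value of the queue_jobs_ready gauge
ingress_rate = 45.0            # jobs/second enqueued (5m rate)
processing_rate = 30.0         # jobs/second processed (5m rate)

drain_seconds = backlog_jobs / max(processing_rate, 1)   # guard against divide-by-zero
print(f"Drain time ~ {drain_seconds / 60:.0f} min")       # -> Drain time ~ 7 min

if processing_rate < ingress_rate:
    print("Backlog is growing: add workers or optimize the consumer.")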
Example 3: SLO tracking and burn rate
SLO: 99.9% success monthly → error budget 0.1%.
- Instant error ratio (5m):
  sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))
- Burn rate (fast 5m window and slow 1h window, relative to the 0.1% budget):
  (sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))) / 0.001
  (sum(rate(api_requests_total{status=~"5.."}[1h])) / sum(rate(api_requests_total[1h]))) / 0.001
Interpretation: Burn rate >= 14 over 5m usually merits a page; sustained lower burn on slow window suggests a ticket. Calibrate to your policy.
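A quick sketch of the underlying arithmetic, with an illustrative error ratio, shows why a high burn rate is urgent:

# Illustrative burn-rate math for a 99.9% monthly SLO (error budget = 0.1%).
error_budget = 0.001
observed_error_ratio_5m = 0.015     # 1.5% errors over the fast (5m) window

burn_rate = observed_error_ratio_5m / error_budget    # 15x budget consumption
print(f"Burn rate: {burn_rate:.1f}x")

# At a sustained 15x burn, a 30-day budget lasts roughly 30 / 15 = 2 days.
days_to_exhaust = 30 / burn_rate
print(f"Budget exhausted in about {days_to_exhaust:.1f} days at this rate")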
Designing effective dashboards
- First row = Golden signals (RPS, errors %, p50/p99 latency, saturation).
- Group by question: “Is it healthy?” then “Where is it broken?” then “Why is it broken?”
- Use consistent colors and units; add axis units and thresholds.
- Avoid dense label explosions; prefer top-N breakouts (method, endpoint) with others aggregated.
- Show release annotations and incident notes to correlate changes with spikes.
- Pick time windows intentionally: 15m for immediate shape, 6–24h for trend context.
Step-by-step: build a minimal SRE dashboard
- Add Golden signals: RPS, error %, p50/p99 latency on one row.
- Add saturation: CPU %, memory, connection pool, queue lag.
- Add breakdowns: Errors by status, latency by endpoint (top 5 by RPS).
- Add SLO row: SLI, budget remaining, burn rate (fast/slow windows).
- Add diagnostics: DB latency, external dependency errors, cache hit rate.
Checklist: ready for on-call?
- First row answers "Is it healthy?" in about 10 seconds (RPS, error %, p50/p99 latency, saturation).
- Error % and latency thresholds match your SLO.
- Saturation panels cover CPU, memory, connection pools, and queue lag.
- SLO row shows the SLI, remaining budget, and fast/slow burn rates.
- Release annotations are enabled so spikes can be correlated with changes.
Exercises
Do these in your metrics system, or work through the queries conceptually. Compare with the expected outputs; hints and expected query shapes are provided.
Exercise 1 — API latency and errors dashboard row
Task: Create a dashboard row for an HTTP API with:
- RPS (5m rate of total requests).
- Error percentage (5xx over total).
- Latency p50, p90, p99 from a latency histogram.
Hints
- Use rate() on counters and histogram_quantile() on histogram buckets.
- Aggregate buckets with sum(...) by (le) before taking quantiles.
- Keep units in seconds and show thresholds consistent with your SLO.
Expected output shape
sum(rate(api_requests_total[5m]))
((sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))) * 100)
histogram_quantile(0.50, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(api_request_duration_seconds_bucket[5m])) by (le))
Exercise 2 — Queue backlog drain time
Task: Show if your workers can drain the backlog within 15 minutes.
- Chart ingress rate and processing rate (5m).
- Compute drain time = backlog / max(processing_rate, 1).
- Add a threshold at 900 seconds (15m).
Hints
- Backlog is usually a gauge (ready jobs).
- Use consistent label filters so ingress and processed rates cover the same jobs.
- If drain time increases while processing rate is flat, add workers or optimize.
Expected output shape
sum(rate(queue_jobs_enqueued_total[5m]))
sum(rate(queue_jobs_processed_total[5m]))
sum(queue_jobs_ready) / clamp_min(sum(rate(queue_jobs_processed_total[5m])), 1)
Common mistakes and self-check
- High-cardinality labels: user_id, request_id in labels can explode memory. Self-check: top 10 label value counts; remove IDs.
- Wrong metric type: using a gauge for totals breaks rate math. Self-check: total-only metrics should never decrease.
- Mixed units: ms vs s causes confusion. Self-check: verify metric names end with _seconds, _bytes, or _ratio.
- Percentiles without traffic: p99 on low traffic is noisy. Self-check: pair percentiles with RPS.
- Dashboard overload: 40 panels on one page hides the signal. Self-check: first row answers “healthy?” in 10 seconds.
Practical projects
- Instrument a sample HTTP service with request counter, error counter, latency histogram; build the Golden signals row (a starter sketch follows this list).
- Create a worker dashboard showing ingress, processing, backlog, drain time, and top failure causes.
- Define an SLO for your most important endpoint and add an SLO row with budget burn indicators.
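For the first project, here is a starter sketch using Flask and prometheus_client; the route, metric names, and port are illustrative, not a required setup:

# Starter sketch: a Flask service instrumented with prometheus_client.
import time
from flask import Flask, Response, g, request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

REQUESTS = Counter("api_requests_total", "Total HTTP requests", ["method", "status"])
LATENCY = Histogram("api_request_duration_seconds", "Request duration in seconds")

@app.before_request
def start_timer():
    g.start_time = time.perf_counter()

@app.after_request
def record_metrics(response):
    # Count every request by method and status, and observe its duration.
    REQUESTS.labels(method=request.method, status=str(response.status_code)).inc()
    LATENCY.observe(time.perf_counter() - g.start_time)
    return response

@app.route("/work")
def work():
    return {"ok": True}

@app.route("/metrics")
def metrics():
    # Scrape endpoint for your Prometheus-style collector.
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == "__main__":
    app.run(port=8000)

Point your scraper at /metrics and, assuming the same metric names, the Example 1 queries above should work unchanged.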
Who this is for
- Backend Engineers building and operating services.
- Platform/SRE engineers standardizing observability.
- Developers on-call who need fast, reliable diagnostics.
Prerequisites
- Basic service monitoring knowledge (counters, gauges).
- Comfort reading time-series graphs.
- Familiarity with one metrics stack (e.g., Prometheus-style concepts).
Learning path
- Before: service instrumentation basics.
- This subskill: selecting metrics, writing queries, building dashboards for decisions.
- Next: alerting on SLOs and burn rates, logs and traces for deep dives.
Next steps
- Standardize metric names and units across services.
- Create a shared dashboard template for new services.
- Review dashboards during post-incident to fill blind spots.
Mini challenge
Pick one critical endpoint. Add p50/p90/p99 latency, error %, and RPS panels. Set thresholds matching your SLO. Share with your team and ask: can we tell it’s healthy in 10 seconds?