Observability Concepts

Learn Observability Concepts for free with explanations, exercises, and a quick test (for Backend Engineers).

Published: January 20, 2026 | Updated: January 20, 2026

Who this is for

Backend engineers who build, run, or debug services and want reliable, explainable systems they can fix fast.

Prerequisites

  • Basic HTTP and API knowledge
  • Familiarity with service logs and application metrics
  • Comfort reading error messages and stack traces

Why this matters

Real tasks you will face:

  • Pinpoint why latency spiked after a deploy
  • Find which dependency causes 5xx errors
  • Set alerts that wake you for real incidents, not noise
  • Prove you meet reliability targets during audits
Progress note

The quick test at the end is available to everyone. If you are logged in, your progress will be saved automatically.

Concept explained simply

Observability means answering: What is my system doing? Why is it doing that? Can I trust it?

It is built from three main signals:

  • Logs: facts and context about events. Best when structured (key=value).
  • Metrics: numbers over time. Best for trends and alerts.
  • Traces: request journeys across services. Best for cause-and-effect.

Mental model

Think of your system as a city at night:

  • Metrics are the skyline: you see overall shapes (CPU, latency, error rate).
  • Logs are street-level notes: what happened at a specific place and time.
  • Traces are a taxi's GPS: the exact path a request took across the city.

Correlate them using a request ID. Use metrics to detect, traces to localize, logs to explain.

Core building blocks

Logs
  • Structured: use consistent fields like level, ts, service, env, trace_id, span_id.
  • Levels: debug, info, warn, error; production defaults to info or warn.
  • Do not log PII or secrets. Mask tokens and emails when needed.
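
A minimal sketch of structured logging along these lines, using Python's standard logging module with a JSON formatter. The service and env values, and the way trace_id is passed, are illustrative assumptions; in practice they would come from configuration and your tracing library.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent fields."""
    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "checkout",                      # assumption: read from config
            "env": "prod",                              # assumption: read from config
            "trace_id": getattr(record, "trace_id", None),
            "msg": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)                           # production default: info or warn

# Pass trace_id via `extra` so every line can be correlated with a trace.
logger.info("payment authorized", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```
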
Metrics
  • Types: counter (monotonic), gauge (current value), histogram (distribution).
  • Cardinality: avoid unbounded label values (e.g., user_id) to control cost.
  • Percentiles: track p50/p90/p95/p99 latency with histograms so percentiles can be aggregated across instances.
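
A small sketch of the three metric types using the Python prometheus_client library; the metric and label names are illustrative, and the labels are deliberately limited to bounded values.

```python
from prometheus_client import Counter, Gauge, Histogram

# Counter: monotonic; rate() over it gives traffic and error rates.
REQUESTS = Counter(
    "http_requests_total", "HTTP requests handled",
    ["method", "route", "status"],                 # bounded label values only
)

# Gauge: a current value that can go up or down.
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being served")

# Histogram: latency distribution; p50/p95/p99 are derived from the buckets.
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["route"],
    buckets=[0.05, 0.1, 0.3, 0.5, 1.0, 2.5],
)

def record_request(method, route, status, duration_s):
    # Note: no user_id or email labels -- unbounded values explode cardinality.
    REQUESTS.labels(method=method, route=route, status=status).inc()
    LATENCY.labels(route=route).observe(duration_s)
```
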
Traces
  • Trace: a request; spans: timed operations inside it.
  • Propagate a trace_id across services; add span attributes for context.
  • Sampling: keep enough traces for diagnosis; rate-sample normal traffic and tail-sample high-latency or error requests.
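
A sketch of span creation and attribute tagging with the OpenTelemetry Python API. It assumes the SDK, exporter, and sampler are configured at startup; call_payment_provider is a hypothetical downstream call.

```python
from opentelemetry import trace

# Assumes an OpenTelemetry SDK with exporter and sampler is set up elsewhere.
tracer = trace.get_tracer("checkout-service")

def charge(order):
    # Each timed operation becomes a span inside the request's trace.
    with tracer.start_as_current_span("payment.charge") as span:
        span.set_attribute("payment.provider", order["provider"])
        span.set_attribute("cart.size", len(order["items"]))

        # Expose the trace_id so log lines can carry the same identifier.
        trace_id = format(span.get_span_context().trace_id, "032x")
        return call_payment_provider(order, trace_id)   # hypothetical helper
```
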
Golden signals and methods
  • Golden signals: latency, traffic, errors, saturation.
  • RED (for microservices): Rate, Errors, Duration.
  • USE (for resources): Utilization, Saturation, Errors.
SLI, SLO, alerts
  • SLI: measurement (e.g., successful request rate).
  • SLO: target (e.g., 99.9% success over 30 days).
  • Alert: on symptoms (user impact), not just causes (CPU spikes alone).
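
To make the SLI/SLO vocabulary concrete, a short sketch that turns raw counts into an SLI and an error-budget figure; the numbers are examples only.

```python
def sli_success_rate(total_requests, failed_requests):
    """SLI: fraction of requests that succeeded over the window."""
    return 1 - failed_requests / total_requests

def error_budget_remaining(sli, slo=0.999):
    """The budget is the allowed failure fraction (1 - SLO); return the share left."""
    allowed = 1 - slo
    spent = 1 - sli
    return (allowed - spent) / allowed

# Example: 1,000,000 requests with 600 failures over the 30-day window.
sli = sli_success_rate(1_000_000, 600)            # 0.9994
print(error_budget_remaining(sli, slo=0.999))     # 0.4 -> 40% of the budget left
```
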
Dashboards and runbooks
  • Dashboard flow: overview → service drilldown → instance detail → trace/log links.
  • Runbooks: short steps to verify, mitigate, and escalate during incidents.

Worked examples

Example 1: API latency and availability

  1. SLIs: success_rate = 1 - (5xx + 429)/total; latency_p95 for GET /search
  2. SLOs: success_rate ≥ 99.9% (30d), p95 latency ≤ 300ms (7d)
  3. Alerts: page if 5-minute error_budget_burn > 10x; ticket if 1-hour burn > 2x
Why it works

Burn-rate alerts scale their thresholds with the window, so they catch both fast and slow budget burns without paging on noise.
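
A sketch of the two-window check described above, assuming you can already query the error ratio over a 5-minute and a 1-hour window; the 10x and 2x thresholds mirror the example.

```python
def burn_rate(error_ratio, slo=0.999):
    """How many times faster than allowed the error budget is being spent."""
    allowed_error = 1 - slo
    return error_ratio / allowed_error

def evaluate(error_ratio_5m, error_ratio_1h, slo=0.999):
    # Fast burn pages a human; slow burn opens a ticket.
    if burn_rate(error_ratio_5m, slo) > 10:
        return "page"
    if burn_rate(error_ratio_1h, slo) > 2:
        return "ticket"
    return "ok"

# 1.5% errors in the last 5 minutes burns a 99.9% budget 15x too fast.
print(evaluate(error_ratio_5m=0.015, error_ratio_1h=0.002))   # -> "page"
```
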

Example 2: Trace a slow checkout

  1. Propagate trace_id from web → checkout → payment → inventory
  2. Add span attributes: user_tier, cart_size, payment_provider
  3. Tail-sample traces where checkout duration > 1s or status=error
Outcome

Traces show inventory service p95 at 850ms when cart_size>20 → target that dependency.
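
A sketch of the tail-sampling rule from step 3, written as the decision a collector could make once the root span has finished; the span fields shown are illustrative.

```python
def keep_trace(root_span):
    """Tail-sampling decision made after the whole request has completed."""
    slow = root_span["duration_ms"] > 1000        # checkout slower than 1s
    failed = root_span["status"] == "error"
    return slow or failed

# Example root span as a collector might hand it to the sampler.
span = {
    "name": "checkout",
    "duration_ms": 1350,
    "status": "ok",
    "attributes": {"cart_size": 24, "payment_provider": "acme"},
}
print(keep_trace(span))   # True -- kept because it exceeded the 1s threshold
```
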

Example 3: Background worker logging

  1. Structured logs with fields: job_type, attempt, queue, duration_ms, result
  2. On error, log once with error, cause, retryable=true/false
  3. Emit metrics: counter job_failures{job_type}, histogram job_duration_ms{job_type}
Result

Fewer noisy lines, faster MTTR because logs and metrics align by job_type.
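
A sketch tying the worker's logs and metrics together by job_type, using prometheus_client and a structured logger like the one above. TransientError is a hypothetical exception type, and the histogram is named in seconds rather than the _ms name above, following the Prometheus base-unit convention.

```python
from prometheus_client import Counter, Histogram

class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""

JOB_FAILURES = Counter("job_failures_total", "Failed jobs", ["job_type"])
JOB_DURATION = Histogram("job_duration_seconds", "Job duration", ["job_type"])

def run_job(job, handler, logger):
    with JOB_DURATION.labels(job_type=job["type"]).time():
        try:
            handler(job)
        except TransientError as exc:
            JOB_FAILURES.labels(job_type=job["type"]).inc()
            # Log exactly once, with enough context to decide on a retry.
            logger.error("job failed", extra={
                "job_type": job["type"], "attempt": job["attempt"],
                "queue": job["queue"], "retryable": True, "cause": str(exc),
            })
```
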

Try it: Exercises

Complete the tasks below. These match the exercises section so you can compare results.

Exercise 1: Search API SLI/SLO + alerts

Define 3 SLIs, 2 SLOs, and 2 alerts for a /search endpoint under variable load. Consider latency, availability, and saturation.

Exercise 2: Structured logging for payments

Design the log schema and examples for a payment failure so on-call can diagnose within 5 minutes.

Exercise 3: Trace plan for queue pipeline

Create a trace model across producer → queue → consumer with sampling rules to catch slow handlers.

Checklist
  • Include trace_id in logs
  • Avoid high-cardinality labels
  • Alert on symptoms, not raw CPU
  • Define runbook steps per alert

Common mistakes and self-check

Mistake: No correlation between signals

Fix: Put trace_id in every log line, and link metrics to traces with exemplars or shared service and route labels rather than a trace_id metric label (which would explode cardinality). Self-check: Can you jump from a spike to a specific trace in two clicks?

Mistake: Alert fatigue

Fix: Page on user impact (error budget burn), ticket on resource warnings. Self-check: Were you paged in the last week for a non-user-impact event?

Mistake: High metric cardinality

Fix: Drop user_id and email from labels; use cohort or tier instead. Self-check: Do any labels have unbounded values?

Mistake: Logging secrets/PII

Fix: Mask or omit. Self-check: Search logs for token= or Authorization patterns before deploying.
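
A sketch of that self-check as code: a masking helper plus the kinds of patterns worth scanning for before deploy. The regular expressions are illustrative and deliberately not exhaustive.

```python
import re

TOKEN = re.compile(r"(token=)[^&\s]+", re.IGNORECASE)
BEARER = re.compile(r"(authorization:\s*bearer\s+)\S+", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(text):
    """Mask tokens, Authorization headers, and emails before the line is logged."""
    text = TOKEN.sub(r"\g<1>***", text)
    text = BEARER.sub(r"\g<1>***", text)
    text = EMAIL.sub("***@***", text)
    return text

line = "GET /cb?token=abc123 Authorization: Bearer eyJ0... user=bob@example.com"
print(mask(line))
# GET /cb?token=*** Authorization: Bearer *** user=***@***
```
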

Mistake: Only p95 latency

Fix: Track p50, p95, p99. Self-check: Does p99 ever exceed SLO while p95 looks fine?

Practical projects

  1. Instrument a demo service with counters (requests), histograms (latency), and structured logs for errors.
  2. Create a burn-rate alert using two windows (fast 5m and slow 1h) and write a 6-line runbook.
  3. Add tracing spans around outbound calls; record attributes payment_provider and retry_count.
Tip: Minimal fields for production
  • logs: ts, level, service, env, trace_id, msg
  • metrics: counters and histograms for the golden signals
  • traces: service name, span kind, duration, status

Learning path

  1. Start: Golden signals, SLIs, SLOs
  2. Instrument: Structured logs → metrics (counters/histograms) → basic traces
  3. Operate: Dashboards, burn-rate alerts, runbooks
  4. Refine: Sampling strategy, reduce cardinality, add p99
  5. Scale: Cross-service tracing and dependency maps
  6. Govern: Data retention, cost, and privacy controls

Mini challenge

You are splitting a monolith into two services: orders and inventory. Latency spikes appear during peak hours. Design:

  • Two SLIs per service
  • One SLO per service
  • One page-worthy alert for the system
  • Trace attributes you will add
One possible answer

  • Orders SLIs: success_rate, p95_latency; SLO: 99.9% success (30d)
  • Inventory SLIs: stock_check_p95, dependency_error_rate; SLO: p95 ≤ 200ms (7d rolling)
  • Page-worthy alert: 5m error_budget_burn > 10x on either service, or checkout p99 > 1.2s for 10m
  • Trace attributes: user_tier, cart_size, warehouse_id, retry_count, cache_hit (true/false)

Next steps

  • Apply these patterns to one service you own this week
  • Review alerts for noise and convert non-impacting ones into tickets
  • Run a game day: break a dependency and practice your runbook
Ready for the quick test?

Take the test below. If logged in, your score will be saved automatically.

Practice Exercises

3 exercises to complete

Instructions

You own GET /search. Traffic is spiky and has occasional 5xx bursts from a downstream index. Define:

  • 3 SLIs (with formulas)
  • 2 SLOs (targets and windows)
  • 2 alerts (burn-rate style: one fast, one slow)
Context to consider
  • Users care about fast responses and few failures
  • Occasional 429 (rate limit) is acceptable
  • CPU spikes happen during reindex; avoid paging then unless users suffer
Expected Output
Three SLIs with formulas, two SLOs with concrete targets and windows, and two burn-rate alerts referencing error budget.

Observability Concepts — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
