Who this is for
Backend engineers who build, run, or debug services and want reliable, explainable systems they can fix fast.
Prerequisites
- Basic HTTP and API knowledge
- Familiarity with service logs and application metrics
- Comfort reading error messages and stack traces
Why this matters
Real tasks you will face:
- Pinpoint why latency spiked after a deploy
- Find which dependency causes 5xx errors
- Set alerts that wake you for real incidents, not noise
- Prove you meet reliability targets during audits
Progress note
The quick test at the end is available to everyone. If you are logged in, your progress will be saved automatically.
Concept explained simply
Observability means answering: What is my system doing? Why is it doing that? Can I trust it?
It is built from three main signals:
- Logs: facts and context about events. Best when structured (key=value).
- Metrics: numbers over time. Best for trends and alerts.
- Traces: request journeys across services. Best for cause-and-effect.
Mental model
Think of your system as a city at night:
- Metrics are the skyline: you see overall shapes (CPU, latency, error rate).
- Logs are street-level notes: what happened at a specific place and time.
- Traces are a taxi's GPS: the exact path a request took across the city.
Correlate them using a request ID. Use metrics to detect, traces to localize, logs to explain.
Core building blocks
Logs
- Structured: use consistent fields like level, ts, service, env, trace_id, span_id.
- Levels: debug, info, warn, error; production defaults to info or warn.
- Do not log PII or secrets. Mask tokens and emails when needed.
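Below is a minimal sketch of structured logging with these fields, using only Python's standard library. The service and env values, and the JsonFormatter class itself, are illustrative rather than a prescribed schema.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with consistent fields."""
    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "checkout",   # illustrative; read from config in a real service
            "env": "prod",
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "msg": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # production default: info or warn

# trace_id/span_id are passed via `extra` so every line can be joined to a trace
logger.info("payment authorized", extra={"trace_id": "abc123", "span_id": "def456"})
```

Emitting one JSON object per line keeps logs machine-parseable without changing how call sites use the logger.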
Metrics
- Types: counter (monotonic), gauge (current value), histogram (distribution).
- Cardinality: avoid unbounded label values (e.g., user_id) to control cost.
- Percentiles: track p50/p90/p95/p99 latency with histograms; pre-computed percentiles cannot be aggregated across instances.
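A sketch of the three metric types, assuming the prometheus_client library; the metric and label names are illustrative.

```python
from prometheus_client import Counter, Gauge, Histogram

# Counter: monotonic; a rate over it gives traffic and error rates.
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "route", "status"],   # bounded label values only
)

# Gauge: a current value that can go up or down.
QUEUE_DEPTH = Gauge("queue_depth", "Jobs currently waiting in the queue")

# Histogram: a distribution; buckets let you derive p50/p95/p99 later.
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["route"],
    buckets=(0.05, 0.1, 0.3, 0.5, 1.0, 2.5),
)

def record_request(method: str, route: str, status: int, duration_s: float) -> None:
    REQUESTS.labels(method=method, route=route, status=str(status)).inc()
    LATENCY.labels(route=route).observe(duration_s)

record_request("GET", "/search", 200, 0.212)
QUEUE_DEPTH.set(17)  # e.g. sampled from the job broker
```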
Traces
- Trace: a request; spans: timed operations inside it.
- Propagate a trace_id across services; add span attributes for context.
- Sampling: keep enough traces for diagnosis; head-sample a baseline rate and tail-sample slow or failed requests.
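A sketch of a span with attributes and context propagation, assuming the OpenTelemetry Python API is installed and an SDK/exporter is configured elsewhere; the operation and attribute names are illustrative.

```python
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("checkout")

def charge(cart_size: int, payment_provider: str) -> None:
    # One span per timed operation; attributes add context for later filtering.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("cart_size", cart_size)
        span.set_attribute("payment_provider", payment_provider)

        # Propagate the current trace context to the next service via headers,
        # so its spans join the same trace_id.
        headers: dict[str, str] = {}
        inject(headers)
        # http_client.post("https://payments.internal/charge", headers=headers)  # illustrative call
```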
Golden signals and methods
- Golden signals: latency, traffic, errors, saturation.
- RED (for microservices): Rate, Errors, Duration.
- USE (for resources): Utilization, Saturation, Errors.
SLI, SLO, alerts
- SLI: a measurement of service behavior (e.g., the fraction of successful requests).
- SLO: target (e.g., 99.9% success over 30 days).
- Alert: on symptoms (user impact), not just causes (CPU spikes alone).
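To make the terms concrete, a small sketch of an availability SLI and the remaining error budget; the request counts are made up.

```python
def availability_sli(total: int, failed: int) -> float:
    """SLI: fraction of successful requests over the measurement window."""
    return 1.0 if total == 0 else 1 - failed / total

def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    return 1 - (1 - sli) / (1 - slo)

# Illustrative 30-day window: 10,000,000 requests, 7,500 failures, 99.9% SLO
sli = availability_sli(10_000_000, 7_500)          # 0.99925
budget = error_budget_remaining(sli, slo=0.999)    # 0.25 -> 75% of the budget is spent
print(f"SLI={sli:.5f}  error budget remaining={budget:.2f}")
```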
Dashboards and runbooks
- Dashboard flow: overview → service drilldown → instance detail → trace/log links.
- Runbooks: short steps to verify, mitigate, and escalate during incidents.
Worked examples
Example 1: API latency and availability
- SLIs: success_rate = 1 - (5xx + 429)/total; latency_p95 for GET /search
- SLOs: success_rate ≥ 99.9% (30d), p95 latency ≤ 300ms (7d)
- Alerts: page if 5-minute error_budget_burn > 10x; ticket if 1-hour burn > 2x
Why it works
Burn-rate alerts scale with the window: a fast window catches sudden outages, a slow window catches gradual budget burn, and neither pages on brief blips.
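A sketch of the two-window arithmetic; in practice this lives in your monitoring system's alert rules, but the thresholds match the example above (page at 10x over 5 minutes, ticket at 2x over 1 hour).

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_rate / (1 - slo)

def alert_decision(error_rate_5m: float, error_rate_1h: float, slo: float = 0.999) -> str:
    if burn_rate(error_rate_5m, slo) > 10:
        return "page"     # fast burn: outage-level spending of the budget
    if burn_rate(error_rate_1h, slo) > 2:
        return "ticket"   # slow burn: investigate during working hours
    return "ok"

# 1.5% errors over 5 minutes against a 99.9% SLO is a 15x burn -> page
print(alert_decision(error_rate_5m=0.015, error_rate_1h=0.004))
```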
Example 2: Trace a slow checkout
- Propagate trace_id from web → checkout → payment → inventory
- Add span attributes: user_tier, cart_size, payment_provider
- Tail-sample traces where checkout duration > 1s or status=error
Outcome
Traces show inventory service p95 at 850ms when cart_size>20 → target that dependency.
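A toy version of the tail-sampling rule above; real deployments usually express this as collector configuration, but the decision is the same. The baseline rate is an assumption.

```python
import random

def keep_trace(duration_s: float, status: str, baseline_rate: float = 0.01) -> bool:
    """Decide after the checkout trace completes whether to keep it."""
    if status == "error":
        return True                         # always keep failures
    if duration_s > 1.0:
        return True                         # always keep slow checkouts (> 1s)
    return random.random() < baseline_rate  # keep a small baseline of normal traffic

# A 1.4s successful checkout is kept for diagnosis
print(keep_trace(duration_s=1.4, status="ok"))
```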
Example 3: Background worker logging
- Structured logs with fields: job_type, attempt, queue, duration_ms, result
- On error, log once with error, cause, retryable=true/false
- Emit metrics: counter job_failures{job_type}, histogram job_duration_ms{job_type}
Result
Fewer noisy lines, faster MTTR because logs and metrics align by job_type.
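A sketch combining the log fields and metrics above, assuming prometheus_client and a structured logger like the one shown earlier; the three-attempt retry limit is an assumption.

```python
import logging
import time
from prometheus_client import Counter, Histogram

JOB_FAILURES = Counter("job_failures", "Failed jobs", ["job_type"])
JOB_DURATION = Histogram("job_duration_ms", "Job duration in milliseconds", ["job_type"])

logger = logging.getLogger("worker")

def run_job(job_type: str, queue: str, attempt: int, handler) -> None:
    """Run one job via `handler` (a no-argument callable), logging and measuring it."""
    start = time.monotonic()
    try:
        handler()
    except Exception as exc:
        JOB_FAILURES.labels(job_type=job_type).inc()
        # One log line per failure with everything on-call needs to diagnose it.
        logger.error("job failed", extra={
            "job_type": job_type, "queue": queue, "attempt": attempt,
            "error": type(exc).__name__, "cause": str(exc),
            "retryable": attempt < 3,        # assumption: three attempts max
        })
    finally:
        duration_ms = (time.monotonic() - start) * 1000
        JOB_DURATION.labels(job_type=job_type).observe(duration_ms)
```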
Try it: Exercises
Complete the tasks below. These match the exercises section so you can compare results.
Exercise 1: Search API SLI/SLO + alerts
Define 3 SLIs, 2 SLOs, and 2 alerts for a /search endpoint under variable load. Consider latency, availability, and saturation.
Exercise 2: Structured logging for payments
Design the log schema and examples for a payment failure so on-call can diagnose within 5 minutes.
Exercise 3: Trace plan for queue pipeline
Create a trace model across producer → queue → consumer with sampling rules to catch slow handlers.
Checklist
- Include trace_id in logs
- Avoid high-cardinality labels
- Alert on symptoms, not raw CPU
- Define runbook steps per alert
Common mistakes and self-check
Mistake: No correlation between signals
Fix: Put trace_id in every log line and span; keep it out of metric labels, where it would explode cardinality. Self-check: Can you jump from a spike to a specific trace in two clicks?
Mistake: Alert fatigue
Fix: Page on user impact (error budget burn), ticket on resource warnings. Self-check: Were you paged in the last week for a non-user-impact event?
Mistake: High metric cardinality
Fix: Drop user_id and email from labels; use cohort or tier instead. Self-check: Do any labels have unbounded values?
Mistake: Logging secrets/PII
Fix: Mask or omit; see the masking sketch below. Self-check: Search logs for token= or Authorization patterns before deploying.
Mistake: Only p95 latency
Fix: Track p50, p95, p99. Self-check: Does p99 ever exceed SLO while p95 looks fine?
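For the secrets/PII mistake, a minimal masking sketch; the patterns are illustrative and should be extended for your own token and email formats.

```python
import re

_TOKEN = re.compile(r"(token=|Authorization:\s*Bearer\s+)\S+", re.IGNORECASE)
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_sensitive(message: str) -> str:
    """Mask tokens and emails before a message reaches the log pipeline."""
    message = _TOKEN.sub(r"\1[REDACTED]", message)
    return _EMAIL.sub("[EMAIL]", message)

print(mask_sensitive("retry for jane@example.com with token=abc123"))
# -> retry for [EMAIL] with token=[REDACTED]
```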
Practical projects
- Instrument a demo service with counters (requests), histograms (latency), and structured logs for errors.
- Create a burn-rate alert using two windows (fast 5m and slow 1h) and write a 6-line runbook.
- Add tracing spans around outbound calls; record attributes payment_provider and retry_count.
Tip: Minimal fields for production
- logs: ts, level, service, env, trace_id, msg
- metrics: counters and histograms for the golden signals
- traces: service name, span kind, duration, status
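For reference, one log line containing only the minimal fields above; the values are illustrative.

```python
import json, time

line = {
    "ts": int(time.time() * 1000),   # epoch millis; any consistent format works
    "level": "warn",
    "service": "search-api",
    "env": "prod",
    "trace_id": "4bf92f3577b34da6",
    "msg": "upstream timeout, served cached results",
}
print(json.dumps(line))
```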
Learning path
- Start: Golden signals, SLIs, SLOs
- Instrument: Structured logs → metrics (counters/histograms) → basic traces
- Operate: Dashboards, burn-rate alerts, runbooks
- Refine: Sampling strategy, reduce cardinality, add p99
- Scale: Cross-service tracing and dependency maps
- Govern: Data retention, cost, and privacy controls
Mini challenge
You are splitting a monolith into two services: orders and inventory. Latency spikes appear during peak hours. Design:
- Two SLIs per service
- One SLO per service
- One page-worthy alert for the system
- Trace attributes you will add
One possible answer
- Orders SLIs: success_rate and p95_latency; SLO: 99.9% success over 30 days.
- Inventory SLIs: stock_check_p95 and dependency_error_rate; SLO: p95 ≤ 200ms over a rolling 7 days.
- Page-worthy alert: 5-minute error_budget_burn > 10x on either service, or checkout p99 > 1.2s for 10 minutes.
- Trace attributes: user_tier, cart_size, warehouse_id, retry_count, cache_hit.
Next steps
- Apply these patterns to one service you own this week
- Review alerts for noise and convert non-impacting ones into tickets
- Run a game day: break a dependency and practice your runbook
Ready for the quick test?
Take the test below. If logged in, your score will be saved automatically.