Observability And Operations

Learn Observability and Operations for backend engineers for free: roadmap, examples, subskills, and a skill exam.

Published: January 20, 2026 | Updated: January 20, 2026

Why Observability and Operations matter for Backend Engineers

Observability and Operations turn your backend from a black box into a system you can trust. With good logs, metrics, traces, health checks, and runbooks, you can prevent outages, detect problems early, and resolve incidents quickly. This skill unlocks on-call readiness, reliable releases, capacity planning, and data-driven performance improvements.

Who this is for

  • Backend engineers building APIs, services, jobs, or event-driven systems.
  • Developers preparing for on-call or production ownership.
  • Engineers who want measurable reliability and faster incident resolution.

Prerequisites

  • Comfort with at least one backend language (e.g., Go, Python, Java, Node.js).
  • Basic HTTP, REST/gRPC, and database fundamentals.
  • Familiarity with containerization and deployment (Docker, CI/CD). Kubernetes knowledge is helpful but not required.

Learning path (roadmap)

Milestone 1 — Instrumentation baseline (structured logs + health checks)
  • Add structured, JSON logs with stable keys (timestamp, level, service, environment, trace_id, request_id).
  • Implement /healthz (liveness) and /readyz (readiness) endpoints.
  • Ensure logs and health endpoints exist for every service.
Milestone 2 — Metrics, dashboards, and SLOs
  • Expose request counters, error rates, and latency histograms.
  • Build dashboards: traffic, saturation, errors, latency percentiles (P50/P95/P99).
  • Define SLOs (e.g., 99.9% success over 30 days) and track the error budget (see the sketch after this roadmap).
Milestone 3 — Distributed tracing basics
  • Propagate context across services; include trace/span IDs in logs.
  • Instrument key spans: inbound HTTP, DB queries, cache calls, outbound requests.
  • Sample traces strategically (e.g., errors at 100%, normal at 5%).
Milestone 4 — Alerts and on-call readiness
  • Convert SLOs into alerts (page on customer-impacting burn rate, not on CPU spikes alone).
  • Create escalation policies and quiet hours; add runbooks for every paging alert.
  • Test alerts with synthetic traffic or feature flags.
Milestone 5 — Monitoring dependencies
  • Track upstream/downstream dependencies with health checks and latency/error metrics.
  • Add circuit breakers, timeouts, and retries with jitter.
  • Alert on dependency health signals that translate into impact on your service.
Milestone 6 — Capacity planning basics
  • Measure key resource drivers: RPS, DB QPS, queue depth, CPU/memory.
  • Estimate headroom and plan for peak load. Validate with load tests.
  • Document scaling triggers (auto-scaling or manual).
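
To make the Milestone 2 error-budget math concrete, here is a minimal Python sketch; the SLO target and request volume are illustrative assumptions, not measurements.

# Error-budget math for an availability SLO (illustrative numbers).
slo_target = 0.999            # 99.9% success over a 30-day window
window_requests = 10_000_000  # assumed request volume for the window

error_budget_ratio = 1 - slo_target                     # 0.1% of requests may fail
budget_requests = window_requests * error_budget_ratio  # ~10,000 allowed failures

# Burn rate compares the observed error ratio to the budgeted ratio:
# 1.0x means the budget lasts exactly the window; 2.0x burns it in half the time.
observed_error_ratio = 0.0005  # e.g., measured over the last hour
burn_rate = observed_error_ratio / error_budget_ratio

print(f"budget: {budget_requests:.0f} failures; burn rate: {burn_rate:.1f}x")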

Worked examples

1) Structured logging (Python)
import json, logging, sys

# Formatter that emits one JSON object per log record with stable keys.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record, datefmt="%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": "orders-api",
            "env": "prod",
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None)
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Correlation IDs passed via `extra` become attributes on the log record.
extra = {"request_id": "req-9f1", "trace_id": "tr-448", "user_id": "u-123"}
logger.info("order_placed", extra=extra)

Sample log line: {"ts":"2026-01-20T12:00:00+0000","level":"INFO","service":"orders-api","env":"prod","message":"order_placed","request_id":"req-9f1","trace_id":"tr-448","user_id":"u-123"}

2) Prometheus metrics (Go) — latency histogram
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    reqLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Namespace: "orders",
            Subsystem: "api",
            Name:      "http_request_duration_seconds",
            Help:      "Request latency",
            Buckets:   prometheus.DefBuckets, // 0.005..10s
        },
        []string{"route", "method", "status"},
    )
)

func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    // ... do work ...
    w.WriteHeader(200)
    duration := time.Since(start).Seconds()
    reqLatency.WithLabelValues("/orders", r.Method, "200").Observe(duration)
}

func main() {
    prometheus.MustRegister(reqLatency)
    http.HandleFunc("/orders", handler)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Dashboard: chart P50, P95, and P99 per route using histogram_quantile() over orders_api_http_request_duration_seconds_bucket (histograms expose percentiles through their _bucket series).
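
The exact queries depend on your Prometheus setup; a P95 sketch in PromQL over the buckets above (P50 and P99 follow the same shape, and the 5m rate window is an assumption to tune):

histogram_quantile(
  0.95,
  sum by (le, route) (rate(orders_api_http_request_duration_seconds_bucket[5m]))
)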

3) OpenTelemetry tracing (Python) — HTTP service
from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from fastapi import FastAPI
import requests

resource = Resource(attributes={SERVICE_NAME: "orders-api"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()

@app.get("/orders")
def list_orders():
    # Outbound call gets trace context automatically
    r = requests.get("http://inventory:8080/stock")
    return {"ok": True, "stock": r.json()}

Ensure incoming and outgoing requests propagate trace context. Include trace_id in logs for correlation.
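
One way to pull the active trace_id into logs is via the OpenTelemetry API; a minimal sketch, assuming the JSON logger from example 1:

from typing import Optional

from opentelemetry import trace

def current_trace_id() -> Optional[str]:
    """Return the active trace ID as 32-char hex, or None if no span is active."""
    ctx = trace.get_current_span().get_span_context()
    return format(ctx.trace_id, "032x") if ctx.is_valid else None

# In a request handler, pair it with the structured logger:
# logger.info("order_placed", extra={"trace_id": current_trace_id()})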

4) Health and readiness probes (FastAPI + Kubernetes)
from fastapi import FastAPI, Response
app = FastAPI()

@app.get("/healthz")
def health():
    return {"status": "ok"}

@app.get("/readyz")
def ready():
    # Check DB and cache connectivity quickly (time-bounded)
    ok_db = True  # replace with real ping
    ok_cache = True
    return {"ready": ok_db and ok_cache}

# Deployment snippet
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /readyz, port: 8080 }
  initialDelaySeconds: 5
  periodSeconds: 5

Readiness should fail if critical dependencies are unavailable; liveness should only indicate the process is alive.

5) Alerting: SLO burn rate (Prometheus-style)
groups:
- name: slo.rules
  rules:
  - alert: APIHighSLOBurnRate
    expr: |
      # Error budget burn over 1h and 5m windows (multi-window, multi-burn)
      (
        sum(rate(http_requests_total{job="orders",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{job="orders"}[5m]))
      ) > 0.05
      and
      (
        sum(rate(http_requests_total{job="orders",status=~"5.."}[1h]))
        /
        sum(rate(http_requests_total{job="orders"}[1h]))
      ) > 0.02
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "Orders API SLO burn rate high"
      runbook: "See runbook: orders-api-slo-burn"

This alert pages only when user experience is impacted, not on transient spikes.

Drills and exercises

  • ☐ Convert one service from plain-text to structured JSON logs. Add request_id and trace_id.
  • ☐ Add a latency histogram and a counter for 5xx responses. Verify values in your metrics endpoint.
  • ☐ Instrument two spans in a critical request path and view them in your tracer UI.
  • ☐ Create /healthz and /readyz and simulate a dependency outage to see readiness fail.
  • ☐ Draft a simple SLO (availability and latency) and compute a weekly error budget.
  • ☐ Write a one-page runbook for “High latency on GET /orders”.

Common mistakes and debugging tips

Mistake: Logging everything at INFO

Tip: Use levels consistently: DEBUG for dev details, INFO for state changes, WARN for recoverable anomalies, ERROR for failures impacting the request.

Mistake: Alerts on CPU or GC only

Tip: Page on customer-impacting symptoms (SLO burn, high 5xx). Route infra metrics to non-paging alerts unless they directly affect users.

Mistake: Missing context propagation

Tip: Ensure trace headers are propagated on all outbound calls. Include trace_id and request_id in every log.

Mistake: Slow readiness checks

Tip: Bound readiness checks with short timeouts and cache results; they run often and should be cheap.
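
A minimal sketch of that caching pattern in Python; the TTL and the check_dependencies helper are assumptions:

import time

_TTL_SECONDS = 5.0  # reuse the last result for this long
_cache = {"checked_at": 0.0, "ready": False}

def check_dependencies() -> bool:
    # Replace with real pings guarded by short timeouts (e.g., 500 ms each).
    return True

def is_ready() -> bool:
    now = time.monotonic()
    if now - _cache["checked_at"] > _TTL_SECONDS:
        _cache["ready"] = check_dependencies()
        _cache["checked_at"] = now
    return _cache["ready"]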

Mistake: Noisy dashboards

Tip: Create purposeful dashboards per service: Overview (golden signals), Dependency view, and Release view with deploy markers.

Mini project: Production-grade Orders API

Goal: Take a simple Orders API and make it production-ready with observability and ops.

  • ☐ Structured logs with request_id, trace_id, user_id.
  • ☐ Metrics: request count, error rate, latency histogram by route.
  • ☐ Tracing: end-to-end trace with DB and cache spans.
  • ☐ Health endpoints: /healthz (process) and /readyz (checks DB + cache).
  • ☐ Dashboard: golden signals + dependency panel.
  • ☐ Alert: SLO burn rate page, latency warning, error-rate warning.
  • ☐ Runbook: “Orders API SLO burn” with verify, mitigate, rollback steps.
  • ☐ Capacity note: show current RPS, P99 latency, and headroom at a 60% CPU ceiling (see the sketch after the acceptance criteria).
Acceptance criteria
  • All endpoints exposed; metrics visible; traces link to logs via trace_id.
  • Readiness flips to false when the DB is unreachable, and the platform stops routing traffic to the instance.
  • Alert fires on synthetic 10% error injection; runbook instructions resolve it.
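
A sketch of the headroom estimate behind the capacity note; the load numbers are illustrative assumptions:

# Headroom against a CPU ceiling (illustrative numbers; assumes CPU scales
# roughly linearly with RPS, which you should validate with load tests).
current_rps = 400.0   # measured requests per second
current_cpu = 0.35    # measured CPU utilization
cpu_ceiling = 0.60    # scaling trigger from the capacity note

rps_at_ceiling = current_rps * (cpu_ceiling / current_cpu)
headroom_rps = rps_at_ceiling - current_rps

print(f"~{headroom_rps:.0f} RPS of headroom before the {cpu_ceiling:.0%} ceiling")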

Subskills

  • Structured Logging — Produce machine-parseable JSON logs with stable keys; correlate with trace_id.
  • Metrics And Dashboards — Expose counters, gauges, histograms; build practical dashboards around golden signals.
  • Distributed Tracing Basics — Add spans across service boundaries and propagate context.
  • Alerting And On Call Basics — Turn SLOs into alerts, reduce noise, and prepare runbooks.
  • Health Checks And Readiness Probes — Distinguish liveness from readiness; probe dependencies safely.
  • Monitoring Dependencies — Track upstream/downstream health and protect with timeouts, retries, and circuit breakers (see the sketch after this list).
  • Capacity Planning Basics — Measure load drivers and estimate headroom; plan scale events.
  • Operational Runbooks — Create repeatable, searchable guides for incidents and routine ops.
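
The timeouts/retries/jitter pattern from Monitoring Dependencies, as a minimal Python sketch; the attempt count and delays are assumptions to tune per dependency:

import random
import time

def retry_with_jitter(call, attempts=3, base_delay=0.1, max_delay=2.0):
    """Run `call`, retrying with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # retry budget exhausted; surface the last error
            # Full jitter: sleep a uniform random time up to the capped backoff.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

# Usage (keep per-call timeouts short so retries do not pile up):
# retry_with_jitter(lambda: requests.get(url, timeout=0.5))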

Practical projects

  • Build a “golden signals” dashboard for one service and present a 2–3 sentence summary of current health.
  • Instrument end-to-end tracing across two services (API + worker) and prove context propagation.
  • Create a dependency outage game day: simulate a slow DB and verify alerts, dashboards, and runbooks help you resolve it quickly.

Next steps

  • Expand tracing coverage to the top 3 slowest endpoints and their hottest DB queries.
  • Evolve SLOs with stakeholder input and add burn rate alerts per critical user journey.
  • Automate runbook checks and add a weekly on-call review to prune noisy alerts.

Observability And Operations — Skill Exam

This exam checks your practical understanding of observability and operations. You can take it for free; no signup is required. If you log in, your progress and results are saved so you can resume later. Rules: open notes allowed; no time limit. Aim for at least 70% to pass. Read each scenario carefully and pick the best answer(s).

12 questions · 70% to pass
