Why Observability and Operations matter for Backend Engineers
Observability and Operations turn your backend from a black box into a system you can trust. With good logs, metrics, traces, health checks, and runbooks, you can prevent outages, detect problems early, and resolve incidents quickly. This skill unlocks on-call readiness, reliable releases, capacity planning, and data-driven performance improvements.
Who this is for
- Backend engineers building APIs, services, jobs, or event-driven systems.
- Developers preparing for on-call or production ownership.
- Engineers who want measurable reliability and faster incident resolution.
Prerequisites
- Comfort with at least one backend language (e.g., Go, Python, Java, Node.js).
- Basic HTTP, REST/gRPC, and database fundamentals.
- Familiarity with containerization and deployment (Docker, CI/CD). Kubernetes knowledge is helpful but not required.
Learning path (roadmap)
Milestone 1 — Instrumentation baseline (structured logs + health checks)
- Add structured JSON logs with stable keys (timestamp, level, service, environment, trace_id, request_id).
- Implement /healthz (liveness) and /readyz (readiness) endpoints.
- Ensure logs and health endpoints exist for every service.
Milestone 2 — Metrics, dashboards, and SLOs
- Expose request counters, error rates, and latency histograms.
- Build dashboards: traffic, saturation, errors, latency percentiles (P50/P95/P99).
- Define SLOs (e.g., 99.9% success over 30 days) and track the error budget (see the calculation sketch after this list).
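To make an SLO concrete, compute the error budget it implies. A minimal sketch, assuming a 99.9% success SLO over a 30-day window and a hypothetical traffic volume:
# Error budget for a 99.9% success SLO over 30 days (illustrative numbers).
slo = 0.999                    # target success ratio
window_days = 30
requests_per_day = 2_000_000   # hypothetical traffic for this service
total_requests = requests_per_day * window_days
error_budget_requests = (1 - slo) * total_requests        # failures you can "spend"
error_budget_minutes = (1 - slo) * window_days * 24 * 60  # if expressed as full downtime
print(f"Allowed failed requests: {error_budget_requests:,.0f}")
print(f"Equivalent full-outage minutes: {error_budget_minutes:.1f}")
# 99.9% over 30 days allows about 43.2 minutes of full outage, or 0.1% of requests.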
Milestone 3 — Distributed tracing basics
- Propagate context across services; include trace/span IDs in logs.
- Instrument key spans: inbound HTTP, DB queries, cache calls, outbound requests.
- Sample traces strategically (e.g., errors at 100%, normal traffic at 5%); see the sampler sketch after this list.
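Head-based ratio sampling can be set on the tracer provider. A minimal sketch with the OpenTelemetry Python SDK, assuming a 5% sample rate; keeping 100% of error traces generally requires tail-based sampling in a collector, which this sketch does not cover:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# Respect the parent's sampling decision; otherwise keep ~5% of new traces.
sampler = ParentBased(root=TraceIdRatioBased(0.05))
provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(provider)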
Milestone 4 — Alerts and on-call readiness
- Convert SLOs into alerts (page on customer-impacting burn rate, not on CPU spikes alone).
- Define escalation policies and quiet hours; add a runbook for every paging alert.
- Test alerts with synthetic traffic or feature flags.
Milestone 5 — Monitoring dependencies
- Track upstream/downstream dependencies with health checks and latency/error metrics.
- Add circuit breakers, timeouts, and retries with jitter (see the retry sketch after this list).
- Alert on dependency health signals that translate into user-visible impact on your service.
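A minimal sketch of an outbound call wrapped with a hard timeout and capped, jittered exponential backoff; the inventory URL, attempt limits, and delays are illustrative:
import random
import time
import requests

def call_inventory(path, max_attempts=3, base_delay=0.2, timeout=1.0):
    """Call a dependency with a hard timeout and jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(f"http://inventory:8080{path}", timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == max_attempts:
                raise  # surface the failure; a circuit breaker could open here
            # Full jitter on the backoff avoids synchronized retry storms.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)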
Milestone 6 — Capacity planning basics
- Measure key resource drivers: RPS, DB QPS, queue depth, CPU/memory.
- Estimate headroom and plan for peak load; validate with load tests (see the headroom sketch after this list).
- Document scaling triggers (auto-scaling or manual).
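A back-of-the-envelope headroom estimate, assuming CPU is the limiting resource and throughput scales roughly linearly with it; the numbers are placeholders to swap for your own dashboard values:
# Rough headroom estimate (assumes near-linear scaling up to a chosen CPU ceiling).
current_rps = 400         # observed steady-state requests/second
current_cpu_util = 0.45   # observed average CPU utilization (0..1)
cpu_ceiling = 0.60        # utilization at which you want to scale out
est_capacity_rps = current_rps * (cpu_ceiling / current_cpu_util)
headroom_rps = est_capacity_rps - current_rps
print(f"Estimated capacity at {cpu_ceiling:.0%} CPU: {est_capacity_rps:.0f} RPS "
      f"(headroom ~{headroom_rps:.0f} RPS)")
# Validate the linearity assumption with a load test before trusting this number.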
Worked examples
1) Structured logging (Python)
import json, logging, sys
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record, datefmt="%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": "orders-api",
            "env": "prod",
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None)
        }
        return json.dumps(payload)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
extra = {"request_id": "req-9f1", "trace_id": "tr-448", "user_id": "u-123"}
logger.info("order_placed", extra=extra)
Sample log line: {"ts":"2026-01-20T12:00:00+0000","level":"INFO","service":"orders-api","env":"prod","message":"order_placed","request_id":"req-9f1","trace_id":"tr-448","user_id":"u-123"}
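Passing extra on every log call gets tedious. One common pattern is a logging filter that injects request-scoped fields from contextvars; this sketch reuses the handler defined above and assumes your request middleware sets the context variable:
import contextvars
import logging

# Set by your request middleware at the start of each request (assumed elsewhere).
request_id_var = contextvars.ContextVar("request_id", default=None)

class RequestContextFilter(logging.Filter):
    def filter(self, record):
        # Attach the current request_id so the formatter can emit it on every line.
        record.request_id = request_id_var.get()
        return True

handler.addFilter(RequestContextFilter())  # reuse the handler configured above
request_id_var.set("req-9f1")
logger.info("order_placed")  # request_id now appears without passing extra=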
2) Prometheus metrics (Go) — latency histogram
package main
import (
	"net/http"
	"time"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
	reqLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "orders",
			Subsystem: "api",
			Name:      "http_request_duration_seconds",
			Help:      "Request latency",
			Buckets:   prometheus.DefBuckets, // 0.005..10s
		},
		[]string{"route", "method", "status"},
	)
)
func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	// ... do work ...
	w.WriteHeader(200)
	duration := time.Since(start).Seconds()
	reqLatency.WithLabelValues("/orders", r.Method, "200").Observe(duration)
}
func main() {
	prometheus.MustRegister(reqLatency)
	http.HandleFunc("/orders", handler)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
Dashboard: chart P50/P95/P99 per route using histogram_quantile over the bucket series, e.g. histogram_quantile(0.95, sum by (le, route) (rate(orders_api_http_request_duration_seconds_bucket[5m]))).
3) OpenTelemetry tracing (Python) — HTTP service
from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from fastapi import FastAPI
import requests
resource = Resource(attributes={SERVICE_NAME: "orders-api"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()
@app.get("/orders")
def list_orders():
    # Outbound call gets trace context automatically
    r = requests.get("http://inventory:8080/stock")
    return {"ok": True, "stock": r.json()}
Ensure incoming and outgoing requests propagate trace context. Include trace_id in logs for correlation.
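To correlate logs with traces, read the current span's IDs from the OpenTelemetry API and attach them to log records. A minimal sketch; the logger name mirrors the structured-logging example above:
import logging
from opentelemetry import trace

logger = logging.getLogger("orders")

def current_trace_ids():
    """Return (trace_id, span_id) as hex strings, or (None, None) outside a trace."""
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return None, None
    return format(ctx.trace_id, "032x"), format(ctx.span_id, "016x")

trace_id, span_id = current_trace_ids()
logger.info("order_listed", extra={"trace_id": trace_id, "span_id": span_id})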
4) Health and readiness probes (FastAPI + Kubernetes)
from fastapi import FastAPI
from fastapi.responses import JSONResponse
app = FastAPI()
@app.get("/healthz")
def health():
    return {"status": "ok"}
@app.get("/readyz")
def ready():
    # Check DB and cache connectivity quickly (time-bounded)
    ok_db = True  # replace with a real, time-bounded ping
    ok_cache = True
    if ok_db and ok_cache:
        return {"ready": True}
    # Kubernetes treats any 2xx as success, so signal "not ready" with a 503
    return JSONResponse(status_code=503, content={"ready": False})
# Deployment snippet
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /readyz, port: 8080 }
  initialDelaySeconds: 5
  periodSeconds: 5
Readiness should fail if critical dependencies are unavailable; liveness should only indicate the process is alive.
5) Alerting: SLO burn rate (Prometheus-style)
groups:
  - name: slo.rules
    rules:
      - alert: APIHighSLOBurnRate
        expr: |
          # Error budget burn over 1h and 5m windows (multi-window, multi-burn)
          (
            sum(rate(http_requests_total{job="orders",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="orders"}[5m]))
          ) > 0.05
          and
          (
            sum(rate(http_requests_total{job="orders",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="orders"}[1h]))
          ) > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Orders API SLO burn rate high"
          runbook: "See runbook: orders-api-slo-burn"
This alert pages only when user experience is impacted; the for: 10m clause and the 1h window help filter transient spikes.
Drills and exercises
- ☐ Convert one service from plain-text to structured JSON logs. Add request_id and trace_id.
- ☐ Add a latency histogram and a counter for 5xx responses. Verify values in your metrics endpoint.
- ☐ Instrument two spans in a critical request path and view them in your tracer UI.
- ☐ Create /healthz and /readyz and simulate a dependency outage to see readiness fail.
- ☐ Draft a simple SLO (availability and latency) and compute a weekly error budget.
- ☐ Write a one-page runbook for “High latency on GET /orders”.
Common mistakes and debugging tips
Mistake: Logging everything at INFO
Tip: Use levels consistently: DEBUG for dev details, INFO for state changes, WARN for recoverable anomalies, ERROR for failures impacting the request.
Mistake: Alerts on CPU or GC only
Tip: Page on customer-impacting symptoms (SLO burn, high 5xx). Route infra metrics to non-paging alerts unless they directly affect users.
Mistake: Missing context propagation
Tip: Ensure trace headers are propagated on all outbound calls. Include trace_id and request_id in every log.
Mistake: Slow readiness checks
Tip: Bound readiness checks with short timeouts and cache results; they run often and should be cheap.
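A minimal sketch of a cached, time-bounded dependency check; ping_db is a hypothetical placeholder for your driver's lightweight ping:
import time

_CACHE_TTL_SECONDS = 5.0
_last_check = {"ts": 0.0, "ok": False}

def ping_db(timeout: float) -> bool:
    """Hypothetical cheap dependency ping; replace with e.g. SELECT 1 via your driver."""
    return True

def db_ready() -> bool:
    now = time.monotonic()
    # Reuse the last result for a few seconds so frequent probes stay cheap.
    if now - _last_check["ts"] < _CACHE_TTL_SECONDS:
        return _last_check["ok"]
    ok = ping_db(timeout=0.5)  # hard bound so a slow DB can't stall the probe
    _last_check.update(ts=now, ok=ok)
    return ok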
Mistake: Noisy dashboards
Tip: Create purposeful dashboards per service: Overview (golden signals), Dependency view, and Release view with deploy markers.
Mini project: Production-grade Orders API
Goal: Take a simple Orders API and make it production-ready with observability and ops.
- ☐ Structured logs with request_id, trace_id, user_id.
- ☐ Metrics: request count, error rate, latency histogram by route.
- ☐ Tracing: end-to-end trace with DB and cache spans.
- ☐ Health endpoints: /healthz (process) and /readyz (checks DB + cache).
- ☐ Dashboard: golden signals + dependency panel.
- ☐ Alert: SLO burn rate page, latency warning, error-rate warning.
- ☐ Runbook: “Orders API SLO burn” with verify, mitigate, rollback steps.
- ☐ Capacity note: show current RPS, P99 latency, and headroom at 60% CPU.
Acceptance criteria
- All endpoints exposed; metrics visible; traces link to logs via trace_id.
- Readiness flips to false when the DB is unreachable, and Kubernetes stops routing traffic to the pod.
- Alert fires on synthetic 10% error injection; runbook instructions resolve it.
Subskills
- Structured Logging — Produce machine-parseable JSON logs with stable keys; correlate with trace_id.
- Metrics and Dashboards — Expose counters, gauges, histograms; build practical dashboards around golden signals.
- Distributed Tracing Basics — Add spans across service boundaries and propagate context.
- Alerting and On-Call Basics — Turn SLOs into alerts, reduce noise, and prepare runbooks.
- Health Checks and Readiness Probes — Distinguish liveness from readiness; probe dependencies safely.
- Monitoring Dependencies — Track upstream/downstream health and protect with timeouts/retries/circuit breakers.
- Capacity Planning Basics — Measure load drivers and estimate headroom; plan scale events.
- Operational Runbooks — Create repeatable, searchable guides for incidents and routine ops.
Practical projects
- Build a “golden signals” dashboard for one service and present a 2–3 sentence summary of current health.
- Instrument end-to-end tracing across two services (API + worker) and prove context propagation.
- Create a dependency outage game day: simulate a slow DB and verify alerts, dashboards, and runbooks help you resolve it quickly.
Next steps
- Expand tracing coverage to the top 3 slowest endpoints and their hottest DB queries.
- Evolve SLOs with stakeholder input and add burn rate alerts per critical user journey.
- Automate runbook checks and add a weekly on-call review to prune noisy alerts.