Why this matters
As a backend engineer, you often debug issues that cross multiple components: an API gateway, services, queues, and databases. Distributed tracing shows the full path of a request through these components, with timing for each step. It helps you:
- Find the slowest hop causing latency or timeouts.
- See failed spans to locate errors quickly.
- Correlate logs and metrics with the exact request.
- Validate SLIs/SLOs by understanding where time is spent.
Who this is for
- Backend and platform engineers building or maintaining microservices.
- Engineers moving from logs-only monitoring to full observability.
- Developers who need faster incident response and root-cause analysis.
Prerequisites
- Basic HTTP knowledge (methods, headers, status codes).
- Familiarity with services calling other services and using databases/queues.
- Basic understanding of logs and metrics.
Concept explained simply
A trace is the story of one request. It is made of spans, where each span describes one operation (like an HTTP call, a DB query, or a message consume). Spans have timing, status, attributes (tags), and relationships (parent-child or links).
- Trace ID: One ID for the whole request path.
- Span ID: Unique ID for each operation.
- Parent Span ID: Shows how spans connect into a tree.
- Context propagation: Passing the trace context across process boundaries via headers (for example, W3C Trace Context).
- Sampling: Choosing which traces to record to control cost.
Mental model
Imagine a parcel traveling through sorting facilities. The parcel has one barcode (trace ID). Each facility scan is a span. Scans are linked to show the path and timing. If one facility is slow, you see the delay on that span.
Key terms you will use
- W3C Trace Context: The standard traceparent and tracestate headers for propagating trace context across services.
- Attributes/Tags: Key–value pairs on spans (e.g., http.method, db.system) for filtering and analysis.
- Events/Logs-in-span: Timestamped notes inside a span (e.g., "retry started").
- Links: Connect a span to related spans when there is no strict parent-child (common with async messaging).
Worked examples
Example 1 — Simple HTTP chain: API → User Service → DB
Flow: Client calls API → API calls User Service → User Service queries DB.
- Trace ID: a single ID, e.g., 4bf92f3577b34da6a3ce929d0e0e4736
- Spans:
  - Span A (API inbound): http.server.request, parent: none (root).
  - Span B (API → User Service): http.client.request, parent: A.
  - Span C (User Service inbound): http.server.request, parent: B.
  - Span D (User Service → DB): db.query, parent: C.
- Diagnosing latency: If D takes 400 ms while others are small, the DB is the bottleneck.
Example header (W3C traceparent):
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
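The bottleneck reasoning above can be sketched in a few lines. The tuple format and durations below are made up for illustration; real spans would come from your tracing SDK or backend:

```python
# Toy span records: (span_name, parent_name, duration_ms).
# Durations are illustrative, matching the example above.
spans = [
    ("A: API inbound",          None,                     460),
    ("B: API -> User Service",  "A: API inbound",         440),
    ("C: User Service inbound", "B: API -> User Service", 430),
    ("D: User Service -> DB",   "C: User Service inbound", 400),
]

def self_time(span, all_spans):
    """Duration minus time spent in direct children."""
    name, _, duration = span
    children = sum(d for _, parent, d in all_spans if parent == name)
    return duration - children

# The span with the largest self-time is the bottleneck.
bottleneck = max(spans, key=lambda s: self_time(s, spans))
print(bottleneck[0])  # D: the DB query dominates the trace
```

Self-time (duration minus children) matters: span A is 460 ms long, but only 20 ms of that is its own work.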
Example 2 — Async messaging: Order Service → Queue → Billing Worker
Flow: Order Service publishes a message; Billing Worker consumes it later.
- Order Service creates Span O (publish). It injects context into message headers.
- Billing Worker creates Span B (consume). Parent-child may be ambiguous. Use a link from B to O if processing is decoupled.
- This preserves end-to-end visibility without forcing a strict parent-child across async boundaries.
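A minimal sketch of this pattern, using plain dataclasses rather than a real tracing SDK (OpenTelemetry exposes the same idea as span links); all names and helpers here are illustrative:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import secrets

@dataclass
class SpanContext:
    trace_id: str
    span_id: str

@dataclass
class Span:
    name: str
    context: SpanContext
    parent: Optional[SpanContext] = None
    links: List[SpanContext] = field(default_factory=list)

def new_context() -> SpanContext:
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))

# Order Service: Span O wraps the publish; its context is injected
# into the message headers as a traceparent.
span_o = Span("billing.publish", new_context())
message_headers = {
    "traceparent": f"00-{span_o.context.trace_id}-{span_o.context.span_id}-01"
}

# Billing Worker, possibly much later: start a fresh trace for the
# consume, but LINK back to the publish span instead of parenting to it.
_, trace_id, span_id, _ = message_headers["traceparent"].split("-")
span_b = Span("billing.consume", new_context(),
              links=[SpanContext(trace_id, span_id)])
```

The consume span lives in its own trace, yet the link still lets a tracing backend navigate from the processing back to the originating publish.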
Example 3 — Error tracing and fast root cause
Flow: API → Inventory → DB. Inventory throws a timeout.
- Span I (Inventory inbound) status = ERROR, attribute error.type = timeout.
- Child Span Q (DB query) shows 499 ms duration and error = deadline_exceeded.
- Root cause: DB query duration exceeded Inventory’s timeout budget.
- Fix: Increase timeout or add an index to speed up the query.
Example 4 — Sampling impact
Head sampling at 10%: Only 1 in 10 requests record full traces. If you miss rare errors, consider tail-based sampling to keep slow/error traces while dropping the rest.
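The two strategies can be sketched as follows; the trace-id hashing trick and the 500 ms tail threshold are illustrative assumptions, not a prescribed configuration:

```python
def head_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministic head sampling: every service computing this on the
    same trace-id reaches the same keep/drop decision."""
    # Treat the last 8 hex digits of the trace-id as a uniform-ish value.
    return int(trace_id[-8:], 16) / 0xFFFFFFFF < rate

def tail_keep(trace: dict) -> bool:
    """Tail-based rule: keep error or slow traces, drop the rest."""
    return trace["error"] or trace["duration_ms"] > 500

traces = [
    {"error": False, "duration_ms": 120},  # healthy and fast: dropped
    {"error": True,  "duration_ms": 90},   # rare error: kept
    {"error": False, "duration_ms": 900},  # slow: kept
]
kept = [t for t in traces if tail_keep(t)]
print(len(kept))  # 2
```

Note the trade-off: head sampling decides before the outcome is known, so it is cheap but blind; tail sampling sees the full trace first, at the cost of buffering.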
How to implement (step-by-step)
- Pick a standard: Use W3C Trace Context for cross-service propagation.
- Instrument inbound edges: Create a server span for each incoming request or message.
- Propagate context: Inject and extract trace context in outgoing HTTP calls and messages.
- Instrument critical operations: DB queries, cache calls, external APIs. Add useful attributes.
- Set sampling: Start with a small percentage; adjust based on traffic and budget.
- Correlate logs: Include trace_id and span_id in log lines for easy pivoting.
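The propagation steps above can be sketched as plain functions over a header dict; a real service would use its tracing library's propagator rather than hand-rolled helpers like these:

```python
import secrets

def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write a W3C traceparent into outgoing request/message headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def extract(headers: dict):
    """Read trace context from incoming headers, or start a new trace."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        return parts[1], parts[2]          # continue the existing trace
    return secrets.token_hex(16), None     # no valid context: new root trace

# Service A makes an outbound call:
outgoing = {}
inject(outgoing, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")

# Service B extracts on its inbound edge:
trace_id, parent_span_id = extract(outgoing)
print(trace_id)  # 4bf92f3577b34da6a3ce929d0e0e4736
```

If `extract` finds nothing valid, the service becomes a new trace root, which is exactly the "traces break into separate roots" symptom described under common mistakes.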
W3C Trace Context quick reference
traceparent format: version-trace-id-span-id-flags (2, 32, 16, and 2 lowercase hex characters, respectively)
- Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
- flags: bit 0 indicates sampling (01 = sampled, 00 = not sampled).
- tracestate: optional vendor-specific data; keep as-is unless you need it.
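A sketch of parsing this header (it validates field lengths but not that every character is lowercase hex, which a strict parser should also check):

```python
def parse_traceparent(value: str):
    """Split a traceparent header into its four fields; None if malformed."""
    parts = value.split("-")
    if len(parts) != 4:
        return None
    version, trace_id, span_id, flags = parts
    if (len(version), len(trace_id), len(span_id), len(flags)) != (2, 32, 16, 2):
        return None
    sampled = int(flags, 16) & 0x01 == 1   # flags bit 0 = sampled
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": sampled}

parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed["sampled"])  # True
```

Returning None on malformed input (rather than raising) mirrors the spec's guidance to fall back to starting a new trace when the header is invalid.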
Exercises
Hands-on practice: you can do these in a text editor or on paper.
Exercise 1 — Reconstruct a trace from logs
- Goal: Determine the critical path and slowest span.
- Checklist to complete:
- Identify trace_id from each log line.
- Group spans by parent-child.
- Compute durations to find the bottleneck.
Exercise 2 — Parse traceparent
- Goal: Extract version, trace-id, span-id, and flags.
- Checklist to complete:
- Split on dashes and validate lengths.
- Mark sampled if flags LSB is 1.
- Handle invalid input gracefully.
Common mistakes and how to self-check
- Missing context propagation: Symptoms: traces break into separate roots. Self-check: ensure outgoing calls include traceparent; inbound extracts it.
- Over-instrumentation: Too many spans make traces noisy. Self-check: every span should answer a diagnostic question; remove low-value spans.
- Attributes too generic: Lacking http.method/path or db.system makes filtering hard. Self-check: add key attributes that aid triage.
- Sampling hides issues: Rare errors not captured. Self-check: use rules to keep error/slow traces (tail-based or rules-based sampling).
- No log correlation: Hard to drill down. Self-check: ensure logs carry trace_id and span_id.
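The log-correlation fix from the last point can be sketched with Python's standard logging module; in a real service the IDs would come from the active span context rather than being passed in by hand, as this assumption-laden sketch does:

```python
import logging

class TraceContextFilter(logging.Filter):
    """Attach trace/span IDs to every log record so logs can be
    pivoted to the exact trace in your tracing backend."""
    def __init__(self, trace_id: str, span_id: str):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record):
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True  # never drops records, only annotates them

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))

log = logging.getLogger("checkout")
log.addHandler(handler)
log.addFilter(TraceContextFilter("4bf92f3577b34da6a3ce929d0e0e4736",
                                 "00f067aa0ba902b7"))
log.warning("payment retry started")
```

With the IDs in every line, a slow span found in the trace view leads straight to the matching log lines, and vice versa.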
Practical projects
- Instrument a two-service demo: Service A calls Service B and a database. Add spans with attributes. Verify the span tree and timings.
- Add async tracing: Publish an event from Service A; consume in Service C. Use span links to relate publish and process.
- Introduce a 300 ms artificial delay in DB calls; confirm the trace highlights this span as the bottleneck.
Learning path
- Start: Understand traces, spans, and W3C Trace Context.
- Implement: Instrument inbound requests and critical operations.
- Propagate: Ensure every outbound call/message carries context.
- Correlate: Add trace IDs to logs; practice cross-navigation.
- Optimize: Tune sampling; focus on attributes that matter.
- Advance: Use span links for async; explore tail-based sampling and redaction for privacy.
Mini challenge
You see three slow traces with total latency ~1.2 s. Spans show:
- API inbound: 20 ms
- Auth service call: 80 ms
- Catalog service call: 150 ms
- DB query (child of Catalog): 900 ms
Decide on two immediate actions and one longer-term improvement. Write them down and verify against the worked examples.
Next steps
- Add log correlation for high-severity endpoints.
- Introduce rule-based or tail-based sampling to retain error/slow traces.
- Create a runbook: "If latency > X, check spans Y and Z first."
- Expand to SLO monitoring by mapping slow spans to user-facing latency budgets.