Why this matters
As an API Engineer, you often debug issues that span multiple services. Distributed tracing lets you follow a single request's entire journey across gateways, services, databases, and queues. It helps you:
- Pinpoint where latency comes from (e.g., auth vs database vs downstream API).
- Reproduce and resolve errors faster with a single trace ID.
- Understand dependencies and impact when releasing changes.
- Measure SLOs for end-to-end operations, not just one service.
Who this is for
- API Engineers building or maintaining microservices.
- Backend devs adding observability to existing services.
- Platform/infra engineers setting up telemetry pipelines.
Prerequisites
- Basic HTTP knowledge (headers, request/response).
- Familiarity with microservices and async processing (queues, background jobs).
- Basic logs and metrics concepts.
Concept explained simply
A trace is the story of one request as it moves through your system. The story is made of spans. Each span is a timed unit of work (e.g., HTTP handler, DB query, RPC call). Spans form a tree with parent-child relationships. A trace has a unique trace_id; each span has its own span_id. A standard header (traceparent) carries this context across services so they can attach their spans to the same trace.
Key pieces:
- Trace: the whole journey.
- Span: one step in the journey; has start/end time, status, attributes.
- Propagation: passing trace context between services via headers.
- Sampling: choosing which traces to keep to control cost (e.g., 10% of requests, always keep errors).
- Attributes/tags: key-value details (http.method, db.system, user_id if allowed).
- Baggage: small set of business keys that travel with the trace (e.g., tenant_id).
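Baggage is easy to confuse with attributes: attributes stay on one span, while baggage travels with the context to downstream services. A minimal sketch using the OpenTelemetry Python API (one common choice; the tenant_id value is illustrative):

```python
from opentelemetry import baggage, context

# Attach a business key to the current context; propagators will carry
# it across service boundaries alongside the trace context.
ctx = baggage.set_baggage("tenant_id", "t-123")
token = context.attach(ctx)  # make this context current in this scope

print(baggage.get_baggage("tenant_id"))  # -> "t-123"

context.detach(token)  # restore the previous context
```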
Mental model
Think of a trace like a shipping package with a unique tracking number (trace_id). Every facility the package passes through creates a stamped record (span) with start/end time and notes. The tracking number always stays on the label (propagation). If you can see all stamps in order, you can spot delays, mistakes, or lost steps.
Key terms
- trace_id: Unique ID for the full request path.
- span_id: Unique ID for a single unit of work.
- parent_span_id: The span that triggered this span.
- traceparent header: Standard W3C header carrying version, trace_id, parent-id (the span_id of the caller), and trace flags.
- Sampling: head-based (decide at the start) or tail-based (decide after seeing the whole trace).
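To make the traceparent layout concrete, here is a minimal Python sketch that splits a header value into its four fields (the example value follows the format in the W3C spec):

```python
# traceparent format: version "-" trace-id "-" parent-id "-" trace-flags,
# all lowercase hex, fixed widths (2, 32, 16, 2 characters).
header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

version, trace_id, parent_span_id, trace_flags = header.split("-")

assert len(trace_id) == 32 and len(parent_span_id) == 16
print(trace_id)        # ID of the whole journey
print(parent_span_id)  # span_id of the caller
print(trace_flags)     # "01" = sampled bit set by the caller
```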
Worked examples
Example 1: Basic HTTP request across 3 services
Flow: Client -> API Gateway -> Service A -> Service B -> DB
- Gateway receives request, creates root span (gateway.handle_request).
- Gateway calls Service A, forwarding traceparent.
- Service A creates a child span (serviceA.handle) and calls Service B.
- Service B creates a child span (serviceB.handle) and a nested span for DB (db.query).
Trace: 4f...ab (trace_id)
- gateway.handle_request (span_id=11..aa, parent=null)
  - serviceA.handle (span_id=22..bb, parent=11..aa)
    - serviceB.handle (span_id=33..cc, parent=22..bb)
      - db.query (span_id=44..dd, parent=33..cc)
Outcome: You can see which hop added latency.
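A minimal sketch of what Service A does between the Gateway and Service B: keep the trace_id, mint a new span_id for serviceA.handle, and forward a traceparent whose parent-id is that new span. In practice an OpenTelemetry SDK or similar does this for you; the dicts here are stand-ins for real HTTP headers, and the span bookkeeping is elided:

```python
import secrets

def handle_at_service_a(incoming_headers: dict) -> dict:
    # Continue the caller's trace rather than starting a new one.
    version, trace_id, parent_span_id, flags = incoming_headers["traceparent"].split("-")

    my_span_id = secrets.token_hex(8)  # 16 hex chars for serviceA.handle
    # ... start span (name="serviceA.handle", parent=parent_span_id),
    #     do the work, end the span ...

    # Outbound call to Service B: same trace_id, parent-id = our span.
    return {"traceparent": f"{version}-{trace_id}-{my_span_id}-{flags}"}

outbound = handle_at_service_a(
    {"traceparent": "00-4f000000000000000000000000000000-11aa11aa11aa11aa-01"}
)
print(outbound["traceparent"])
```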
Example 2: Error propagation with status
Service B throws a 503 error to Service A. How to reflect it:
- serviceB.handle span status=ERROR, status_message="Upstream timeout".
- serviceA.handle span may be OK or ERROR based on retry/handling logic; it should at least add event="downstream_error".
- gateway.handle_request may return 503 and set status=ERROR.
In the trace UI, the red spans quickly highlight the failure point.
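A minimal sketch of how Service B could record the failure, using the OpenTelemetry Python API (other tracing SDKs have equivalents; call_upstream is a hypothetical stand-in for the real downstream call):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("serviceB")

def call_upstream():
    raise TimeoutError("no response within 2s")  # stand-in for a real call

with tracer.start_as_current_span("serviceB.handle") as span:
    try:
        call_upstream()
    except TimeoutError as exc:
        span.record_exception(exc)  # stores type/message as a span event
        span.set_status(Status(StatusCode.ERROR, "Upstream timeout"))
        # re-raise here, or translate to a 503 response for the caller
```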
Example 3: Asynchronous queue
Service A publishes a message to queue Q. A worker consumes and processes it later.
- Publisher creates span publish.message with attributes (queue.name=Q).
- Include trace context in message headers/metadata.
- Worker extracts the context and creates a child span process.message.
Trace: 9a..77
- serviceA.handle
  - publish.message
    - process.message (in worker process)
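A minimal sketch of both sides with the OpenTelemetry Python API; the message dict stands in for whatever headers/metadata your broker supports:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("example")

# Publisher (Service A): copy the current trace context into the message.
with tracer.start_as_current_span("publish.message") as span:
    span.set_attribute("queue.name", "Q")
    message = {"body": "...", "headers": {}}
    inject(message["headers"])  # writes traceparent into the carrier dict

# Worker (often another process): restore the context so that
# process.message joins the same trace as publish.message.
ctx = extract(message["headers"])
with tracer.start_as_current_span("process.message", context=ctx):
    ...  # actual processing work
```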
Example 4: Sampling strategy in practice
High-volume read endpoint (GET /products) and low-volume write endpoint (POST /checkout):
- Head-based: sample 5-10% on GET, 100% on POST.
- Always sample spans with status=ERROR.
- Optionally use tail-based for long or high-latency traces if your backend supports it.
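A minimal sketch of that head-based decision, made once when the root span starts (the rates and routes are illustrative, not prescriptive):

```python
import random

RATES = {
    ("GET", "/products"): 0.05,   # high-volume reads: keep 5%
    ("POST", "/checkout"): 1.0,   # low-volume writes: keep all
}

def should_sample(method: str, route: str) -> bool:
    return random.random() < RATES.get((method, route), 0.10)  # 10% default

# Note: a head-based sampler decides before the outcome is known, so
# "always keep errors" needs either tail-based sampling or an
# error-aware policy in your collector/backend.
```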
Implementation steps
- Pick a standard: Use W3C Trace Context (traceparent, tracestate).
- Create spans at the right places: inbound requests, outbound calls, DB operations, queue publish/consume.
- Propagate context: read incoming traceparent; write it on outgoing HTTP/queue messages.
- Record useful attributes: http.method, http.route, http.status_code, net.peer.name, db.system, db.statement (redact secrets), retry_count, tenant_id (if allowed).
- Handle errors: set status=ERROR; add exception type/message as attributes or event; keep stack traces in logs.
- Sampling: start with 5-10% head-based plus always-on for errors and critical endpoints; refine later.
- Validate: trace a known request; confirm all hops appear; check parent-child links and timing.
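Putting the propagation and attribute steps together, a minimal sketch of an inbound handler using the OpenTelemetry Python API; request and the commented-out outbound call are hypothetical stand-ins for your framework and HTTP client:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("serviceA")

def handle(request):
    ctx = extract(request.headers)  # continue the incoming trace, if any
    with tracer.start_as_current_span(
        "serviceA.handle", context=ctx, kind=SpanKind.SERVER
    ) as span:
        span.set_attribute("http.method", request.method)
        span.set_attribute("http.route", "/orders/{id}")

        outbound_headers = {}
        inject(outbound_headers)  # propagate context to Service B
        # http_client.get("http://service-b/...", headers=outbound_headers)
```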
Self-check after implementation
- Does each inbound request create exactly one root span?
- Do downstream calls create child spans with the correct parent?
- Are DB/queue operations captured with useful attributes?
- Do error responses mark spans as ERROR?
- Does the trace show realistic durations (no negative/zero unless expected)?
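One quick way to run these checks: log the current trace_id in each service for a known request and confirm it is identical at every hop. A minimal sketch with the OpenTelemetry Python API:

```python
from opentelemetry import trace

span = trace.get_current_span()
span_ctx = span.get_span_context()

# trace_id/span_id are ints in the API; hex-format them as they appear
# in headers and trace UIs (all zeros means "no active span").
print(f"trace_id={span_ctx.trace_id:032x} span_id={span_ctx.span_id:016x}")
```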
Exercises
Try these. The same exercises with solutions are also listed in the Exercises section below this lesson.
- ex1: Model a trace for a 3-service call
  Given a client request through Gateway -> Service A -> Service B -> DB, assign plausible span names and correct parent/child relationships, including a DB span.
- ex2: Propagate W3C traceparent
  Given an incoming traceparent at Service A, compute a new span_id for the outbound call to Service B and produce the header to send.
- ex3: Choose a sampling strategy
  Your read endpoints are very high-volume; writes are low-volume. Define a simple sampling plan that keeps costs predictable but preserves high-value data.
Exercise checklist
- Every span has one parent except the root.
- Span names reflect work (e.g., serviceB.handle, db.query).
- traceparent parent span id matches the caller's span id.
- Sampling plan clearly states rates and error rules.
Common mistakes and how to self-check
- Forgetting propagation: Requests appear as separate traces. Self-check: follow one user request and verify a single trace_id through logs/headers.
- Too many spans: Noise and cost. Self-check: sample a few traces; if they are hard to read, reduce low-value spans.
- Missing DB/queue spans: You lose bottleneck visibility. Self-check: does each external I/O create a span?
- Leaking sensitive data in attributes: Self-check: search attributes for tokens, passwords, PII; redact or avoid.
- Only sampling successes: Self-check: ensure errors are always kept.
Practical projects
- Add tracing to one endpoint across two services and a database. Verify latency breakdown.
- Instrument a queue: publish in Service A, consume in Worker B, preserving context.
- Create a simple service map by listing unique peer services observed in outbound spans.
Learning path
- Start: Distributed Tracing Basics (this lesson).
- Next: Metrics and RED/USE patterns to complement traces.
- Then: Alerting on SLOs using trace-derived metrics (e.g., latency percentiles, error rate).
- Advanced: Tail-based sampling, trace analysis, and service graphs.
Next steps
- Instrument one real endpoint end-to-end.
- Add error attributes and confirm failure visibility.
- Roll out a conservative sampling policy and review costs weekly.
Mini challenge
Pick a user action (e.g., checkout). Draw the expected trace tree with 6–10 spans, including at least one external call and one DB operation. Mark where you would place attributes and which spans you would always sample.
Practice & Test
Ready to check your understanding? Take the quick test below.