Why this matters
As an API Engineer, you often debug issues that span multiple services. Distributed tracing lets you follow a single request's entire journey across gateways, services, databases, and queues. It helps you:
- Pinpoint where latency comes from (e.g., auth vs database vs downstream API).
- Reproduce and resolve errors faster with a single trace ID.
- Understand dependencies and impact when releasing changes.
- Measure SLOs for end-to-end operations, not just one service.
Who this is for
- API Engineers building or maintaining microservices.
- Backend devs adding observability to existing services.
- Platform/infra engineers setting up telemetry pipelines.
Prerequisites
- Basic HTTP knowledge (headers, request/response).
- Familiarity with microservices and async processing (queues, background jobs).
- Basic logs and metrics concepts.
Concept explained simply
A trace is the story of one request as it moves through your system. The story is made of spans. Each span is a timed unit of work (e.g., HTTP handler, DB query, RPC call). Spans form a tree with parent-child relationships. A trace has a unique trace_id; each span has its own span_id. A standard header (traceparent) carries this context across services so they can attach their spans to the same trace.
Key pieces:
- Trace: the whole journey.
- Span: one step in the journey; has start/end time, status, attributes.
- Propagation: passing trace context between services via headers.
- Sampling: choosing which traces to keep to control cost (e.g., 10% of requests, always keep errors).
- Attributes/tags: key-value details (http.method, db.system, user_id if allowed).
- Baggage: small set of business keys that travel with the trace (e.g., tenant_id).
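Baggage is easy to confuse with attributes: attributes stay on one span, while baggage travels with the context to downstream services. A minimal sketch using the OpenTelemetry Python API (one common choice; the tenant_id value is illustrative):

```python
from opentelemetry import baggage, context

# Attach a business key to the current context; propagators will carry
# it across service boundaries alongside the trace context.
ctx = baggage.set_baggage("tenant_id", "t-123")
token = context.attach(ctx)  # make this context current in this scope

print(baggage.get_baggage("tenant_id"))  # -> "t-123"

context.detach(token)  # restore the previous context
```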
Mental model
Think of a trace like a shipping package with a unique tracking number (trace_id). Every facility the package passes through creates a stamped record (span) with start/end time and notes. The tracking number always stays on the label (propagation). If you can see all stamps in order, you can spot delays, mistakes, or lost steps.
Key terms
- trace_id: Unique ID for the full request path.
- span_id: Unique ID for a single unit of work.
- parent_span_id: The span that triggered this span.
- traceparent header: Standard W3C header carrying version, trace_id, parent-id (the span_id of the caller), and trace flags.
- Sampling: head-based (decide at the start) or tail-based (decide after seeing the whole trace).
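To make the traceparent layout concrete, here is a minimal Python sketch that splits a header value into its four fields (the example value follows the format in the W3C spec):

```python
# traceparent format: version "-" trace-id "-" parent-id "-" trace-flags,
# all lowercase hex, fixed widths (2, 32, 16, 2 characters).
header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

version, trace_id, parent_span_id, trace_flags = header.split("-")

assert len(trace_id) == 32 and len(parent_span_id) == 16
print(trace_id)        # ID of the whole journey
print(parent_span_id)  # span_id of the caller
print(trace_flags)     # "01" = sampled bit set by the caller
```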
Worked examples
Example 1: Basic HTTP request across 3 services
Flow: Client -> API Gateway -> Service A -> Service B -> DB
- Gateway receives request, creates root span (gateway.handle_request).
- Gateway calls Service A, forwarding traceparent.
- Service A creates a child span (serviceA.handle) and calls Service B.
- Service B creates a child span (serviceB.handle) and a nested span for DB (db.query).
Trace: 4f...ab (trace_id)
- gateway.handle_request (span_id=11..aa, parent=null)
  - serviceA.handle (span_id=22..bb, parent=11..aa)
    - serviceB.handle (span_id=33..cc, parent=22..bb)
      - db.query (span_id=44..dd, parent=33..cc)
Outcome: You can see which hop added latency.
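A minimal sketch of what Service A does between the Gateway and Service B: keep the trace_id, mint a new span_id for serviceA.handle, and forward a traceparent whose parent-id is that new span. In practice an OpenTelemetry SDK or similar does this for you; the dicts here are stand-ins for real HTTP headers, and the span bookkeeping is elided:

```python
import secrets

def handle_at_service_a(incoming_headers: dict) -> dict:
    # Continue the caller's trace rather than starting a new one.
    version, trace_id, parent_span_id, flags = incoming_headers["traceparent"].split("-")

    my_span_id = secrets.token_hex(8)  # 16 hex chars for serviceA.handle
    # ... start span (name="serviceA.handle", parent=parent_span_id),
    #     do the work, end the span ...

    # Outbound call to Service B: same trace_id, parent-id = our span.
    return {"traceparent": f"{version}-{trace_id}-{my_span_id}-{flags}"}

outbound = handle_at_service_a(
    {"traceparent": "00-4f000000000000000000000000000000-11aa11aa11aa11aa-01"}
)
print(outbound["traceparent"])
```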
Example 2: Error propagation with status
Service B throws a 503 error to Service A. How to reflect it:
- serviceB.handle span status=ERROR, status_message="Upstream timeout".
- serviceA.handle span may be OK or ERROR based on retry/handling logic; it should at least add event="downstream_error".
- gateway.handle_request may return 503 and set status=ERROR.
In the trace UI, the red spans quickly highlight the failure point.
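A minimal sketch of how Service B could record the failure, using the OpenTelemetry Python API (other tracing SDKs have equivalents; call_upstream is a hypothetical stand-in for the real downstream call):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("serviceB")

def call_upstream():
    raise TimeoutError("no response within 2s")  # stand-in for a real call

with tracer.start_as_current_span("serviceB.handle") as span:
    try:
        call_upstream()
    except TimeoutError as exc:
        span.record_exception(exc)  # stores type/message as a span event
        span.set_status(Status(StatusCode.ERROR, "Upstream timeout"))
        # re-raise here, or translate to a 503 response for the caller
```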
Example 3: Asynchronous queue
Service A publishes a message to queue Q. A worker consumes and processes it later.
- Publisher creates span publish.message with attributes (queue.name=Q).
- Include trace context in message headers/metadata.
- Worker extracts the context and creates a child span process.message.
Trace: 9a..77
- serviceA.handle
  - publish.message
    - process.message (in worker process)
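A minimal sketch of both sides with the OpenTelemetry Python API; the message dict stands in for whatever headers/metadata your broker supports:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("example")

# Publisher (Service A): copy the current trace context into the message.
with tracer.start_as_current_span("publish.message") as span:
    span.set_attribute("queue.name", "Q")
    message = {"body": "...", "headers": {}}
    inject(message["headers"])  # writes traceparent into the carrier dict

# Worker (often another process): restore the context so that
# process.message joins the same trace as publish.message.
ctx = extract(message["headers"])
with tracer.start_as_current_span("process.message", context=ctx):
    ...  # actual processing work
```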
Example 4: Sampling strategy in practice
High-volume read endpoint (GET /products) and low-volume write endpoint (POST /checkout):
- Head-based: sample 5-10% on GET, 100% on POST.
- Always sample spans with status=ERROR.
- Optionally use tail-based for long or high-latency traces if your backend supports it.
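A minimal sketch of that head-based decision, made once when the root span starts (the rates and routes are illustrative, not prescriptive):

```python
import random

RATES = {
    ("GET", "/products"): 0.05,   # high-volume reads: keep 5%
    ("POST", "/checkout"): 1.0,   # low-volume writes: keep all
}

def should_sample(method: str, route: str) -> bool:
    return random.random() < RATES.get((method, route), 0.10)  # 10% default

# Note: a head-based sampler decides before the outcome is known, so
# "always keep errors" needs either tail-based sampling or an
# error-aware policy in your collector/backend.
```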
Implementation steps
- Pick a standard: Use W3C Trace Context (traceparent, tracestate).
- Create spans at the right places: inbound requests, outbound calls, DB operations, queue publish/consume.
- Propagate context: read incoming traceparent; write it on outgoing HTTP/queue messages.
- Record useful attributes: http.method, http.route, http.status_code, net.peer.name, db.system, db.statement (redact secrets), retry_count, tenant_id (if allowed).
- Handle errors: set status=ERROR; add exception type/message as attributes or event; keep stack traces in logs.
- Sampling: start with 5-10% head-based plus always-on for errors and critical endpoints; refine later.
- Validate: trace a known request; confirm all hops appear; check parent-child links and timing.
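Putting the propagation and attribute steps together, a minimal sketch of an inbound handler using the OpenTelemetry Python API; request and the commented-out outbound call are hypothetical stand-ins for your framework and HTTP client:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("serviceA")

def handle(request):
    ctx = extract(request.headers)  # continue the incoming trace, if any
    with tracer.start_as_current_span(
        "serviceA.handle", context=ctx, kind=SpanKind.SERVER
    ) as span:
        span.set_attribute("http.method", request.method)
        span.set_attribute("http.route", "/orders/{id}")

        outbound_headers = {}
        inject(outbound_headers)  # propagate context to Service B
        # http_client.get("http://service-b/...", headers=outbound_headers)
```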
Self-check after implementation
- Does each inbound request create exactly one root span?
- Do downstream calls create child spans with the correct parent?
- Are DB/queue operations captured with useful attributes?
- Do error responses mark spans as ERROR?
- Does the trace show realistic durations (no negative/zero unless expected)?
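One quick way to run these checks: log the current trace_id in each service for a known request and confirm it is identical at every hop. A minimal sketch with the OpenTelemetry Python API:

```python
from opentelemetry import trace

span = trace.get_current_span()
span_ctx = span.get_span_context()

# trace_id/span_id are ints in the API; hex-format them as they appear
# in headers and trace UIs (all zeros means "no active span").
print(f"trace_id={span_ctx.trace_id:032x} span_id={span_ctx.span_id:016x}")
```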
Exercises
Try these. The same exercises with solutions are also listed in the Exercises section below this lesson.
- ex1: Model a trace for a 3-service call
  Given a client request through Gateway -> Service A -> Service B -> DB, assign plausible span names and correct parent/child relationships, including a DB span.
- ex2: Propagate W3C traceparent
  Given an incoming traceparent at Service A, compute a new span_id for the outbound call to Service B and produce the header to send.
- ex3: Choose a sampling strategy
  Your read endpoints are very high-volume; writes are low-volume. Define a simple sampling plan that keeps costs predictable but preserves high-value data.
Exercise checklist
- Every span has one parent except the root.
- Span names reflect work (e.g., serviceB.handle, db.query).
- traceparent parent span id matches the caller's span id.
- Sampling plan clearly states rates and error rules.
Common mistakes and how to self-check
- Forgetting propagation: Requests appear as separate traces. Self-check: follow one user request and verify a single trace_id through logs/headers.
- Too many spans: Noise and cost. Self-check: sample a few traces; if they are hard to read, reduce low-value spans.
- Missing DB/queue spans: You lose bottleneck visibility. Self-check: does each external I/O create a span?
- Leaking sensitive data in attributes: Self-check: search attributes for tokens, passwords, PII; redact or avoid.
- Only sampling successes: Self-check: ensure errors are always kept.
Practical projects
- Add tracing to one endpoint across two services and a database. Verify latency breakdown.
- Instrument a queue: publish in Service A, consume in Worker B, preserving context.
- Create a simple service map by listing unique peer services observed in outbound spans.
Learning path
- Start: Distributed Tracing Basics (this lesson).
- Next: Metrics and RED/USE patterns to complement traces.
- Then: Alerting on SLOs using trace-derived metrics (e.g., latency percentiles, error rate).
- Advanced: Tail-based sampling, trace analysis, and service graphs.
Next steps
- Instrument one real endpoint end-to-end.
- Add error attributes and confirm failure visibility.
- Roll out a conservative sampling policy and review costs weekly.
Mini challenge
Pick a user action (e.g., checkout). Draw the expected trace tree with 6–10 spans, including at least one external call and one DB operation. Mark where you would place attributes and which spans you would always sample.
Practice & Test
Ready to check your understanding? Take the quick test below.