

Distributed Tracing Basics

Learn the basics of distributed tracing with explanations, worked examples, exercises, and a quick test, aimed at backend engineers.

Published: January 20, 2026 | Updated: January 20, 2026

Why this matters

As a Backend Engineer, you often debug issues that cross multiple services: API Gateway, services, queues, and databases. Distributed tracing shows the full path of a request through these components, with timing for each step. It helps you:

  • Find the slowest hop causing latency or timeouts.
  • See failed spans to locate errors quickly.
  • Correlate logs and metrics with the exact request.
  • Validate SLIs/SLOs by understanding where time is spent.

Who this is for

  • Backend and platform engineers building or maintaining microservices.
  • Engineers moving from logs-only monitoring to full observability.
  • Developers who need faster incident response and root-cause analysis.

Prerequisites

  • Basic HTTP knowledge (methods, headers, status codes).
  • Familiarity with services calling other services and using databases/queues.
  • Basic understanding of logs and metrics.

Concept explained simply

A trace is the story of one request. It is made of spans, where each span describes one operation (like an HTTP call, a DB query, or a message consume). Spans have timing, status, attributes (tags), and relationships (parent-child or links).

  • Trace ID: One ID for the whole request path.
  • Span ID: Unique ID for each operation.
  • Parent Span ID: Shows how spans connect into a tree.
  • Context propagation: Passing the trace context across process boundaries via headers (for example, W3C Trace Context).
  • Sampling: Choosing which traces to record to control cost.
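
These terms map naturally onto a small data structure. The sketch below uses hypothetical field names (not any particular SDK's API) to show the core of a span record:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

# A minimal, hypothetical span record. Real tracing SDKs add much more
# (events, links, resource info), but these fields cover the core terms.
@dataclass
class Span:
    trace_id: str                 # shared by every span in one trace
    span_id: str                  # unique per operation
    parent_id: Optional[str]      # None for the root span
    name: str
    start_ns: int = 0
    end_ns: int = 0
    attributes: dict = field(default_factory=dict)
    status: str = "OK"

    @property
    def duration_ms(self) -> float:
        return (self.end_ns - self.start_ns) / 1_000_000

def new_id(nbytes: int) -> str:
    # Hex string of nbytes random bytes (16 for trace IDs, 8 for span IDs).
    return uuid.uuid4().hex[: nbytes * 2]

trace_id = new_id(16)                        # 32 hex chars, one per trace
root = Span(trace_id, new_id(8), None, "http.server.request")
child = Span(trace_id, new_id(8), root.span_id, "db.query",
             attributes={"db.system": "postgresql"})

assert child.trace_id == root.trace_id       # same trace
assert child.parent_id == root.span_id       # parent-child relationship
```

Every span carries the same trace ID; the parent pointer is what lets a backend reassemble the tree.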

Mental model

Imagine a parcel traveling through sorting facilities. The parcel has one barcode (trace ID). Each facility scan is a span. Scans are linked to show the path and timing. If one facility is slow, you see the delay on that span.

Key terms you will use

  • W3C Trace Context: The standard traceparent and tracestate headers for propagating trace context across services.
  • Attributes/Tags: Key–value pairs on spans (e.g., http.method, db.system) for filtering and analysis.
  • Events/Logs-in-span: Timestamped notes inside a span (e.g., "retry started").
  • Links: Connect a span to related spans when there is no strict parent-child (common with async messaging).

Worked examples

Example 1 — Simple HTTP chain: API → User Service → DB

Flow: Client calls API → API calls User Service → User Service queries DB.

  • Trace ID: a single ID, e.g., 4bf92f3577b34da6a3ce929d0e0e4736
  • Spans:
    • Span A (API inbound): http.server.request, parent: none (root).
    • Span B (API → User Service): http.client.request, parent: A.
    • Span C (User Service inbound): http.server.request, parent: B.
    • Span D (User Service → DB): db.query, parent: C.
  • Diagnosing latency: If D takes 400 ms while others are small, the DB is the bottleneck.

Example header (W3C traceparent):

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
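
Constructing that header is a one-line string format. A minimal sketch (the helper name make_traceparent is our own) for the outgoing call made under Span B:

```python
# Sketch: building the traceparent header that Span B (the API's outgoing
# HTTP client span) attaches so the User Service can continue the trace.
def make_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

header = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736",
                          "00f067aa0ba902b7", sampled=True)
print(header)
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```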

Example 2 — Async messaging: Order Service → Queue → Billing Worker

Flow: Order Service publishes a message; Billing Worker consumes it later.

  • Order Service creates Span O (publish). It injects context into message headers.
  • Billing Worker creates Span B (consume). Parent-child may be ambiguous. Use a link from B to O if processing is decoupled.
  • This preserves end-to-end visibility without forcing a strict parent-child across async boundaries.
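
The publish/consume flow can be sketched with plain dictionaries standing in for real message headers; all names below are hypothetical, not a specific broker's API:

```python
# Publisher: inject the trace context into message headers so it survives
# the trip through the queue.
def publish(queue: list, body: dict, trace_id: str, span_o_id: str) -> None:
    message = {
        "headers": {"traceparent": f"00-{trace_id}-{span_o_id}-01"},
        "body": body,
    }
    queue.append(message)

# Consumer: extract the context and record a *link* back to the publish
# span instead of a strict parent-child relationship, since processing is
# decoupled in time from the publish.
def consume(queue: list) -> dict:
    message = queue.pop(0)
    parts = message["headers"]["traceparent"].split("-")
    span_b = {
        "name": "billing.process",
        "links": [{"trace_id": parts[1], "span_id": parts[2]}],
    }
    return span_b

queue: list = []
publish(queue, {"order_id": 42},
        "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
span = consume(queue)
assert span["links"][0]["span_id"] == "00f067aa0ba902b7"
```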

Example 3 — Error tracing and fast root cause

Flow: API → Inventory → DB. Inventory throws a timeout.

  • Span I (Inventory inbound) status = ERROR, attribute error.type = timeout.
  • Child Span Q (DB query) shows 499 ms duration and error = deadline_exceeded.
  • Root cause: DB query duration exceeded Inventory’s timeout budget.
  • Fix: Increase timeout or add an index to speed up the query.

Example 4 — Sampling impact

Head sampling at 10%: Only 1 in 10 requests record full traces. If you miss rare errors, consider tail-based sampling to keep slow/error traces while dropping the rest.
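
The difference between the two strategies can be sketched in a few lines (function names are ours). Head sampling decides up front from the trace ID alone; tail sampling decides after the trace completes, so it can keep every slow or failed trace:

```python
def head_sample(trace_id: str, rate: float = 0.10) -> bool:
    # Deterministic: every service derives the same decision from the
    # trace ID, so a trace is either fully kept or fully dropped.
    return int(trace_id[-8:], 16) / 0xFFFFFFFF < rate

def tail_sample(trace: dict, latency_budget_ms: float = 500) -> bool:
    # Keep anything that errored or blew the latency budget.
    return trace["has_error"] or trace["duration_ms"] > latency_budget_ms

assert tail_sample({"has_error": True, "duration_ms": 50})
assert tail_sample({"has_error": False, "duration_ms": 900})
assert not tail_sample({"has_error": False, "duration_ms": 120})
```

Tail sampling needs a collector that buffers whole traces before deciding, which is why it usually costs more to run.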

How to implement (step-by-step)

  1. Pick a standard: Use W3C Trace Context for cross-service propagation.
  2. Instrument inbound edges: Create a server span for each incoming request or message.
  3. Propagate context: Inject and extract trace context in outgoing HTTP calls and messages.
  4. Instrument critical operations: DB queries, cache calls, external APIs. Add useful attributes.
  5. Set sampling: Start with a small percentage; adjust based on traffic and budget.
  6. Correlate logs: Include trace_id and span_id in log lines for easy pivoting.
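
Step 6 can be sketched with Python's standard logging module; the field names trace_id and span_id are a convention, not a requirement:

```python
import logging

# Include trace_id and span_id on every log line so you can pivot from a
# log entry straight to its trace in your tracing backend.
logging.basicConfig(
    format="%(levelname)s trace=%(trace_id)s span=%(span_id)s %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("checkout")

# In a real service this context would come from the active span.
ctx = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
       "span_id": "00f067aa0ba902b7"}
log.info("charging card", extra=ctx)
# logs: INFO trace=4bf92f3577b34da6a3ce929d0e0e4736 span=00f067aa0ba902b7 charging card
```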

W3C Trace Context quick reference

traceparent format: version-trace-id-span-id-flags

00-<trace-id: 32 hex chars>-<span-id: 16 hex chars>-<flags: 2 hex chars>

Example:

00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  • flags: bit 0 indicates sampling (01 = sampled, 00 = not sampled).
  • tracestate: optional vendor-specific data; keep as-is unless you need it.
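
A small validating parser for this format might look like the following sketch; it also mirrors Exercise 2 below, so attempt that on your own first:

```python
import re

# Validate field lengths in one pass: 2, 32, 16, and 2 hex characters.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    m = TRACEPARENT.match(header)
    if m is None:
        return None                       # invalid: ignore it, start a new trace
    version, trace_id, span_id, flags = m.groups()
    sampled = int(flags, 16) & 0x01 == 1  # bit 0 = sampled
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": sampled}

parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
assert parsed is not None and parsed["sampled"] is True
assert parse_traceparent("not-a-header") is None
```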

Exercises

Hands-on practice. You can do these in a text editor or on paper. There is a quick test at the end.

Exercise 1 — Reconstruct a trace from logs


  • Goal: Determine the critical path and slowest span.
  • Checklist to complete:
    • Identify trace_id from each log line.
    • Group spans by parent-child.
    • Compute durations to find the bottleneck.

Exercise 2 — Parse traceparent


  • Goal: Extract version, trace-id, span-id, and flags.
  • Checklist to complete:
    • Split on dashes and validate lengths.
    • Mark sampled if flags LSB is 1.
    • Handle invalid input gracefully.

Common mistakes and how to self-check

  • Missing context propagation: Symptoms: traces break into separate roots. Self-check: ensure outgoing calls include traceparent; inbound extracts it.
  • Over-instrumentation: Too many spans make traces noisy. Self-check: every span should answer a diagnostic question; remove low-value spans.
  • Attributes too generic: Lacking http.method/path or db.system makes filtering hard. Self-check: add key attributes that aid triage.
  • Sampling hides issues: Rare errors not captured. Self-check: use rules to keep error/slow traces (tail-based or rules-based sampling).
  • No log correlation: Hard to drill down. Self-check: ensure logs carry trace_id and span_id.

Practical projects

  • Instrument a two-service demo: Service A calls Service B and a database. Add spans with attributes. Verify the span tree and timings.
  • Add async tracing: Publish an event from Service A; consume in Service C. Use span links to relate publish and process.
  • Introduce a 300 ms artificial delay in DB calls; confirm the trace highlights this span as the bottleneck.

Learning path

  1. Start: Understand traces, spans, and W3C Trace Context.
  2. Implement: Instrument inbound requests and critical operations.
  3. Propagate: Ensure every outbound call/message carries context.
  4. Correlate: Add trace IDs to logs; practice cross-navigation.
  5. Optimize: Tune sampling; focus on attributes that matter.
  6. Advance: Use span links for async; explore tail-based sampling and redaction for privacy.

Mini challenge

You see three slow traces with total latency ~1.2 s. Spans show:

  • API inbound: 20 ms
  • Auth service call: 80 ms
  • Catalog service call: 150 ms
  • DB query (child of Catalog): 900 ms

Decide on two immediate actions and one longer-term improvement. Write them down and verify against the worked examples.

Next steps

  • Add log correlation for high-severity endpoints.
  • Introduce rule-based or tail-based sampling to retain error/slow traces.
  • Create a runbook: "If latency > X, check spans Y and Z first."
  • Expand to SLO monitoring by mapping slow spans to user-facing latency budgets.

Practice Exercises


Exercise 1 — Instructions

Given these log lines, group spans into a trace and find the slowest span. Each line has fields: level, trace_id, span_id, parent_id, name, duration_ms.

INFO trace=4bf9 span=apiA parent=- name=http.server.request duration=22
INFO trace=4bf9 span=cliB parent=apiA name=http.client.request duration=35
INFO trace=4bf9 span=srvC parent=cliB name=http.server.request duration=40
INFO trace=4bf9 span=dbD  parent=srvC name=db.query duration=410
INFO trace=4bf9 span=cacheE parent=srvC name=cache.get duration=5

Tasks:

  • Identify the root span and draw the tree.
  • Compute the critical path total latency.
  • Name the bottleneck span.

Expected Output

Root: apiA. Tree: apiA -> cliB -> srvC -> {dbD, cacheE}. Critical path ~ apiA(22) + cliB(35) + srvC(40) + dbD(410) ≈ 507 ms. Bottleneck: dbD (db.query).
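
One way to check this expected output is a short sketch that rebuilds the tree from (parent, duration) pairs taken from the log lines:

```python
# span_id -> (parent_id, duration_ms), copied from the exercise log lines.
spans = {
    "apiA":   (None,   22),
    "cliB":   ("apiA", 35),
    "srvC":   ("cliB", 40),
    "dbD":    ("srvC", 410),
    "cacheE": ("srvC", 5),
}

root = next(sid for sid, (parent, _) in spans.items() if parent is None)
bottleneck = max(spans, key=lambda sid: spans[sid][1])

# Critical path: walk from the bottleneck back to the root, summing durations.
path, sid = [], bottleneck
while sid is not None:
    path.append(sid)
    sid = spans[sid][0]
total = sum(spans[s][1] for s in path)

assert root == "apiA"
assert bottleneck == "dbD"
assert path[::-1] == ["apiA", "cliB", "srvC", "dbD"]
assert total == 507
```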

Distributed Tracing Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

