Why this matters
As a backend engineer, you often debug issues that cross multiple components: an API gateway, services, queues, and databases. Distributed tracing shows the full path of a request through these components, with timing for each step. It helps you:
- Find the slowest hop causing latency or timeouts.
- See failed spans to locate errors quickly.
- Correlate logs and metrics with the exact request.
- Validate SLIs/SLOs by understanding where time is spent.
Who this is for
- Backend and platform engineers building or maintaining microservices.
- Engineers moving from logs-only monitoring to full observability.
- Developers who need faster incident response and root-cause analysis.
Prerequisites
- Basic HTTP knowledge (methods, headers, status codes).
- Familiarity with services calling other services and using databases/queues.
- Basic understanding of logs and metrics.
Concept explained simply
A trace is the story of one request. It is made of spans, where each span describes one operation (like an HTTP call, a DB query, or a message consume). Spans have timing, status, attributes (tags), and relationships (parent-child or links).
- Trace ID: One ID for the whole request path.
- Span ID: Unique ID for each operation.
- Parent Span ID: Shows how spans connect into a tree.
- Context propagation: Passing the trace context across process boundaries via headers (for example, W3C Trace Context).
- Sampling: Choosing which traces to record to control cost.
Mental model
Imagine a parcel traveling through sorting facilities. The parcel has one barcode (trace ID). Each facility scan is a span. Scans are linked to show the path and timing. If one facility is slow, you see the delay on that span.
Key terms you will use
- W3C Trace Context: The standard traceparent and tracestate headers for propagating trace context across services.
- Attributes/Tags: Key–value pairs on spans (e.g., http.method, db.system) for filtering and analysis.
- Events/Logs-in-span: Timestamped notes inside a span (e.g., "retry started").
- Links: Connect a span to related spans when there is no strict parent-child (common with async messaging).
Worked examples
Example 1 — Simple HTTP chain: API → User Service → DB
Flow: Client calls API → API calls User Service → User Service queries DB.
- Trace ID: a single ID, e.g., 4bf92f3577b34da6a3ce929d0e0e4736
- Spans:
  - Span A (API inbound): http.server.request, parent: none (root).
  - Span B (API → User Service): http.client.request, parent: A.
  - Span C (User Service inbound): http.server.request, parent: B.
  - Span D (User Service → DB): db.query, parent: C.
- Diagnosing latency: If D takes 400 ms while others are small, the DB is the bottleneck.
Example header (W3C traceparent):
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
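The bottleneck reasoning above can be sketched in a few lines. The tuple format and durations below are made up for illustration; real spans would come from your tracing SDK or backend:

```python
# Toy span records: (span_name, parent_name, duration_ms).
# Durations are illustrative, matching the example above.
spans = [
    ("A: API inbound",          None,                     460),
    ("B: API -> User Service",  "A: API inbound",         440),
    ("C: User Service inbound", "B: API -> User Service", 430),
    ("D: User Service -> DB",   "C: User Service inbound", 400),
]

def self_time(span, all_spans):
    """Duration minus time spent in direct children."""
    name, _, duration = span
    children = sum(d for _, parent, d in all_spans if parent == name)
    return duration - children

# The span with the largest self-time is the bottleneck.
bottleneck = max(spans, key=lambda s: self_time(s, spans))
print(bottleneck[0])  # D: the DB query dominates the trace
```

Self-time (duration minus children) matters: span A is 460 ms long, but only 20 ms of that is its own work.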
Example 2 — Async messaging: Order Service → Queue → Billing Worker
Flow: Order Service publishes a message; Billing Worker consumes it later.
- Order Service creates Span O (publish). It injects context into message headers.
- Billing Worker creates Span B (consume). Parent-child may be ambiguous. Use a link from B to O if processing is decoupled.
- This preserves end-to-end visibility without forcing a strict parent-child across async boundaries.
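A minimal sketch of this pattern, using plain dataclasses rather than a real tracing SDK (OpenTelemetry exposes the same idea as span links); all names and helpers here are illustrative:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import secrets

@dataclass
class SpanContext:
    trace_id: str
    span_id: str

@dataclass
class Span:
    name: str
    context: SpanContext
    parent: Optional[SpanContext] = None
    links: List[SpanContext] = field(default_factory=list)

def new_context() -> SpanContext:
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))

# Order Service: Span O wraps the publish; its context is injected
# into the message headers as a traceparent.
span_o = Span("billing.publish", new_context())
message_headers = {
    "traceparent": f"00-{span_o.context.trace_id}-{span_o.context.span_id}-01"
}

# Billing Worker, possibly much later: start a fresh trace for the
# consume, but LINK back to the publish span instead of parenting to it.
_, trace_id, span_id, _ = message_headers["traceparent"].split("-")
span_b = Span("billing.consume", new_context(),
              links=[SpanContext(trace_id, span_id)])
```

The consume span lives in its own trace, yet the link still lets a tracing backend navigate from the processing back to the originating publish.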
Example 3 — Error tracing and fast root cause
Flow: API → Inventory → DB. Inventory throws a timeout.
- Span I (Inventory inbound) status = ERROR, attribute error.type = timeout.
- Child Span Q (DB query) shows 499 ms duration and error = deadline_exceeded.
- Root cause: DB query duration exceeded Inventory’s timeout budget.
- Fix: Increase timeout or add an index to speed up the query.
Example 4 — Sampling impact
Head sampling at 10%: Only 1 in 10 requests record full traces. If you miss rare errors, consider tail-based sampling to keep slow/error traces while dropping the rest.
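The two strategies can be sketched as follows; the trace-id hashing trick and the 500 ms tail threshold are illustrative assumptions, not a prescribed configuration:

```python
def head_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministic head sampling: every service computing this on the
    same trace-id reaches the same keep/drop decision."""
    # Treat the last 8 hex digits of the trace-id as a uniform-ish value.
    return int(trace_id[-8:], 16) / 0xFFFFFFFF < rate

def tail_keep(trace: dict) -> bool:
    """Tail-based rule: keep error or slow traces, drop the rest."""
    return trace["error"] or trace["duration_ms"] > 500

traces = [
    {"error": False, "duration_ms": 120},  # healthy and fast: dropped
    {"error": True,  "duration_ms": 90},   # rare error: kept
    {"error": False, "duration_ms": 900},  # slow: kept
]
kept = [t for t in traces if tail_keep(t)]
print(len(kept))  # 2
```

Note the trade-off: head sampling decides before the outcome is known, so it is cheap but blind; tail sampling sees the full trace first, at the cost of buffering.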
How to implement (step-by-step)
- Pick a standard: Use W3C Trace Context for cross-service propagation.
- Instrument inbound edges: Create a server span for each incoming request or message.
- Propagate context: Inject and extract trace context in outgoing HTTP calls and messages.
- Instrument critical operations: DB queries, cache calls, external APIs. Add useful attributes.
- Set sampling: Start with a small percentage; adjust based on traffic and budget.
- Correlate logs: Include trace_id and span_id in log lines for easy pivoting.
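The propagation steps above can be sketched as plain functions over a header dict; a real service would use its tracing library's propagator rather than hand-rolled helpers like these:

```python
import secrets

def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write a W3C traceparent into outgoing request/message headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def extract(headers: dict):
    """Read trace context from incoming headers, or start a new trace."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        return parts[1], parts[2]          # continue the existing trace
    return secrets.token_hex(16), None     # no valid context: new root trace

# Service A makes an outbound call:
outgoing = {}
inject(outgoing, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")

# Service B extracts on its inbound edge:
trace_id, parent_span_id = extract(outgoing)
print(trace_id)  # 4bf92f3577b34da6a3ce929d0e0e4736
```

If `extract` finds nothing valid, the service becomes a new trace root, which is exactly the "traces break into separate roots" symptom described under common mistakes.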
W3C Trace Context quick reference
traceparent format: version-trace-id-span-id-flags (2, 32, 16, and 2 lowercase hex characters, respectively)
- Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
- flags: bit 0 indicates sampling (01 = sampled, 00 = not sampled).
- tracestate: optional vendor-specific data; keep as-is unless you need it.
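A sketch of parsing this header (it validates field lengths but not that every character is lowercase hex, which a strict parser should also check):

```python
def parse_traceparent(value: str):
    """Split a traceparent header into its four fields; None if malformed."""
    parts = value.split("-")
    if len(parts) != 4:
        return None
    version, trace_id, span_id, flags = parts
    if (len(version), len(trace_id), len(span_id), len(flags)) != (2, 32, 16, 2):
        return None
    sampled = int(flags, 16) & 0x01 == 1   # flags bit 0 = sampled
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": sampled}

parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed["sampled"])  # True
```

Returning None on malformed input (rather than raising) mirrors the spec's guidance to fall back to starting a new trace when the header is invalid.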
Exercises
Hands-on practice: you can do these in a text editor or on paper.
Exercise 1 — Reconstruct a trace from logs
- Goal: Determine the critical path and slowest span.
- Checklist to complete:
- Identify trace_id from each log line.
- Group spans by parent-child.
- Compute durations to find the bottleneck.
Exercise 2 — Parse traceparent
- Goal: Extract version, trace-id, span-id, and flags.
- Checklist to complete:
- Split on dashes and validate lengths.
- Mark sampled if flags LSB is 1.
- Handle invalid input gracefully.
Common mistakes and how to self-check
- Missing context propagation: Symptoms: traces break into separate roots. Self-check: ensure outgoing calls include traceparent; inbound extracts it.
- Over-instrumentation: Too many spans make traces noisy. Self-check: every span should answer a diagnostic question; remove low-value spans.
- Attributes too generic: Lacking http.method/path or db.system makes filtering hard. Self-check: add key attributes that aid triage.
- Sampling hides issues: Rare errors not captured. Self-check: use rules to keep error/slow traces (tail-based or rules-based sampling).
- No log correlation: Hard to drill down. Self-check: ensure logs carry trace_id and span_id.
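The log-correlation fix from the last point can be sketched with Python's standard logging module; in a real service the IDs would come from the active span context rather than being passed in by hand, as this assumption-laden sketch does:

```python
import logging

class TraceContextFilter(logging.Filter):
    """Attach trace/span IDs to every log record so logs can be
    pivoted to the exact trace in your tracing backend."""
    def __init__(self, trace_id: str, span_id: str):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record):
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True  # never drops records, only annotates them

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))

log = logging.getLogger("checkout")
log.addHandler(handler)
log.addFilter(TraceContextFilter("4bf92f3577b34da6a3ce929d0e0e4736",
                                 "00f067aa0ba902b7"))
log.warning("payment retry started")
```

With the IDs in every line, a slow span found in the trace view leads straight to the matching log lines, and vice versa.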
Practical projects
- Instrument a two-service demo: Service A calls Service B and a database. Add spans with attributes. Verify the span tree and timings.
- Add async tracing: Publish an event from Service A; consume in Service C. Use span links to relate publish and process.
- Introduce a 300 ms artificial delay in DB calls; confirm the trace highlights this span as the bottleneck.
Learning path
- Start: Understand traces, spans, and W3C Trace Context.
- Implement: Instrument inbound requests and critical operations.
- Propagate: Ensure every outbound call/message carries context.
- Correlate: Add trace IDs to logs; practice cross-navigation.
- Optimize: Tune sampling; focus on attributes that matter.
- Advance: Use span links for async; explore tail-based sampling and redaction for privacy.
Mini challenge
You see three slow traces with total latency ~1.2 s. Spans show:
- API inbound: 20 ms
- Auth service call: 80 ms
- Catalog service call: 150 ms
- DB query (child of Catalog): 900 ms
Decide on two immediate actions and one longer-term improvement. Write them down and verify against the worked examples.
Next steps
- Add log correlation for high-severity endpoints.
- Introduce rule-based or tail-based sampling to retain error/slow traces.
- Create a runbook: "If latency > X, check spans Y and Z first."
- Expand to SLO monitoring by mapping slow spans to user-facing latency budgets.