Why this matters
Platform Engineers often need to standardize how services emit telemetry so teams can debug reliably and operate at scale. Instrumentation libraries and SDKs are the fastest, safest way to capture traces, metrics, and logs consistently across languages and frameworks.
- Enable production-safe visibility with sampling and stable resource attributes.
- Help teams adopt automatic instrumentation quickly, then add manual spans where it counts.
- Standardize exporters and context propagation so requests are traceable across microservices.
Concept explained simply
Instrumentation libraries and SDKs are building blocks you add to code or runtimes to collect telemetry:
- Traces: the journey of a request; made of spans (units of work).
- Metrics: numeric time series (counters, gauges, histograms).
- Logs: structured events with context (e.g., trace_id, span_id).
Automatic instrumentation hooks into frameworks (HTTP, DB) to emit spans/metrics without touching your code. Manual instrumentation adds custom spans, attributes, and metrics around critical business logic.
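For example, here is a minimal Python sketch of the two approaches side by side. The package and instrumentor are real OpenTelemetry components, the span and attribute names are illustrative, and it assumes a tracer provider has been configured as in the worked examples below.
# Auto: hook the requests library so every outbound HTTP call emits a client span.
from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()
# Manual: wrap the business logic you care about in a custom span with attributes.
from opentelemetry import trace
tracer = trace.get_tracer('checkout')
with tracer.start_as_current_span('cart.price') as span:
    span.set_attribute('cart.items', 3)
    # ... pricing logic ...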
Mental model
Think of telemetry as a labeled package moving through a delivery network:
- Resource: the return address (service.name, version, env).
- Context propagation: the tracking number that follows the package (traceparent).
- SDK: the packing machine that shapes and labels the package.
- Exporter: the truck that sends it to your observability backend.
Core building blocks
- Resource attributes: service.name, service.version, deployment.env.
- Span: name, start/end time, attributes, status, events, links.
- Trace: a tree of spans sharing a trace_id.
- Context propagation: W3C Trace Context or B3; inject/extract into headers (see the sketch after this list).
- Metric types: counter (monotonic), updowncounter, gauge (observed), histogram (latency, sizes).
- Logs: structured, include trace_id/span_id for correlation.
- Exporters: OTLP, console, file; batch processors for performance.
- Sampling: head-based (decided when a span is created) or tail-based (decided after whole traces are collected); balance cost against detail.
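To make the context-propagation building block concrete, here is a minimal Python sketch using the W3C Trace Context propagator the SDK installs by default; incoming_headers is a placeholder for whatever header dict your framework exposes.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
# Caller: copy the current trace context into outgoing headers (adds 'traceparent').
headers = {}
inject(headers)
# ... send the HTTP request with these headers ...
# Callee: rebuild the context from the incoming headers and parent new spans under it.
ctx = extract(incoming_headers)  # incoming_headers: the request's header dict (placeholder)
tracer = trace.get_tracer('orders')
with tracer.start_as_current_span('orders.handle', context=ctx):
    pass  # spans created here share the caller's trace_id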
Tip: naming spans and metrics
- Span names: verb + resource (e.g., GET /orders, db.query SELECT).
- Attribute keys: stable and low-cardinality (http.route, db.system, user.tier).
- Metric names: nouns with units (requests.duration.ms, queue.depth).
Worked examples
Example 1: Python API with auto + manual tracing
# Install: pip install opentelemetry-sdk opentelemetry-api opentelemetry-instrumentation-requests  (ConsoleSpanExporter ships with opentelemetry-sdk)
import time
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import Status, StatusCode
resource = Resource.create({
    'service.name': 'checkout-api',
    'service.version': '1.2.0',
    'deployment.environment': 'dev'
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer('checkout-api')
# Manual span around a payment step
with tracer.start_as_current_span('payment.authorize') as span:
    span.set_attribute('payment.provider', 'demo')
    time.sleep(0.05)
    span.set_status(Status(StatusCode.OK))
    print('Authorized')
What you get: clear spans with attributes, correlated under one trace when called within a propagated context.
Example 2: Node.js worker metrics with histogram
// Install: npm i @opentelemetry/api @opentelemetry/sdk-metrics @opentelemetry/resources (add @opentelemetry/exporter-metrics-otlp-proto when you switch to OTLP)
const { MeterProvider, PeriodicExportingMetricReader, ConsoleMetricExporter } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const provider = new MeterProvider({
  resource: new Resource({ 'service.name': 'billing-worker', 'deployment.environment': 'dev' })
});
provider.addMetricReader(new PeriodicExportingMetricReader({ exporter: new ConsoleMetricExporter(), exportIntervalMillis: 2000 }));
const meter = provider.getMeter('billing');
const duration = meter.createHistogram('jobs.duration.ms', { description: 'Job execution time' });
function simulateJob() {
  const ms = Math.floor(Math.random() * 120) + 30;
  duration.record(ms, { job_type: 'invoice' });
}
setInterval(simulateJob, 300);
What you get: periodic histogram exports summarizing job durations, ready to alert on latency shifts.
Example 3: Java manual span with log correlation
// Gradle deps (conceptual): opentelemetry-sdk, opentelemetry-api, exporter-logging
// Imports (conceptual): io.opentelemetry.api.GlobalOpenTelemetry, io.opentelemetry.api.trace.Tracer,
// io.opentelemetry.api.trace.Span, io.opentelemetry.api.trace.StatusCode, io.opentelemetry.context.Scope
Tracer tracer = GlobalOpenTelemetry.getTracer("orders");
Span span = tracer.spanBuilder("orders.recalculate")
    .setAttribute("tenant", "gold")
    .startSpan();
try (Scope s = span.makeCurrent()) {
    // Include trace ids in logs
    String traceId = span.getSpanContext().getTraceId();
    String spanId = span.getSpanContext().getSpanId();
    System.out.println("trace_id=" + traceId + " span_id=" + spanId + " msg=Recalculation started");
    // work ...
    span.setStatus(StatusCode.OK);
} catch (Exception e) {
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
} finally {
    span.end();
}
What you get: logs that carry trace and span IDs, enabling click-through correlation in your backend.
Choosing libraries and SDKs
- Support for your language/runtime and popular frameworks in your org.
- Exporters you need (OTLP recommended), plus console/file for local use.
- Performance: batch processors, async I/O, metrics readers.
- Config via env vars for consistent deployments.
- Stability and semantic conventions that match your standards.
Safe defaults to start
- OTLP exporter over gRPC or HTTP.
- BatchSpanProcessor with modest queue size.
- Head sampling at 5–10% for web traffic; 100% in non-prod (see the sketch after this list).
- Resource attributes: service.name, service.version, deployment.environment.
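A minimal Python sketch of these defaults, using the SDK's built-in head samplers; the service name and values are illustrative, and the OTEL_* variables in the comments are the standard OpenTelemetry environment variables for the same settings.
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# Sample ~10% of new traces, but always honor the upstream parent's decision.
provider = TracerProvider(
    resource=Resource.create({'service.name': 'my-service', 'service.version': '1.0.0', 'deployment.environment': 'prod'}),
    sampler=ParentBased(TraceIdRatioBased(0.1)),
)
# Equivalent configuration without code changes:
#   OTEL_SERVICE_NAME=my-service
#   OTEL_RESOURCE_ATTRIBUTES=service.version=1.0.0,deployment.environment=prod
#   OTEL_TRACES_SAMPLER=parentbased_traceidratio
#   OTEL_TRACES_SAMPLER_ARG=0.1
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317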
Implementation steps
- Define resource attributes: decide on service.name, service.version, and deployment.environment, and keep them consistent across services.
- Enable auto-instrumentation: attach language agents or initialize framework instrumentations for HTTP, DB, and messaging.
- Add manual spans/metrics: wrap critical paths (checkout, payment, cache miss) with spans and attributes.
- Configure exporters: start with a console exporter locally, then switch to OTLP for shared backends.
- Set sampling: pick head sampling rates; reserve tail sampling for advanced backends.
- Propagate context: use W3C Trace Context across services; test with cross-service requests.
- Harden and ship: add timeouts/retries on exporters, and confirm a graceful shutdown flush (see the sketch after this list).
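A minimal Python sketch of the exporter and shutdown steps, assuming the opentelemetry-exporter-otlp package and a collector listening on the default OTLP endpoint.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
provider = TracerProvider()
# The timeout keeps a slow collector from stalling the background export thread.
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(timeout=10)))
trace.set_tracer_provider(provider)
# ... serve traffic ...
# On graceful shutdown, flush any queued spans before the process exits.
provider.shutdown()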
Exercises
Complete these hands-on tasks; a solution follows each exercise. Aim to run everything locally with console exporters.
- [ex1] Add basic tracing with an SDK and auto-instrumentation.
- [ex2] Emit a custom histogram metric with exemplars or attributes.
[ex1] Add basic tracing with an SDK and auto-instrumentation
Goal: Produce a trace for an HTTP request with a custom child span and attributes.
- Install Python packages: opentelemetry-sdk, opentelemetry-api, opentelemetry-instrumentation-requests (the console exporter ships with opentelemetry-sdk).
- Create a simple handler that calls an external URL or sleeps.
- Initialize a TracerProvider with resource attributes and a BatchSpanProcessor + ConsoleSpanExporter.
- Create a parent span named 'http.request' and a child span 'work.step'.
- Print the trace_id to stdout.
Expected output: the console shows at least two spans in one trace, with service.name=ex1-service and the child span carrying the attribute step='parse'.
Solution
# pip install opentelemetry-sdk opentelemetry-api opentelemetry-instrumentation-requests
import time
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
resource = Resource.create({
    'service.name': 'ex1-service',
    'deployment.environment': 'dev'
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer('ex1')
with tracer.start_as_current_span('http.request') as parent:
    parent.set_attribute('http.route', '/demo')
    with tracer.start_as_current_span('work.step') as child:
        child.set_attribute('step', 'parse')
        time.sleep(0.02)
    tid = parent.get_span_context().trace_id
    print('trace_id=', format(tid, '032x'))
[ex2] Emit a custom histogram metric with attributes
Goal: Record request latency and export it to stdout.
- In Node.js, install @opentelemetry/sdk-metrics and @opentelemetry/resources (ConsoleMetricExporter ships with the metrics SDK).
- Create a MeterProvider with resource attributes.
- Create a histogram 'requests.duration.ms'.
- Record three values with attribute http.route='/checkout'.
- Verify console output shows histogram data with attributes.
Expected output: printed metric with name requests.duration.ms and attributes http.route=/checkout.
Solution
// npm i @opentelemetry/sdk-metrics @opentelemetry/resources
const { MeterProvider, PeriodicExportingMetricReader, ConsoleMetricExporter } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const provider = new MeterProvider({ resource: new Resource({ 'service.name': 'ex2-service', 'deployment.environment': 'dev' }) });
provider.addMetricReader(new PeriodicExportingMetricReader({ exporter: new ConsoleMetricExporter(), exportIntervalMillis: 1000 }));
const meter = provider.getMeter('ex2');
const h = meter.createHistogram('requests.duration.ms');
[45, 80, 120].forEach(v => h.record(v, { 'http.route': '/checkout' }));
setTimeout(() => provider.shutdown().then(() => process.exit(0)), 1500);
Common mistakes and how to self-check
- High-cardinality attributes (user_id, full URL query). Fix: keep only stable, low-cardinality keys.
- Sampling misconfiguration (0% or 100% by accident). Fix: print current sampling rate at startup; add a health route that reports it.
- Exporter blocking request threads. Fix: use batch/async exporters and timeouts.
- Uncorrelated logs. Fix: inject trace_id/span_id into log context from the current span (see the sketch after this list).
- Duplicate spans from overlapping auto + manual instrumentation. Fix: prefer manual around business logic and disable overlapping auto hooks if needed.
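A minimal Python sketch of the log-correlation fix, pulling IDs from the current span. The field names are a common convention rather than anything the SDK mandates, and the opentelemetry-instrumentation-logging package can inject similar fields automatically if you prefer.
import logging
from opentelemetry import trace
logger = logging.getLogger('checkout')
def log_with_trace(msg):
    ctx = trace.get_current_span().get_span_context()
    # Add the ids as extra fields; include %(trace_id)s and %(span_id)s in your log format to print them.
    logger.info(msg, extra={
        'trace_id': format(ctx.trace_id, '032x'),
        'span_id': format(ctx.span_id, '016x'),
    })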
Self-check checklist
- Every service reports service.name, service.version, deployment.environment.
- A cross-service call preserves the same trace_id end-to-end.
- Metrics export without pausing request handling.
- Logs include trace_id and span_id for sampled requests.
- No attribute values explode in cardinality over time.
Practical projects
- Monolith to microservices trace map: instrument two services and ensure a single trace flows across the HTTP boundary. Deliverable: screenshot or description of a trace tree with at least 5 spans.
- Latency SLO dashboard: emit requests.duration.ms histogram; compute p95 and alert when over threshold. Deliverable: metric output and alert condition YAML or description.
- Log correlation rollout: inject trace ids into app logs across 2 languages. Deliverable: sample logs that share the same trace_id as a trace export.
Mini challenge
You see many spans named 'GET /{id}' with very high cardinality in an attribute 'user_id'. How do you fix this without losing useful detail?
Hint/solution
- Keep span name at route template (GET /items/{id}), not full path.
- Remove user_id; replace with user.tier or auth.method (low cardinality).
- If needed, add user_id only as an event on error spans, not as an attribute on all spans (see the sketch below).
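A small Python sketch of that last point; fetch_item, item_id, and user_id are hypothetical stand-ins for your own code.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer('items')
with tracer.start_as_current_span('GET /items/{id}') as span:
    span.set_attribute('user.tier', 'gold')  # low-cardinality attribute, safe on every span
    try:
        fetch_item(item_id)  # hypothetical business call
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR))
        # High-cardinality detail only on the failing span, recorded as an event.
        span.add_event('lookup.failed', {'user_id': user_id})
        raise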
Learning path
- Before: Observability fundamentals (traces, metrics, logs), W3C Trace Context basics.
- Now: Instrumentation libraries and SDKs (this lesson).
- Next: Collectors/agents, sampling strategies, semantic conventions, alerting and SLOs.
Who this is for
- Platform Engineers standardizing telemetry across services.
- Backend Engineers wiring visibility into APIs, jobs, and workers.
Prerequisites
- Comfort with one programming language (Python, Node.js, Go, or Java).
- Basic understanding of HTTP services and asynchronous jobs.
- Familiarity with environment variables and service configuration.
Next steps
- Instrument one service in dev using console exporters.
- Add manual spans around a critical path.
- Switch to OTLP exporter and verify traces flow through your collector/backend.
- Roll out a standard resource attribute policy across repositories.
Quick Test — Access
Take the quick test to check your understanding. Everyone can take the test; only logged-in users get saved progress.