
Logging Metrics Tracing Standards

Learn Logging Metrics Tracing Standards for free with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Why this matters

As a Platform Engineer, you enable every team to see what their services are doing. Clear standards for logging, metrics, and tracing make incidents faster to resolve, capacity easier to plan, and product experiments simpler to validate.

  • During an outage: common fields and trace IDs let you follow a request across services.
  • For performance: consistent metric names and units let you set SLOs and alert reliably.
  • For compliance: standard redaction rules prevent PII from leaking into logs.

Who this is for

  • Platform Engineers defining org-wide observability guidelines.
  • Backend Engineers instrumenting services.
  • SREs and DevOps engineers building dashboards and alerts.

Prerequisites

  • Basic understanding of distributed services (HTTP, gRPC, queues).
  • Familiarity with logs, metrics, and traces at a high level.
  • Know what an SDK/agent does in your stack (e.g., app-level instrumentation).

Concept explained simply

Standards are the written rules that make telemetry consistent across teams. They define what to send, how to name it, and how to keep it safe and useful. The most common baseline today is OpenTelemetry concepts and semantic conventions (attributes like service.name, http.method, db.system).

Mental model: The 3-layer telemetry cake
  1. Foundation: Context and identity — consistent service.name, environment (deployment.environment), version, and correlation IDs (trace_id/span_id).
  2. Flavor: Semantics — shared field names, units, and levels so data lines up across services.
  3. Frosting: Policy — sampling, retention, and redaction to control cost and protect data.

Core standards to set

Logging standards
  • Format: structured JSON; one line per event.
  • Required fields: timestamp (UTC ISO-8601), severity (TRACE/DEBUG/INFO/WARN/ERROR), service.name, deployment.environment (prod/staging/dev), trace_id, span_id, event.name, message.
  • HTTP attributes (when relevant): http.method, http.route, http.target, http.status_code, client.address (IP), user_agent.original.
  • PII and secrets: never log raw emails, tokens, credentials, or card numbers. Redact or hash. Provide a redaction.rule_id when applied.
  • Cardinality control: no unbounded values in key fields (e.g., never embed user IDs or request IDs in event names, logger names, or severity levels).
  • Error structure: exception.type, exception.message, exception.stacktrace for ERROR logs.
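The required-field list above can be enforced with a small builder. A minimal stdlib-only sketch (field names come from this standard; `make_log_line` and the `ctx` dict are illustrative, not part of any logging SDK):

```python
import json
from datetime import datetime, timezone

REQUIRED = ("timestamp", "severity", "service.name", "deployment.environment",
            "trace_id", "span_id", "event.name", "message")

def make_log_line(severity, event_name, message, ctx, **fields):
    """Build one structured JSON log line with the required baseline fields.

    `ctx` is assumed to carry service identity and the active span context;
    extra keyword fields (http.*, exception.*) are merged in as-is.
    """
    record = {
        "timestamp": datetime.now(timezone.utc)
                     .isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "severity": severity,
        "service.name": ctx["service.name"],
        "deployment.environment": ctx["deployment.environment"],
        "trace_id": ctx["trace_id"],
        "span_id": ctx["span_id"],
        "event.name": event_name,
        "message": message,
    }
    record.update(fields)
    missing = [k for k in REQUIRED if k not in record]
    if missing:
        raise ValueError(f"missing required log fields: {missing}")
    return json.dumps(record)  # one line per event
```

A wrapper like this makes the standard the path of least resistance: teams get the required fields for free and only supply event-specific attributes.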
Metrics standards
  • Naming: <namespace>.<resource>.<action> (e.g., http.server.requests).
  • Types: Counter (monotonic totals), UpDownCounter (can go up/down), Histogram (latency, sizes), Gauge (current value via observation).
  • Units: use base SI units and state them explicitly (e.g., seconds, bytes; percent as 0–100). Add a _total suffix only when your backend expects it (e.g., Prometheus counters); otherwise rely on type metadata.
  • Labels/attributes: keep low-cardinality. Common dimensions: service.name, deployment.environment, http.method, http.status_code.
  • Histograms: define bucket boundaries for latency (e.g., 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s). Choose based on SLOs.
  • Alertability: every alert maps to a metric with clear unit and label set; avoid alerts on logs.
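Bucketed histograms are what make percentile SLOs cheap to compute. A stdlib-only sketch of cumulative ("le"-style) bucket counting and a coarse quantile estimate, using the boundaries listed above (illustrative, not a metrics SDK):

```python
import bisect

# SLO-aligned latency bucket upper bounds, in seconds.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]

def bucket_counts(samples, buckets=BUCKETS):
    """Cumulative count of samples at or below each bucket upper bound.

    The final slot is the implicit +Inf bucket (total count).
    """
    counts = [0] * (len(buckets) + 1)
    for s in samples:
        counts[bisect.bisect_left(buckets, s)] += 1
    for i in range(1, len(counts)):
        counts[i] += counts[i - 1]
    return counts

def estimate_quantile(q, samples, buckets=BUCKETS):
    """Coarse quantile estimate: upper bound of the bucket holding rank q."""
    counts = bucket_counts(samples, buckets)
    rank = q * counts[-1]
    for bound, cumulative in zip(buckets, counts):
        if cumulative >= rank:
            return bound
    return float("inf")  # landed in the +Inf bucket
```

The estimate's resolution is only as good as the bucket boundaries, which is exactly why the standard says to choose them from your SLOs: a 300 ms SLO needs a boundary at or near 0.3 s.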
Tracing standards
  • Context propagation: W3C Trace Context headers (traceparent, tracestate) across HTTP and messaging.
  • Naming: spans use operation semantic names (e.g., HTTP GET /orders/{id}, DB SELECT orders), resources include service.name and service.version.
  • Status: set span status to OK or ERROR, with a status description on failure. Record exceptions as span events carrying exception.* attributes.
  • Attributes: use semantic conventions: http.method, http.route, net.peer.name/net.peer.port, db.system, messaging.system, etc.
  • Sampling: default 10% head sampling for prod, 100% for staging/dev; enable tail-based sampling to capture all error traces or slow outliers (e.g., > 2s).
  • Privacy: avoid PII in span attributes; prefer IDs that map internally.
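Context propagation hinges on every service reading the same header correctly. A sketch of validating a W3C Trace Context traceparent header (version 00 format: version, 32-hex trace-id, 16-hex parent-id, 2-hex flags; `parse_traceparent` is a hypothetical helper, not a real SDK call):

```python
import re

# traceparent = version "00" - trace-id - parent-id - trace-flags
_TRACEPARENT = re.compile(
    r"^00-(?P<trace_id>[0-9a-f]{32})-(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Return (trace_id, span_id, sampled) or None for an invalid header.

    Per the W3C spec, all-zero trace or span IDs are invalid, and the
    low bit of the flags byte is the 'sampled' flag.
    """
    m = _TRACEPARENT.match(header)
    if not m:
        return None
    trace_id, span_id = m["trace_id"], m["span_id"]
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    sampled = int(m["flags"], 16) & 0x01 == 1
    return trace_id, span_id, sampled
```

In practice your tracing SDK does this for you; a check like this is useful at the edge (gateways, queue consumers) to confirm context actually arrived.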
Governance & policy
  • Environments: dev, staging, prod with clear separation and tags.
  • Retention: logs 7–30 days based on level; metrics 13–18 months aggregated; traces 7–14 days with error/slow traces longer. Adjust to cost and compliance.
  • SLOs: define service SLOs (e.g., 99.9% success < 300ms) and alerting windows (e.g., rolling 30 min).
  • Ownership: each service.name must have an owning team and escalation policy.
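The error budget implied by an SLO target is simple arithmetic, and writing it down makes alerting windows concrete. A small illustration (assuming full unavailability consumes the budget linearly):

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed full unavailability for an SLO over a window.

    Example: a 99.9% target over 30 days leaves roughly 43.2 minutes.
    """
    return (1 - slo) * window_days * 24 * 60
```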

Worked examples

Example 1: Standardized HTTP request log

Goal: a single log line that can be correlated with traces and dashboards.

{
  "timestamp": "2025-07-08T14:21:45.531Z",
  "severity": "INFO",
  "service.name": "checkout-api",
  "service.version": "1.12.3",
  "deployment.environment": "prod",
  "trace_id": "7a6c2c0fbe1d4b7e8b1c2d3e4f5a6789",
  "span_id": "9c2d3e4f5a67897a",
  "event.name": "http.server.request",
  "message": "Handled request",
  "http.method": "POST",
  "http.route": "/v1/orders",
  "http.target": "/v1/orders?promo=redacted",
  "http.status_code": 201,
  "client.address": "203.0.113.34",
  "user_agent.original": "Mozilla/5.0",
  "redaction.rule_id": "query-param-default"
}

Why it works: structured JSON, consistent fields, no PII, and trace correlation.
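The redacted query string above (promo=redacted plus a redaction.rule_id) can come from a small helper applied before logging. A stdlib-only sketch (`redact_target` and the rule name are illustrative):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def redact_target(target, rule_id="query-param-default"):
    """Replace every query-parameter value in an http.target with 'redacted'.

    Returns (redacted_target, applied_rule_id) — rule_id is None when
    there was nothing to redact, so the log field can be omitted.
    """
    parts = urlsplit(target)
    if not parts.query:
        return target, None
    pairs = [(key, "redacted")
             for key, _ in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(pairs))), rule_id
```

Emitting the rule ID alongside the redacted value is what makes the policy auditable: reviewers can see that redaction happened and which rule did it.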

Example 2: Metrics for API latency with SLO buckets

Define a histogram and counters.

Metric: http.server.duration (Histogram, unit=seconds)
Attributes: service.name, deployment.environment, http.method, http.route, http.status_code
Buckets (seconds): 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5

Metric: http.server.requests (Counter)
Attributes: service.name, deployment.environment, http.method, http.route, http.status_code

Why it works: standard names, units, and labels allow percentile SLOs and error-rate alerts.

Example 3: Trace across services with sampling policy

Setup: web calls checkout which calls payments.

  • Trace context started at web, propagated via traceparent.
  • Spans: HTTP GET /buy (web), HTTP POST /v1/orders (checkout), HTTP POST /charge (payments).
  • Tail sampling rule: keep 100% traces when http.status_code >= 500 or root span duration > 2s; otherwise 10%.

Outcome: cost control without losing critical failure/latency cases.
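The tail-sampling rule in this example reduces to a single decision function. A sketch (`keep_trace` is hypothetical; hashing the trace ID makes the 10% baseline deterministic, so every collector reaches the same verdict for the same trace):

```python
import hashlib

def keep_trace(status_code, duration_s, trace_id, rate=0.10):
    """Tail-sampling decision: keep all errors and slow traces, else ~10%.

    Matches the rule above: 5xx status or root duration > 2s is always
    kept; otherwise a hash of the trace ID picks a uniform ~10% sample.
    """
    if status_code >= 500 or duration_s > 2.0:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```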

Example 4: Preventing high-cardinality metrics

Anti-pattern: http.server.requests{user_id="123"} creates unbounded label values.

Fix: remove user_id label; log it in a structured log when needed; keep metrics focused on low-cardinality dimensions like route and status_code.
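The fix can be enforced mechanically with a label allowlist checked in CI or in a thin metrics wrapper. A sketch (`validate_labels` is illustrative; the allowlist holds the example dimensions from this standard):

```python
# Low-cardinality dimensions permitted on metrics by this standard.
ALLOWED_LABELS = {
    "service.name", "deployment.environment",
    "http.method", "http.route", "http.status_code",
}

def validate_labels(labels, allowed=ALLOWED_LABELS):
    """Return the metric label keys that fall outside the allowlist.

    An empty result means the label set is compliant; offending keys
    can be surfaced in a CI failure or a wrapper-SDK exception.
    """
    return sorted(set(labels) - allowed)
```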

Exercises

Do these in your own editor or notebook. The quick test at the end is available to everyone; only logged-in users will have their progress saved.

Exercise 1: Standardize a log event (ex1)

Create a single JSON log line for a successful POST to /v1/orders returning 201 from checkout-api in prod. Include: timestamp (UTC ISO), severity, service.name, service.version, deployment.environment, trace_id, span_id, event.name, message, http.method, http.route, http.target, http.status_code, client.address, user_agent.original, and a redaction.rule_id for any redacted value.

Checklist
  • Structured JSON, one line
  • Includes trace_id/span_id
  • No PII or secrets
  • Uses http.* fields and matches naming

Exercise 2: Define metrics and tracing policy (ex2)

For a public HTTP service, define metrics and a sampling policy: one histogram for latency (with buckets), one counter for requests, and tail-based rules for keeping error and slow traces. Include attributes and units. Keep labels low-cardinality.

Checklist
  • Histogram unit is seconds
  • Attributes include service.name and http.route
  • Sampling captures errors and slow outliers
  • No user-specific labels

Common mistakes

  • Inconsistent names: mixing service, serviceName, service.name. Standardize on service.name.
  • Logging PII: emails, tokens, or full URLs with secrets. Fix with deterministic redaction and review.
  • High-cardinality metrics: user IDs or request IDs as labels. Keep identifiers in logs/traces, not metric labels.
  • Missing correlation: logs without trace_id/span_id. Ensure the logger picks up the active span context.
  • Unbounded log levels: custom levels like INFO_42. Use the standard level set.
  • Alerting on logs: prefer metric-based alerts for reliability and cost.
Self-check
  • Pick a random service log — can you find service.name, deployment.environment, and trace_id in one glance?
  • Open a latency dashboard — do units match seconds? Are buckets aligned to your SLOs?
  • Trace a failed user action — can you follow it across services using the same trace ID?

Learning path

  1. Adopt common attribute keys: service.name, service.version, deployment.environment, trace_id, span_id.
  2. Write a one-page standard for logs, metrics, and traces (names, units, levels, sampling).
  3. Add SDK defaults/templates in your language of choice to enforce the standard.
  4. Roll out checks in CI to block non-compliant telemetry where feasible.

Practical projects

  • Bootstrap an observability starter library that configures logging, metrics, and tracing with your standard fields and sampling.
  • Build a “golden dashboard” for a template service: requests, errors, latency percentiles, resource usage.
  • Implement a redaction module with unit tests and a redaction.rule_id attribute.
  • Create a lint rule that rejects non-standard metric names or disallowed labels.
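A lint rule like the last item can start as a simple pattern check against the <namespace>.<resource>.<action> convention from the metrics standard (the regex below is one possible interpretation, assuming lowercase dot-separated segments):

```python
import re

# At least three lowercase dot-separated segments, e.g. http.server.requests.
METRIC_NAME = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2,}$")

def check_metric_name(name):
    """True when the metric name follows <namespace>.<resource>.<action>."""
    return bool(METRIC_NAME.match(name))
```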

Next steps

  • Map your current services to these standards and note gaps.
  • Prioritize fixes that improve incident response (trace propagation, log correlation).
  • Schedule a short instrumentation workshop with one product team and iterate.

Mini challenge

Pick one real endpoint in your system. In 30 minutes, instrument it so you can answer: “What is the p95 latency in prod, how many requests fail per 30 minutes, and show me one example trace with ERROR in the last hour?” Write down what you had to add or change in logs, metrics, and traces.

Quick Test

This test is available to everyone. Only logged-in users will have their progress saved.

Practice Exercises

2 exercises to complete

Instructions

Produce a single structured JSON log line for a successful POST to /v1/orders from checkout-api in prod with 201 status. Include: timestamp (UTC ISO), severity, service.name, service.version, deployment.environment, trace_id, span_id, event.name, message, http.method, http.route, http.target, http.status_code, client.address, user_agent.original. If any sensitive data would appear (e.g., query params), redact it and include redaction.rule_id.

Expected Output
{ "timestamp": "YYYY-MM-DDThh:mm:ss.sssZ", "severity": "INFO", "service.name": "checkout-api", "service.version": "x.y.z", "deployment.environment": "prod", "trace_id": "32-hex", "span_id": "16-hex", "event.name": "http.server.request", "message": "Handled request", "http.method": "POST", "http.route": "/v1/orders", "http.target": "/v1/orders?promo=redacted", "http.status_code": 201, "client.address": "IP", "user_agent.original": "UA", "redaction.rule_id": "query-param-default" }

Logging Metrics Tracing Standards — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

