Why this matters
As a Platform Engineer, you enable every team to see what their services are doing. Clear standards for logging, metrics, and tracing make incidents faster to resolve, capacity easier to plan, and product experiments simpler to validate.
- During an outage: common fields and trace IDs let you follow a request across services.
- For performance: consistent metric names and units let you set SLOs and alert reliably.
- For compliance: standard redaction rules prevent PII from leaking into logs.
Who this is for
- Platform Engineers defining org-wide observability guidelines.
- Backend Engineers instrumenting services.
- SREs and DevOps engineers building dashboards and alerts.
Prerequisites
- Basic understanding of distributed services (HTTP, gRPC, queues).
- Familiarity with logs, metrics, and traces at a high level.
- Know what an SDK/agent does in your stack (e.g., app-level instrumentation).
Concept explained simply
Standards are the written rules that make telemetry consistent across teams. They define what to send, how to name it, and how to keep it safe and useful. The most common baseline today is OpenTelemetry and its semantic conventions (attributes like service.name, http.method, db.system).
Mental model: The 3-layer telemetry cake
- Foundation: context and identity. Consistent service.name, environment (deployment.environment), version, and correlation IDs (trace_id/span_id).
- Flavor: semantics. Shared field names, units, and levels so data lines up across services.
- Frosting: policy. Sampling, retention, and redaction to control cost and protect data.
Core standards to set
Logging standards
- Format: structured JSON; one line per event.
- Required fields: timestamp (UTC ISO-8601), severity (TRACE/DEBUG/INFO/WARN/ERROR), service.name, deployment.environment (prod/staging/dev), trace_id, span_id, event.name, message.
- HTTP attributes (when relevant): http.method, http.route, http.target, http.status_code, client.address (IP), user_agent.original.
- PII and secrets: never log raw emails, tokens, credentials, or card numbers. Redact or hash, and provide a redaction.rule_id when a rule was applied.
- Cardinality control: no unbounded values in key fields (e.g., avoid user IDs as log levels or logger names).
- Error structure: exception.type, exception.message, exception.stacktrace for ERROR logs (see the formatter sketch after this list).
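To make the required fields concrete, here is a minimal sketch of a log formatter built on Python's standard logging module. The JsonLineFormatter class, its constructor arguments, and the event.name default are illustrative assumptions, not a prescribed library API.

import json
import logging
import time

class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON line with the standard fields."""
    def __init__(self, service_name, service_version, environment):
        super().__init__()
        self.static_fields = {
            "service.name": service_name,
            "service.version": service_version,
            "deployment.environment": environment,
        }
    def format(self, record):
        event = {
            # UTC ISO-8601 timestamp with millisecond precision
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created))
                         + ".%03dZ" % int(record.msecs),
            "severity": record.levelname,  # Python emits WARNING; map to WARN if your standard requires it
            "event.name": record.__dict__.get("event.name", "app.log"),
            "message": record.getMessage(),
            **self.static_fields,
        }
        # Correlation IDs and HTTP attributes are passed via extra= at the call site.
        for key in ("trace_id", "span_id", "http.method", "http.route",
                    "http.target", "http.status_code", "redaction.rule_id"):
            value = record.__dict__.get(key)
            if value is not None:
                event[key] = value
        if record.exc_info:
            event["exception.type"] = record.exc_info[0].__name__
            event["exception.message"] = str(record.exc_info[1])
            event["exception.stacktrace"] = self.formatException(record.exc_info)
        return json.dumps(event)

# Usage: one handler per service; static fields set once, per-request fields via extra=.
handler = logging.StreamHandler()
handler.setFormatter(JsonLineFormatter("checkout-api", "1.12.3", "prod"))
log = logging.getLogger("checkout-api")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("Handled request", extra={"event.name": "http.server.request",
                                   "http.method": "POST", "http.route": "/v1/orders",
                                   "http.status_code": 201})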
Metrics standards
- Naming: <namespace>.<resource>.<action> (e.g., http.server.requests).
- Types: Counter (monotonic totals), UpDownCounter (can go up/down), Histogram (latency, sizes), Gauge (current value via observation).
- Units: SI units with a suffix (e.g., seconds, bytes, percent 0–100). Include a _total suffix only when your system expects it; otherwise rely on type metadata.
- Labels/attributes: keep low-cardinality. Common dimensions: service.name, deployment.environment, http.method, http.status_code.
- Histograms: define bucket boundaries for latency (e.g., 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s). Choose based on SLOs.
- Alertability: every alert maps to a metric with a clear unit and label set; avoid alerts on logs (see the instrument sketch after this list).
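One way to express these rules with the OpenTelemetry Python API is sketched below. The meter name, attribute values, and recorded numbers are illustrative; exporter and SDK setup are omitted, and in practice service.name and deployment.environment usually come from the Resource rather than per-point attributes.

from opentelemetry import metrics

# Instruments are created once per service and reused on every request.
meter = metrics.get_meter("checkout-api")
request_duration = meter.create_histogram(
    name="http.server.duration",
    unit="s",  # seconds, matching the units standard
    description="Duration of inbound HTTP requests",
)
request_count = meter.create_counter(
    name="http.server.requests",
    unit="1",
    description="Count of inbound HTTP requests",
)

# Low-cardinality attributes only: route template and status code, never user IDs.
attrs = {
    "http.method": "POST",
    "http.route": "/v1/orders",
    "http.status_code": 201,
}
request_duration.record(0.042, attributes=attrs)
request_count.add(1, attributes=attrs)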
Tracing standards
- Context propagation: W3C Trace Context headers (traceparent, tracestate) across HTTP and messaging.
- Naming: spans use semantic operation names (e.g., HTTP GET /orders/{id}, DB SELECT orders); resources include service.name and service.version.
- Status: set span status OK/ERROR with status.description on failure. Record exceptions via attributes and events.
- Attributes: use semantic conventions: http.method, http.route, net.peer.name/net.peer.port, db.system, messaging.system, etc.
- Sampling: default 10% head sampling for prod, 100% for staging/dev; enable tail-based sampling to capture all error traces or slow outliers (e.g., > 2s).
- Privacy: avoid PII in span attributes; prefer IDs that map internally (see the span sketch after this list).
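A span-level sketch with the OpenTelemetry Python API follows; the handler and the create_order call are hypothetical, and exporter/provider setup is omitted.

from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode

tracer = trace.get_tracer("checkout-api")

def handle_create_order(request):
    # Span name is the semantic operation (route template), not the raw URL.
    with tracer.start_as_current_span("HTTP POST /v1/orders", kind=SpanKind.SERVER) as span:
        span.set_attribute("http.method", "POST")
        span.set_attribute("http.route", "/v1/orders")
        try:
            order = create_order(request)  # hypothetical business logic
            span.set_attribute("http.status_code", 201)
            return order
        except Exception as exc:
            # Record the exception as a span event and mark the span as failed.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, description=str(exc)))
            span.set_attribute("http.status_code", 500)
            raise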
Governance & policy
- Environments: dev, staging, prod with clear separation and tags.
- Retention: logs 7–30 days based on level; metrics 13–18 months aggregated; traces 7–14 days, with error/slow traces kept longer. Adjust to cost and compliance.
- SLOs: define service SLOs (e.g., 99.9% of requests succeed within 300ms) and alerting windows (e.g., rolling 30 min).
- Ownership: each service.name must have an owning team and an escalation policy.
Worked examples
Example 1: Standardized HTTP request log
Goal: a single log line that can be correlated with traces and dashboards.
{
"timestamp": "2025-07-08T14:21:45.531Z",
"severity": "INFO",
"service.name": "checkout-api",
"service.version": "1.12.3",
"deployment.environment": "prod",
"trace_id": "7a6c2c0fbe1d4b7e8b1c2d3e4f5a6789",
"span_id": "9c2d3e4f5a67897a",
"event.name": "http.server.request",
"message": "Handled request",
"http.method": "POST",
"http.route": "/v1/orders",
"http.target": "/v1/orders?promo=redacted",
"http.status_code": 201,
"client.address": "203.0.113.34",
"user_agent.original": "Mozilla/5.0",
"redaction.rule_id": "query-param-default"
}
Why it works: structured JSON, consistent fields, no PII, and trace correlation.
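To fill trace_id and span_id from the active span so a log line like this correlates with its trace, one approach with the OpenTelemetry Python API is sketched below; the current_trace_ids helper is an illustrative name, not a library function.

from opentelemetry import trace

def current_trace_ids():
    """Return the active span's trace_id/span_id as lowercase hex strings."""
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {
        "trace_id": format(ctx.trace_id, "032x"),  # 32 hex characters
        "span_id": format(ctx.span_id, "016x"),    # 16 hex characters
    }

# Example: merge the IDs into a structured log call.
# log.info("Handled request", extra={**current_trace_ids(), "http.status_code": 201})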
Example 2: Metrics for API latency with SLO buckets
Define a histogram and counters.
Metric: http.server.duration (Histogram, unit=seconds)
Attributes: service.name, deployment.environment, http.method, http.route, http.status_code
Buckets (seconds): 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5
Metric: http.server.requests (Counter)
Attributes: service.name, deployment.environment, http.method, http.route, http.status_code
Why it works: standard names, units, and labels allow percentile SLOs and error-rate alerts.
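Bucket boundaries are usually configured at the SDK level rather than at the call site. A sketch with the OpenTelemetry Python SDK using a View is shown below; ConsoleMetricExporter stands in for whatever exporter you actually use.

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import ExplicitBucketHistogramAggregation, View

# Apply the SLO-aligned latency buckets (in seconds) to http.server.duration.
latency_view = View(
    instrument_name="http.server.duration",
    aggregation=ExplicitBucketHistogramAggregation(
        boundaries=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
    ),
)
provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
    views=[latency_view],
)
metrics.set_meter_provider(provider)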
Example 3: Trace across services with sampling policy
Setup: web calls checkout which calls payments.
- Trace context starts at web and is propagated via traceparent.
- Spans: HTTP GET /buy (web), HTTP POST /v1/orders (checkout), HTTP POST /charge (payments).
- Tail sampling rule: keep 100% of traces when http.status_code >= 500 or root span duration > 2s; otherwise keep 10%.
Outcome: cost control without losing critical failure/latency cases.
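The head-sampling half of this policy can be set in the SDK; the tail-based half normally lives in a collector or backend, so the keep_trace function below is only a hypothetical illustration of the rule, not a real API.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling: sample 10% of new traces in prod; child spans follow the parent's decision.
trace.set_tracer_provider(TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.1))))

# Hypothetical tail-sampling rule, normally expressed as collector or backend configuration.
def keep_trace(root_status_code, root_duration_seconds, sampled_by_head):
    if root_status_code >= 500:        # keep every error trace
        return True
    if root_duration_seconds > 2.0:    # keep slow outliers
        return True
    return sampled_by_head             # otherwise fall back to the 10% head decision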
Example 4: Preventing high-cardinality metrics
Anti-pattern: http.server.requests{user_id="123"} creates unbounded label values.
Fix: remove user_id label; log it in a structured log when needed; keep metrics focused on low-cardinality dimensions like route and status_code.
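A short sketch of the fix, assuming the request_count counter, log logger, and current_trace_ids helper from the earlier sketches:

# Metric: only bounded dimensions (route template, method, status code).
request_count.add(1, attributes={
    "http.method": "POST",
    "http.route": "/v1/orders",
    "http.status_code": 201,
})
# Log: unbounded identifiers belong here (internal or hashed IDs, never raw PII).
log.info("Order created", extra={**current_trace_ids(), "user.id": "u-8f3a"})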
Exercises
Do these in your own editor or notebook. The quick test at the end is available to everyone; only logged-in users will have their progress saved.
Exercise 1: Standardize a log event (ex1)
Create a single JSON log line for a successful POST to /v1/orders returning 201 from checkout-api in prod. Include: timestamp (UTC ISO), severity, service.name, service.version, deployment.environment, trace_id, span_id, event.name, message, http.method, http.route, http.target, http.status_code, client.address, user_agent.original, and a redaction.rule_id for any redacted value.
Checklist
- Structured JSON, one line
- Includes trace_id/span_id
- No PII or secrets
- Uses http.* fields and matches naming
Exercise 2: Define metrics and tracing policy (ex2)
For a public HTTP service, define metrics and a sampling policy: one histogram for latency (with buckets), one counter for requests, and tail-based rules for keeping error and slow traces. Include attributes and units. Keep labels low-cardinality.
Checklist
- Histogram unit is seconds
- Attributes include service.name and http.route
- Sampling captures errors and slow outliers
- No user-specific labels
Common mistakes
- Inconsistent names: mixing service, serviceName, and service.name. Standardize on service.name.
- Logging PII: emails, tokens, or full URLs with secrets. Fix with deterministic redaction and review.
- High-cardinality metrics: user IDs or request IDs as labels. Keep identifiers in logs/traces, not metric labels.
- Missing correlation: logs without trace_id/span_id. Ensure the logger picks up the active span context.
- Unbounded log levels: custom levels like INFO_42. Use the standard level set.
- Alerting on logs: prefer metric-based alerts for reliability and cost.
Self-check
- Pick a random service log: can you find service.name, deployment.environment, and trace_id at a glance?
- Open a latency dashboard: do units match seconds? Are buckets aligned to your SLOs?
- Trace a failed user action: can you follow it across services using the same trace ID?
Learning path
- Adopt common attribute keys: service.name, service.version, deployment.environment, trace_id, span_id.
- Write a one-page standard for logs, metrics, and traces (names, units, levels, sampling).
- Add SDK defaults/templates in your language of choice to enforce the standard.
- Roll out checks in CI to block non-compliant telemetry where feasible.
Practical projects
- Bootstrap an observability starter library that configures logging, metrics, and tracing with your standard fields and sampling.
- Build a “golden dashboard” for a template service: requests, errors, latency percentiles, resource usage.
- Implement a redaction module with unit tests and a redaction.rule_id attribute.
- Create a lint rule that rejects non-standard metric names or disallowed labels.
Next steps
- Map your current services to these standards and note gaps.
- Prioritize fixes that improve incident response (trace propagation, log correlation).
- Schedule a short instrumentation workshop with one product team and iterate.
Mini challenge
Pick one real endpoint in your system. In 30 minutes, instrument it so you can answer: “What is the p95 latency in prod, how many requests fail per 30 minutes, and show me one example trace with ERROR in the last hour?” Write down what you had to add or change in logs, metrics, and traces.
Quick Test
This test is available to everyone. Only logged-in users will have their progress saved.