Why this skill matters for Platform Engineers
As a Platform Engineer, you build and run the backbone that product teams rely on. An effective Observability Platform lets you see how services behave in production, detect issues early, and shorten incident time-to-resolution. Mastering logs, metrics, traces, SLOs, and incident routines unlocks reliable releases, faster debugging, and data-driven improvement.
- You’ll enable teams to self-serve dashboards and alerts.
- You’ll standardize instrumentation so signals are consistent across services.
- You’ll reduce on-call stress with actionable, low-noise alerts and clear runbooks.
Who this is for and prerequisites
Who this is for
- Platform/DevOps/SRE engineers building shared infrastructure and tooling.
- Backend engineers integrating services with centralized observability.
- Team leads aiming to improve reliability and incident response.
Prerequisites
- Comfort with Linux, containers, and CI/CD basics.
- Familiarity with at least one backend language (Go, Java, Python, or Node.js).
- Basic knowledge of HTTP, load balancers, and service-to-service communication.
Learning path: from zero to reliable signals
- Adopt standards for logs, metrics, traces
Goal: Establish consistent naming, labels/tags, sampling, and retention. Define conventions once, apply everywhere.
- Pick a metric naming scheme (service, subsystem, unit, label rules).
- Define log format (JSON), severity levels, and required fields (trace/span IDs).
- Choose trace propagation: W3C traceparent/tracestate.
- Instrumentation libraries and SDKs
Goal: Integrate OpenTelemetry (or equivalent) across services with minimal friction.
- Wrap HTTP servers/clients with auto-instrumentation.
- Add custom spans for key business operations.
- Emit RED/USE metrics (Rate/Errors/Duration; Utilization/Saturation/Errors).
- Dashboards and alerts
Goal: Build actionable visuals and alerts tied to user impact.
- SLOs, SLIs, and error budgets
Goal: Align alerts and priorities to reliability goals.
- Centralized log management
Goal: Consistent, queryable logs with lifecycle policies.
- Tracing context propagation
Goal: End-to-end journey across microservices is traceable.
- Incident management workflows
Goal: Clear escalation, triage, and communication patterns.
- Postmortems and learning
Goal: Turn incidents into durable improvements.
Worked examples (copy, run, adapt)
1) Add OpenTelemetry to a simple HTTP service (Go)
Sets up tracing with an OTLP exporter and wraps handlers so server spans are created automatically; metrics can be wired up the same way via the OTel metrics SDK.
// go.mod: require go.opentelemetry.io/otel, go.opentelemetry.io/otel/sdk,
// go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp, and
// go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initTracer configures an OTLP/HTTP trace exporter and registers a global tracer provider.
// The returned function flushes pending spans on shutdown.
func initTracer(ctx context.Context) func() {
	exp, err := otlptracehttp.New(ctx)
	if err != nil {
		log.Fatalf("create OTLP trace exporter: %v", err)
	}
	res, err := resource.New(ctx, resource.WithAttributes(semconv.ServiceName("cart-api")))
	if err != nil {
		log.Fatalf("create resource: %v", err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp), sdktrace.WithResource(res))
	otel.SetTracerProvider(tp)
	return func() { _ = tp.Shutdown(ctx) }
}

func main() {
	ctx := context.Background()
	shutdown := initTracer(ctx)
	defer shutdown()

	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	// otelhttp wraps the handler so every request produces a server span.
	http.Handle("/healthz", otelhttp.NewHandler(hello, "healthz"))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Verify that traces arrive in your collector and tracing backend.
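To cover the "custom spans for key business operations" step, here is a minimal sketch of wrapping an operation in its own span and recording failures on it. It is meant to be added to the service above; chargePayment and the attribute key cart.id are hypothetical placeholders, not a fixed convention.
// Additional imports for the service above:
//   "errors"
//   "go.opentelemetry.io/otel"
//   "go.opentelemetry.io/otel/attribute"
//   "go.opentelemetry.io/otel/codes"

// chargePayment stands in for a real payment call (hypothetical helper for this sketch).
func chargePayment(ctx context.Context, cartID string) error {
	if cartID == "" {
		return errors.New("empty cart id")
	}
	return nil
}

// checkout wraps a key business operation in a custom span and records errors on it.
func checkout(ctx context.Context, cartID string) error {
	tracer := otel.Tracer("cart-api")
	ctx, span := tracer.Start(ctx, "cart.checkout")
	defer span.End()
	span.SetAttributes(attribute.String("cart.id", cartID)) // keep attributes low-cardinality

	if err := chargePayment(ctx, cartID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "payment failed")
		return err
	}
	return nil
}
Keep span attributes to bounded values (IDs of business entities are fine on spans, unlike on metric labels).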
2) PromQL: 95th percentile latency + alert
Compute p95 for an HTTP duration histogram and alert if too high.
# p95 over 5m window
histogram_quantile(0.95, sum by (le) (rate(http_server_duration_seconds_bucket[5m])))
# Simple alert (example threshold 500ms)
- alert: HighP95Latency
expr: histogram_quantile(0.95, sum by (le) (rate(http_server_duration_seconds_bucket[5m]))) > 0.5
for: 10m
labels:
severity: page
annotations:
summary: "p95 latency > 500ms"
runbook: "Check upstream dependencies and recent deploys"
3) SLO and burn rate alerts
Suppose an SLO: 99.9% request success over 30 days. SLI is success rate.
# Success ratio (5m)
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Error ratio (5m)
1 - (
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)
# Multi-window, multi-burn-rate (MWMBR) alert example
- alert: FastBurn
expr: (1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))))
> (1 - 0.999) * 14.4 # 14.4x fast burn
for: 5m
labels: {severity: page}
annotations: {summary: "Fast error budget burn"}
- alert: SlowBurn
expr: (1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))))
> (1 - 0.999) * 6 # 6x slow burn
for: 2h
labels: {severity: ticket}
annotations: {summary: "Slow error budget burn"}
With a 99.9% SLO over 30 days, a 14.4x burn rate consumes roughly 2% of the error budget per hour, while lower multipliers over longer windows catch slower, steadier burns. Adjust multipliers and windows to tune noise vs. sensitivity.
4) Centralized logs with Fluent Bit (ingest, filter, output)
[SERVICE]
Flush 1
Parsers_File parsers.conf
[INPUT]
Name tail
Path /var/log/app/*.log
Parser json
Tag app.*
[FILTER]
Name modify
Match app.*
Rename level severity
Add service cart-api
[OUTPUT]
Name es
Match app.*
Host elasticsearch
Port 9200
Index logs-app
Ensure your app logs in JSON and includes correlation IDs or trace IDs.
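To make that correlation concrete, here is a minimal sketch, assuming Go's log/slog and the OTel trace API, of HTTP middleware for the service in example 1 that emits JSON log lines carrying trace_id and span_id from the active span; the field names follow this guide's convention and are an assumption, not a standard.
// Additional imports for the service above:
//   "log/slog"
//   "os"
//   "go.opentelemetry.io/otel/trace"

// loggingMiddleware emits one JSON log line per request, including the trace and span IDs
// of the active server span so logs and traces can be joined.
func loggingMiddleware(next http.Handler) http.Handler {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sc := trace.SpanContextFromContext(r.Context())
		logger.Info(r.Method+" "+r.URL.Path,
			slog.String("service", "cart-api"),
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()),
		)
		next.ServeHTTP(w, r)
	})
}
Register it inside the otelhttp wrapper so the span already exists when the log line is written, e.g. http.Handle("/healthz", otelhttp.NewHandler(loggingMiddleware(hello), "healthz")).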
5) Trace context propagation (W3C)
Include headers on outbound HTTP calls so downstream spans attach to the same trace.
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendorname=opaque-info
If you see broken trace trees, verify that gateways, queues, and async workers preserve these headers or propagate equivalent context.
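For queues and background workers specifically, here is a minimal sketch, assuming the OTel propagation API and a hypothetical Message type, of injecting the trace context into message headers on publish and restoring it on consume.
package queue

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// Message is a hypothetical queue payload; Headers is where traceparent/tracestate travel.
type Message struct {
	Headers map[string]string
	Body    []byte
}

// Publish injects the current trace context into the message headers before sending.
// Requires a registered global propagator, e.g. otel.SetTextMapPropagator(propagation.TraceContext{}) at startup.
func Publish(ctx context.Context, msg *Message) {
	if msg.Headers == nil {
		msg.Headers = map[string]string{}
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.MapCarrier(msg.Headers))
	// ... hand msg to the broker client here
}

// Consume restores the trace context so the worker's spans join the producer's trace.
func Consume(msg *Message) context.Context {
	return otel.GetTextMapPropagator().Extract(context.Background(), propagation.MapCarrier(msg.Headers))
}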
Drills and exercises
- Define a naming/labeling convention for metrics and apply it to 2 services.
- Emit logs in JSON with a consistent field set: timestamp, level, service, trace_id, span_id, msg.
- Add a custom span around a database call and record duration, rows, and error status.
- Build a dashboard with: request rate, error rate, p95 latency, and dependency health.
- Write at least one fast-burn and one slow-burn alert tied to an SLO.
- Run a game-day: introduce a fault, confirm alerting and traceability, and capture findings.
Common mistakes and debugging tips
Too many alerts (alert fatigue)
Use SLO-based MWMBR alerts and route non-urgent issues to tickets. Remove duplicate alerts and suppress during planned maintenance.
High metric cardinality
Don’t put user IDs, request IDs, or free-form strings into labels. Prefer aggregation keys like route, status, region.
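A minimal sketch, assuming prometheus/client_golang, of a request counter that keeps cardinality bounded by labeling only the route template, method, and status class; the metric name and label set are illustrative.
package metrics

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Bounded label set: route template, method, status class. Never raw paths, user IDs, or request IDs.
var httpRequests = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "HTTP requests by route template, method, and status class.",
	},
	[]string{"route", "method", "status"},
)

// RecordRequest uses the route template ("/orders/{id}") rather than the raw path,
// and collapses status codes into classes ("2xx", "5xx") to keep the series count small.
func RecordRequest(routeTemplate, method string, statusCode int) {
	httpRequests.WithLabelValues(routeTemplate, method, fmt.Sprintf("%dxx", statusCode/100)).Inc()
}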
Uncorrelated logs and traces
Inject trace_id/span_id into log context. Verify middleware adds them and your logger outputs these fields.
Broken trace trees in async flows
Queues and background workers must serialize context (traceparent) and restore it on consumption.
Dashboards not actionable
Lead with high-level SLIs (rate, errors, latency). Link panels by labels so you can drill down by service, version, or region.
Mini project: Observability Starter Pack
Deliver a small but complete kit any service can adopt in an hour.
- Create a library/module that initializes OpenTelemetry (traces + metrics) with service name and resource attributes.
- Provide logging middleware that injects trace IDs into log context and emits JSON logs.
- Publish a dashboard JSON with RED/USE panels and a variable for service name.
- Write two alert rules: High error budget burn (fast) and Slow burn (long window).
- Document a one-page runbook: how to onboard, verify signals, and troubleshoot missing data.
Acceptance criteria
- Installation requires ≤ 10 lines of code change.
- Traces show at least one custom span per critical operation.
- Dashboards render for any service via a service variable.
- Alerts fire with clear messages that describe user impact.
Practical projects to cement skills
- Canary rollout with observability gates: Promote only if error rate and p95 stay within SLO for 30 minutes (see the gate-check sketch after this list).
- Dependency map: Build a service graph from traces and highlight the top 3 contributing services to p95.
- Retention policy: Set tiered log retention (hot 7d, warm 30d, cold 90d) and a sampling policy for traces.
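For the canary-rollout project, here is a minimal sketch, assuming Prometheus's Go API client, a deployment="canary" label, and an example 0.5s threshold, of a single gate check; a real gate would evaluate both error rate and p95 repeatedly over the 30-minute window.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// gateQuery runs an instant query and reports whether its single value is below the threshold.
func gateQuery(ctx context.Context, promAPI v1.API, query string, threshold float64) (bool, error) {
	result, _, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		return false, err
	}
	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		return false, fmt.Errorf("unexpected or empty result for %q", query)
	}
	return float64(vec[0].Value) < threshold, nil
}

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"}) // assumed address
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	// Gate on canary p95 latency; align the threshold with your SLO.
	p95 := `histogram_quantile(0.95, sum by (le) (rate(http_server_duration_seconds_bucket{deployment="canary"}[5m])))`
	ok, err := gateQuery(context.Background(), promAPI, p95, 0.5)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("canary p95 within gate:", ok)
}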
Next steps
- Roll out the starter pack to one critical and one non-critical service. Compare incident trends after 2 weeks.
- Introduce an on-call dry run: synthetic failure with a stopwatch from alert to fix. Capture gaps.
- Automate: add dashboards/alerts provisioning to your IaC so new services are observability-ready by default.
Subskills
Logging, Metrics, and Tracing Standards
Define unified formats, labels, naming, sampling, and retention so all teams speak the same observability language.
Instrumentation Libraries And SDKs
Use OpenTelemetry (or equivalent) to capture traces/metrics with minimal code and consistent resource metadata.
Dashboards And Alerts
Design service and fleet dashboards with SLI panels and SLO-aligned alerting that reduces noise.
SLOs, SLIs, and Error Budgets
Set user-centric targets, measure them, and govern reliability with burn-rate alerts and budget policies.
Centralized Log Management
Ship JSON logs to a central store, enrich with context, and implement lifecycle and cost controls.
Tracing Context Propagation
Propagate W3C context across sync/async boundaries for full request journeys.
Incident Management Workflows
Clear roles (incident commander, comms, scribe), triage steps, and communication templates.
Postmortems And Learning
Blameless, factual analyses that drive preventive actions and standards updates.
Learning path reminder
Focus on standards and instrumentation first. Then land dashboards/alerts tied to SLOs. Finally, refine incident and learning loops.