Why this skill matters for Platform Engineers
As a Platform Engineer, you build and run the backbone that product teams rely on. An effective Observability Platform lets you see how services behave in production, detect issues early, and shorten incident time-to-resolution. Mastering logs, metrics, traces, SLOs, and incident routines unlocks reliable releases, faster debugging, and data-driven improvement.
- You’ll enable teams to self-serve dashboards and alerts.
- You’ll standardize instrumentation so signals are consistent across services.
- You’ll reduce on-call stress with actionable, low-noise alerts and clear runbooks.
Who this is for and prerequisites
Who this is for
- Platform/DevOps/SRE engineers building shared infrastructure and tooling.
- Backend engineers integrating services with centralized observability.
- Team leads aiming to improve reliability and incident response.
Prerequisites
- Comfort with Linux, containers, and CI/CD basics.
- Familiarity with at least one backend language (Go, Java, Python, or Node.js).
- Basic knowledge of HTTP, load balancers, and service-to-service communication.
Learning path: from zero to reliable signals
- Adopt standards for logs, metrics, traces
Goal: Establish consistent naming, labels/tags, sampling, and retention. Define conventions once, apply everywhere.
- Pick a metric naming scheme (service, subsystem, unit, label rules).
- Define log format (JSON), severity levels, and required fields (trace/span IDs).
- Choose trace propagation: W3C traceparent/tracestate.
- Instrumentation libraries and SDKs
Goal: Integrate OpenTelemetry (or equivalent) across services with minimal friction.
- Wrap HTTP servers/clients with auto-instrumentation.
- Add custom spans for key business operations.
- Emit RED/USE metrics (Rate/Errors/Duration; Utilization/Saturation/Errors).
- Dashboards and alerts
Goal: Build actionable visuals and alerts tied to user impact.
- SLOs, SLIs, and error budgets
Goal: Align alerts and priorities to reliability goals.
- Centralized log management
Goal: Consistent, queryable logs with lifecycle policies.
- Tracing context propagation
Goal: End-to-end journey across microservices is traceable.
- Incident management workflows
Goal: Clear escalation, triage, and communication patterns.
- Postmortems and learning
Goal: Turn incidents into durable improvements.
Worked examples (copy, run, adapt)
1) Add OpenTelemetry to a simple HTTP service (Go)
Sets up tracing with an OTLP exporter and wraps handlers so server spans are created automatically; metrics can be wired up the same way via the OTel metrics SDK.
// go.mod: require go.opentelemetry.io/otel, go.opentelemetry.io/otel/sdk,
// go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp, and
// go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initTracer configures an OTLP/HTTP trace exporter and registers a global tracer provider.
// The returned function flushes pending spans on shutdown.
func initTracer(ctx context.Context) func() {
	exp, err := otlptracehttp.New(ctx)
	if err != nil {
		log.Fatalf("create OTLP trace exporter: %v", err)
	}
	res, err := resource.New(ctx, resource.WithAttributes(semconv.ServiceName("cart-api")))
	if err != nil {
		log.Fatalf("create resource: %v", err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp), sdktrace.WithResource(res))
	otel.SetTracerProvider(tp)
	return func() { _ = tp.Shutdown(ctx) }
}

func main() {
	ctx := context.Background()
	shutdown := initTracer(ctx)
	defer shutdown()

	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	// otelhttp wraps the handler so every request produces a server span.
	http.Handle("/healthz", otelhttp.NewHandler(hello, "healthz"))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Verify that traces arrive in your collector and tracing backend.
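To cover the "custom spans for key business operations" step, here is a minimal sketch of wrapping an operation in its own span and recording failures on it. It is meant to be added to the service above; chargePayment and the attribute key cart.id are hypothetical placeholders, not a fixed convention.
// Additional imports for the service above:
//   "errors"
//   "go.opentelemetry.io/otel"
//   "go.opentelemetry.io/otel/attribute"
//   "go.opentelemetry.io/otel/codes"

// chargePayment stands in for a real payment call (hypothetical helper for this sketch).
func chargePayment(ctx context.Context, cartID string) error {
	if cartID == "" {
		return errors.New("empty cart id")
	}
	return nil
}

// checkout wraps a key business operation in a custom span and records errors on it.
func checkout(ctx context.Context, cartID string) error {
	tracer := otel.Tracer("cart-api")
	ctx, span := tracer.Start(ctx, "cart.checkout")
	defer span.End()
	span.SetAttributes(attribute.String("cart.id", cartID)) // keep attributes low-cardinality

	if err := chargePayment(ctx, cartID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "payment failed")
		return err
	}
	return nil
}
Keep span attributes to bounded values (IDs of business entities are fine on spans, unlike on metric labels).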
2) PromQL: 95th percentile latency + alert
Compute p95 for an HTTP duration histogram and alert if too high.
# p95 over 5m window
histogram_quantile(0.95, sum by (le) (rate(http_server_duration_seconds_bucket[5m])))
# Simple alert (example threshold 500ms)
- alert: HighP95Latency
expr: histogram_quantile(0.95, sum by (le) (rate(http_server_duration_seconds_bucket[5m]))) > 0.5
for: 10m
labels:
severity: page
annotations:
summary: "p95 latency > 500ms"
runbook: "Check upstream dependencies and recent deploys"
3) SLO and burn rate alerts
Suppose an SLO: 99.9% request success over 30 days. SLI is success rate.
# Success ratio (5m)
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Error ratio (5m)
1 - (
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)
# Multi-window, multi-burn-rate (MWMBR) alert example
- alert: FastBurn
expr: (1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))))
> (1 - 0.999) * 14.4 # 14.4x fast burn
for: 5m
labels: {severity: page}
annotations: {summary: "Fast error budget burn"}
- alert: SlowBurn
expr: (1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))))
> (1 - 0.999) * 6 # 6x slow burn
for: 2h
labels: {severity: ticket}
annotations: {summary: "Slow error budget burn"}
With a 99.9% SLO over 30 days, a 14.4x burn rate consumes roughly 2% of the error budget per hour, while lower multipliers over longer windows catch slower, steadier burns. Adjust multipliers and windows to tune noise vs. sensitivity.
4) Centralized logs with Fluent Bit (ingest, filter, output)
[SERVICE]
Flush 1
Parsers_File parsers.conf
[INPUT]
Name tail
Path /var/log/app/*.log
Parser json
Tag app.*
[FILTER]
Name modify
Match app.*
Rename level severity
Add service cart-api
[OUTPUT]
Name es
Match app.*
Host elasticsearch
Port 9200
Index logs-app
Ensure your app logs in JSON and includes correlation IDs or trace IDs.
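To make that correlation concrete, here is a minimal sketch, assuming Go's log/slog and the OTel trace API, of HTTP middleware for the service in example 1 that emits JSON log lines carrying trace_id and span_id from the active span; the field names follow this guide's convention and are an assumption, not a standard.
// Additional imports for the service above:
//   "log/slog"
//   "os"
//   "go.opentelemetry.io/otel/trace"

// loggingMiddleware emits one JSON log line per request, including the trace and span IDs
// of the active server span so logs and traces can be joined.
func loggingMiddleware(next http.Handler) http.Handler {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sc := trace.SpanContextFromContext(r.Context())
		logger.Info(r.Method+" "+r.URL.Path,
			slog.String("service", "cart-api"),
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()),
		)
		next.ServeHTTP(w, r)
	})
}
Register it inside the otelhttp wrapper so the span already exists when the log line is written, e.g. http.Handle("/healthz", otelhttp.NewHandler(loggingMiddleware(hello), "healthz")).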
5) Trace context propagation (W3C)
Include headers on outbound HTTP calls so downstream spans attach to the same trace.
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendorname=opaque-info
If you see broken trace trees, verify that gateways, queues, and async workers preserve these headers or propagate equivalent context.
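For queues and background workers specifically, here is a minimal sketch, assuming the OTel propagation API and a hypothetical Message type, of injecting the trace context into message headers on publish and restoring it on consume.
package queue

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// Message is a hypothetical queue payload; Headers is where traceparent/tracestate travel.
type Message struct {
	Headers map[string]string
	Body    []byte
}

// Publish injects the current trace context into the message headers before sending.
// Requires a registered global propagator, e.g. otel.SetTextMapPropagator(propagation.TraceContext{}) at startup.
func Publish(ctx context.Context, msg *Message) {
	if msg.Headers == nil {
		msg.Headers = map[string]string{}
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.MapCarrier(msg.Headers))
	// ... hand msg to the broker client here
}

// Consume restores the trace context so the worker's spans join the producer's trace.
func Consume(msg *Message) context.Context {
	return otel.GetTextMapPropagator().Extract(context.Background(), propagation.MapCarrier(msg.Headers))
}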
Drills and exercises
- Define a naming/labeling convention for metrics and apply it to 2 services.
- Emit logs in JSON with a consistent field set: timestamp, level, service, trace_id, span_id, msg.
- Add a custom span around a database call and record duration, rows, and error status.
- Build a dashboard with: request rate, error rate, p95 latency, and dependency health.
- Write at least one fast-burn and one slow-burn alert tied to an SLO.
- Run a game-day: introduce a fault, confirm alerting and traceability, and capture findings.
Common mistakes and debugging tips
Too many alerts (alert fatigue)
Use SLO-based MWMBR alerts and route non-urgent issues to tickets. Remove duplicate alerts and suppress during planned maintenance.
High metric cardinality
Don’t put user IDs, request IDs, or free-form strings into labels. Prefer aggregation keys like route, status, region.
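A minimal sketch, assuming prometheus/client_golang, of a request counter that keeps cardinality bounded by labeling only the route template, method, and status class; the metric name and label set are illustrative.
package metrics

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Bounded label set: route template, method, status class. Never raw paths, user IDs, or request IDs.
var httpRequests = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "HTTP requests by route template, method, and status class.",
	},
	[]string{"route", "method", "status"},
)

// RecordRequest uses the route template ("/orders/{id}") rather than the raw path,
// and collapses status codes into classes ("2xx", "5xx") to keep the series count small.
func RecordRequest(routeTemplate, method string, statusCode int) {
	httpRequests.WithLabelValues(routeTemplate, method, fmt.Sprintf("%dxx", statusCode/100)).Inc()
}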
Uncorrelated logs and traces
Inject trace_id/span_id into log context. Verify middleware adds them and your logger outputs these fields.
Broken trace trees in async flows
Queues and background workers must serialize context (traceparent) and restore it on consumption.
Dashboards not actionable
Lead with high-level SLIs (rate, errors, latency). Link panels by labels so you can drill down by service, version, or region.
Mini project: Observability Starter Pack
Deliver a small but complete kit any service can adopt in an hour.
- Create a library/module that initializes OpenTelemetry (traces + metrics) with service name and resource attributes.
- Provide logging middleware that injects trace IDs into log context and emits JSON logs.
- Publish a dashboard JSON with RED/USE panels and a variable for service name.
- Write two alert rules: High error budget burn (fast) and Slow burn (long window).
- Document a one-page runbook: how to onboard, verify signals, and troubleshoot missing data.
Acceptance criteria
- Installation requires ≤ 10 lines of code change.
- Traces show at least one custom span per critical operation.
- Dashboards render for any service via a service variable.
- Alerts fire with clear messages that describe user impact.
Practical projects to cement skills
- Canary rollout with observability gates: Promote only if error rate and p95 stay within SLO for 30 minutes (see the gate-check sketch after this list).
- Dependency map: Build a service graph from traces and highlight the top 3 contributing services to p95.
- Retention policy: Set tiered log retention (hot 7d, warm 30d, cold 90d) and a sampling policy for traces.
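For the canary-rollout project, here is a minimal sketch, assuming Prometheus's Go API client, a deployment="canary" label, and an example 0.5s threshold, of a single gate check; a real gate would evaluate both error rate and p95 repeatedly over the 30-minute window.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// gateQuery runs an instant query and reports whether its single value is below the threshold.
func gateQuery(ctx context.Context, promAPI v1.API, query string, threshold float64) (bool, error) {
	result, _, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		return false, err
	}
	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		return false, fmt.Errorf("unexpected or empty result for %q", query)
	}
	return float64(vec[0].Value) < threshold, nil
}

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"}) // assumed address
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	// Gate on canary p95 latency; align the threshold with your SLO.
	p95 := `histogram_quantile(0.95, sum by (le) (rate(http_server_duration_seconds_bucket{deployment="canary"}[5m])))`
	ok, err := gateQuery(context.Background(), promAPI, p95, 0.5)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("canary p95 within gate:", ok)
}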
Next steps
- Roll out the starter pack to one critical and one non-critical service. Compare incident trends after 2 weeks.
- Introduce an on-call dry run: synthetic failure with a stopwatch from alert to fix. Capture gaps.
- Automate: add dashboards/alerts provisioning to your IaC so new services are observability-ready by default.
Subskills
Logging, Metrics, and Tracing Standards
Define unified formats, labels, naming, sampling, and retention so all teams speak the same observability language.
Instrumentation Libraries And SDKs
Use OpenTelemetry (or equivalent) to capture traces/metrics with minimal code and consistent resource metadata.
Dashboards And Alerts
Design service and fleet dashboards with SLI panels and SLO-aligned alerting that reduces noise.
SLOs, SLIs, and Error Budgets
Set user-centric targets, measure them, and govern reliability with burn-rate alerts and budget policies.
Centralized Log Management
Ship JSON logs to a central store, enrich with context, and implement lifecycle and cost controls.
Tracing Context Propagation
Propagate W3C context across sync/async boundaries for full request journeys.
Incident Management Workflows
Clear roles (incident commander, comms, scribe), triage steps, and communication templates.
Postmortems And Learning
Blameless, factual analyses that drive preventive actions and standards updates.
Learning path reminder
Focus on standards and instrumentation first. Then land dashboards/alerts tied to SLOs. Finally, refine incident and learning loops.