
Observability Stack Basics

Learn Observability Stack Basics for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Machine Learning Engineer, you ship models that must be fast, reliable, and cost-effective in the cloud. Observability lets you answer: Is my service healthy? Why is it slow? Is the model drifting? Without it, debugging becomes guesswork.

  • Real tasks you will face: define SLIs/SLOs for inference latency and error rate; investigate spikes in 5xx errors; track GPU/CPU saturation; correlate a bad release with a performance drop; prove that a fix worked.
  • Outcome: faster incident resolution, safer deploys, data-driven capacity planning, and trustworthy ML systems.

Concept explained simply

Observability is your system's "flight instruments." It uses telemetry to show what is happening and why.

  • Metrics: numeric time series (e.g., requests per second, 95th percentile latency). Great for trends, SLOs, and alerts.
  • Logs: structured records of events (e.g., inference result, error details). Great for investigation and audit.
  • Traces: a request's path across services, with spans and timing. Great for finding slow hops and bottlenecks.
  • Dashboards: curated views of key metrics for humans.
  • Alerting: rules that notify on symptoms (latency, error rate) rather than underlying causes.

Mental model

Think of three pillars supporting a single story about a request: metrics show the crowd behavior, logs tell individual stories, and traces connect stories into a journey. Correlate everything with IDs (trace_id/request_id) so you can jump between them during an incident.

Core building blocks (ML-flavored)

Structured log example (inference); a Python sketch for emitting logs like this follows the JSON
{
  "timestamp": "2026-01-01T12:00:00Z",
  "level": "INFO",
  "service": "ml-infer",
  "model_name": "recommender",
  "model_version": "v23",
  "request_id": "98f7-21ab",
  "trace_id": "4fdc1a3c9d8be321",
  "latency_ms": 142,
  "status_code": 200,
  "user_region": "eu-west",
  "input_schema_version": "1.3",
  "features_used": ["age", "country", "session_length"],
  "prediction": 0.87
}
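
One way to emit logs in this shape is Python's standard logging module plus json.dumps, as in the minimal sketch below. The field set mirrors the example above; emit_inference_log is a hypothetical helper name, not part of any library.

import json
import logging
import sys
from datetime import datetime, timezone

# Minimal sketch: write one JSON object per line to stdout.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("ml-infer")

def emit_inference_log(request_id, trace_id, latency_ms, status_code, **extra):
    # Assemble the same core fields as the example log; extra carries
    # optional ML fields such as model_name, model_version, or prediction.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "INFO" if status_code < 500 else "ERROR",
        "service": "ml-infer",
        "request_id": request_id,
        "trace_id": trace_id,
        "latency_ms": latency_ms,
        "status_code": status_code,
        **extra,
    }
    logger.info(json.dumps(record))

emit_inference_log("98f7-21ab", "4fdc1a3c9d8be321", 142, 200,
                   model_name="recommender", model_version="v23", prediction=0.87)
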
Useful metric names and types (declared in the Python sketch after this list)
  • inference_requests_total (counter)
  • inference_latency_seconds_bucket (histogram for p95/p99)
  • gpu_utilization_ratio (gauge)
  • feature_missing_total (counter)
  • drift_psi (gauge) — Population Stability Index
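
If your stack is Prometheus-based, the metrics above could be declared with the prometheus_client Python library roughly as follows. This is a sketch: the label sets, port, and bucket boundaries are illustrative, and the label values are kept coarse to avoid the high-cardinality trap discussed later.

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# The client appends "_total", so this is exposed as inference_requests_total.
INFERENCE_REQUESTS = Counter(
    "inference_requests", "Total inference requests",
    ["model_version", "status_code"])

# Exposes inference_latency_seconds_bucket/_sum/_count; Prometheus can
# compute p95/p99 from the buckets with histogram_quantile().
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency in seconds",
    buckets=(0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0))

GPU_UTILIZATION = Gauge("gpu_utilization_ratio", "GPU utilization, 0 to 1")
FEATURE_MISSING = Counter("feature_missing", "Requests with a missing feature", ["feature"])
DRIFT_PSI = Gauge("drift_psi", "Population Stability Index per feature", ["feature"])

# Serve /metrics and record one request.
start_http_server(8000)
INFERENCE_REQUESTS.labels(model_version="v23", status_code="200").inc()
INFERENCE_LATENCY.observe(0.142)
GPU_UTILIZATION.set(0.25)
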
Trace context basics (sketched in Python after this list)
  • Propagate a correlation header like "traceparent" between services.
  • Span naming: service.operation, e.g., ml-infer.encode_features, ml-infer.predict, ml-infer.postprocess.
  • Attach attributes: model_name, model_version, cache_hit, batch_id.
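
With OpenTelemetry's Python API, the span names and attributes above might look like the sketch below. It assumes a tracer provider and exporter are configured at startup elsewhere; encode_features and model.predict are placeholders.

from opentelemetry import trace

# Assumes an SDK TracerProvider and an exporter were configured at startup.
# OpenTelemetry HTTP instrumentation typically propagates the traceparent
# header between services for you.
tracer = trace.get_tracer("ml-infer")

def handle_request(features, model, model_version="v23"):
    # Parent span for the request; child spans mark the slow hops.
    with tracer.start_as_current_span("ml-infer.predict_request") as span:
        span.set_attribute("model_name", "recommender")
        span.set_attribute("model_version", model_version)

        with tracer.start_as_current_span("ml-infer.encode_features") as enc:
            encoded, cache_hit = encode_features(features)  # placeholder
            enc.set_attribute("cache_hit", cache_hit)

        with tracer.start_as_current_span("ml-infer.predict"):
            return model.predict(encoded)  # placeholder
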
SLI/SLO patterns (a small evaluation sketch follows this list)
  • Latency SLI: p95_inference_latency_seconds
  • Error rate SLI: 5xx / all requests
  • Availability SLI: successful requests / all requests
  • Example SLOs: p95 <= 200ms; error_rate < 1% during business hours
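
To make the formulas concrete, here is a small Python sketch that evaluates these SLIs against the example SLO targets from aggregated counts over one window. The numbers are placeholders; in practice the inputs would come from your metrics backend.

# Aggregated counts for one evaluation window (placeholder numbers).
total_requests = 120_000
error_5xx = 540
successful_requests = total_requests - error_5xx
p95_latency_seconds = 0.18  # taken from the latency histogram, never from averages

error_rate = error_5xx / total_requests              # error rate SLI
availability = successful_requests / total_requests  # availability SLI

# Example SLO targets from the list above.
slo_met = (p95_latency_seconds <= 0.200) and (error_rate < 0.01)
print(f"error_rate={error_rate:.3%} availability={availability:.3%} slo_met={slo_met}")
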
Alerting sanity rules (a multi-window sketch follows this list)
  • Page on symptoms (latency, error rate); open tickets for causes (e.g., disk nearly full).
  • Use a short and a long window to avoid flapping (e.g., check both 5m and 1h).
  • Give every alert a runbook link, an owner, and a severity.
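
A minimal sketch of the short-plus-long-window idea, assuming you can read both error rates from your metrics backend: page only when both windows breach the threshold, so a brief blip does not wake anyone. The window lengths and threshold are illustrative.

def should_page(error_rate_5m: float, error_rate_1h: float,
                threshold: float = 0.01) -> bool:
    """Page only if both the short and the long window look unhealthy.

    The short window makes the alert react quickly; the long window
    keeps a momentary spike from paging anyone.
    """
    return error_rate_5m > threshold and error_rate_1h > threshold

print(should_page(error_rate_5m=0.05, error_rate_1h=0.004))  # brief blip -> False
print(should_page(error_rate_5m=0.05, error_rate_1h=0.020))  # sustained breach -> True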

Worked examples

Example 1 — Latency spike during traffic surge
  1. Observe: p95_inference_latency_seconds climbs from 0.12s to 0.35s; error rate steady.
  2. Trace: p95 dominated by ml-infer.encode_features span.
  3. Logs: structured logs show feature store cache_miss=true for many requests; region=us-east.
  4. Hypothesis: cache capacity and warmup insufficient.
  5. Fix: double cache size in us-east; pre-warm top features.
  6. Verify: latency returns to 0.13s; add alert on cache_miss_ratio > 0.3.
Example 2 — Silent model degradation after new release
  1. Metrics: an accuracy proxy (accept_rate or a calibration metric) drifts 10% week-over-week; latency is unchanged.
  2. Logs: feature_missing_total increases for session_length.
  3. Trace attributes: model_version changed to v23.1 at the same time.
  4. Root cause: a pipeline change dropped session_length for a subset of regions.
  5. Fix: restore the feature; add a pre-deploy check on the feature_presence_ratio metric by region (see the sketch below).
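
A pre-deploy check along these lines could compute feature presence per region on a sample of recent requests, as in the sketch below. The sample data and the 0.99 threshold are made up for illustration.

from collections import defaultdict

def feature_presence_ratio(records, feature):
    """Share of records per region in which `feature` is present and non-null."""
    present, total = defaultdict(int), defaultdict(int)
    for record in records:
        region = record.get("user_region", "unknown")
        total[region] += 1
        if record.get(feature) is not None:
            present[region] += 1
    return {region: present[region] / total[region] for region in total}

sample = [
    {"user_region": "eu-west", "session_length": 12.5},
    {"user_region": "eu-west", "session_length": 3.0},
    {"user_region": "us-east", "session_length": None},  # dropped by the bad pipeline
    {"user_region": "us-east", "session_length": None},
]
ratios = feature_presence_ratio(sample, "session_length")
print(ratios)                                         # {'eu-west': 1.0, 'us-east': 0.0}
print({r: v for r, v in ratios.items() if v < 0.99})  # regions that should block the deploy
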
Example 3 — Cost spike from autoscaling
  1. Metrics: gpu_utilization_ratio averages 25%, yet replicas have doubled; request volume is steady.
  2. Trace: ml-infer.predict shows many short spans; logs show batch_size=1 on the offline path due to a misconfiguration.
  3. Action: set batch_size=8 for batchable endpoints; add an alert when gpu_utilization_ratio < 0.3 for 30m while replicas >= N.

Hands-on exercises

Do these now. They mirror the practice exercises listed below; use the checklists to verify your work.

Exercise ex1 — Design SLIs/SLOs and alerts for an ML API

  1. Pick 3 SLIs: latency, error rate, availability.
  2. Set SLO targets (realistic for your use case).
  3. Draft alert rules for p95 latency and error rate with severity levels.
Checklist
  • SLIs have clear definitions and units
  • SLO targets are measurable and time-bound
  • Alerts page on user-impacting symptoms
  • Each alert includes runbook and owner

Exercise ex2 — Create a structured log schema with correlation

  1. List required fields (service, request_id, trace_id, model_version, latency_ms, status_code).
  2. Add optional ML fields (feature flags, drift, batch_id).
  3. Produce one example success log and one error log.
Checklist
  • Log lines are valid JSON
  • No personal data in fields
  • trace_id/request_id present
  • Timestamps in ISO 8601

Common mistakes and self-check

  • High-cardinality labels (e.g., user_id) explode costs and slow queries. Self-check: ensure labels are coarse (region, model_version), not unique per user.
  • No correlation IDs. Self-check: every request log includes request_id and trace_id.
  • Unstructured logs. Self-check: JSON with consistent keys and types.
  • Noisy alerts (false positives). Self-check: include time windows and thresholds tied to user impact; use multiple windows for burn-rate where applicable.
  • Unlimited retention. Self-check: set retention by data type (e.g., metrics 15–30 days, logs 7–14 days, traces sampled).
  • Percentiles from averages. Self-check: use histograms for p95/p99; avoid averaging percentiles.

Mini challenge

Scenario: Your recommendation API sees occasional timeouts from one region. In 10 minutes, outline what to check and in what order using metrics, logs, and traces. Keep it to 6 bullet points. Bonus: add one alert you would create to catch this earlier.

Hint
  • Start with service-level p95 latency by region
  • Trace a few slow requests and compare span timings
  • Look for cache_miss_ratio or upstream timeouts

Who this is for

  • Machine Learning Engineers deploying models to cloud services
  • Data/Platform Engineers supporting ML inference pipelines
  • SREs collaborating with ML teams

Prerequisites

  • Basic cloud deployment knowledge (containers or functions)
  • Familiarity with HTTP APIs and status codes
  • Basic understanding of time-series metrics and JSON

Learning path

  1. This subskill: Observability Stack Basics
  2. Next: Deployment and CI/CD for ML services
  3. Then: Model monitoring (drift, data quality) and A/B rollout strategies
  4. Later: Cost optimization and capacity planning

Practical projects

  • Instrument a toy ML inference service with metrics (requests, latency) and add one SLO dashboard.
  • Create a structured logging middleware that injects trace_id and model_version.
  • Build a demo trace with three spans: feature_encode, predict, postprocess; include attributes for cache_hit and model_version.

Next steps

  • Implement at least one SLO in your current service and wire an alert with a runbook.
  • Add correlation IDs across your gateway and inference service.
  • Review metric labels and remove high-cardinality offenders.


Practice Exercises

2 exercises to complete

Instructions

Pick a single ML API (e.g., recommendations). Define 3 SLIs and set realistic SLO targets. Draft 2 alert rules (latency and error rate) with severities and notes for the runbook owner.

  1. Define SLIs with exact formulas and units.
  2. Choose SLO targets for business hours.
  3. Write alert expressions in pseudo-PromQL (or plain language if you prefer) and define a 5–10 minute evaluation window.
Expected Output
A short spec including 3 SLI definitions, 2 SLO targets, and 2 alert rules with thresholds, window, severity, and runbook note.

Observability Stack Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

