Why this matters
As a Machine Learning Engineer, you ship models that must be fast, reliable, and cost-effective in the cloud. Observability lets you answer: Is my service healthy? Why is it slow? Is the model drifting? Without it, debugging becomes guesswork.
- Real tasks you will face: define SLIs/SLOs for inference latency and error rate; investigate spikes in 5xx errors; track GPU/CPU saturation; correlate a bad release with a performance drop; prove that a fix worked.
- Outcome: faster incident resolution, safer deploys, data-driven capacity planning, and trustworthy ML systems.
Concept explained simply
Observability is your system's "flight instruments." It uses telemetry to show what is happening and why.
- Metrics: numeric time series (e.g., requests per second, 95th percentile latency). Great for trends, SLOs, and alerts.
- Logs: structured records of events (e.g., inference result, error details). Great for investigation and audit.
- Traces: a request's path across services, with spans and timing. Great for finding slow hops and bottlenecks.
- Dashboards: curated views of key metrics for humans.
- Alerting: rules that notify on user-facing symptoms (latency, error rate) rather than underlying causes.
Mental model
Think of three pillars supporting a single story about a request: metrics show the crowd behavior, logs tell individual stories, and traces connect stories into a journey. Correlate everything with IDs (trace_id/request_id) so you can jump between them during an incident.
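As a sketch of that correlation in practice, here is a minimal Python logging setup (the formatter and field names are illustrative assumptions, not a prescribed implementation) that stamps request_id and trace_id onto every log line:

import json
import logging
import uuid
from datetime import datetime, timezone

# Minimal JSON formatter: every record carries the correlation IDs,
# so you can pivot from a log line to its trace during an incident.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "ml-infer",
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("ml-infer")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In a real service the IDs come from incoming headers; here they are faked.
logger.info("prediction served",
            extra={"request_id": str(uuid.uuid4()), "trace_id": "4fdc1a3c9d8be321"})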
Core building blocks (ML-flavored)
Structured log example (inference)
{
"timestamp": "2026-01-01T12:00:00Z",
"level": "INFO",
"service": "ml-infer",
"model_name": "recommender",
"model_version": "v23",
"request_id": "98f7-21ab",
"trace_id": "4fdc1a3c9d8be321",
"latency_ms": 142,
"status_code": 200,
"user_region": "eu-west",
"input_schema_version": "1.3",
"features_used": ["age", "country", "session_length"],
"prediction": 0.87
}
Useful metric names and types
- inference_requests_total (counter)
- inference_latency_seconds (histogram; its _bucket series feed p95/p99 queries)
- gpu_utilization_ratio (gauge)
- feature_missing_total (counter)
- drift_psi (gauge) — Population Stability Index
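A minimal sketch of how these metrics could be declared with the prometheus_client Python library (label names, bucket boundaries, and the sample values are illustrative assumptions):

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counters only go up; good for request and error totals.
INFERENCE_REQUESTS = Counter(
    "inference_requests_total", "Total inference requests",
    ["model_name", "model_version", "status_code"])

# Histograms record observations into buckets, which is what enables p95/p99 later.
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency in seconds",
    ["model_name"], buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0))

# Gauges can go up and down; good for utilization and drift scores.
GPU_UTILIZATION = Gauge("gpu_utilization_ratio", "GPU utilization, 0 to 1")
DRIFT_PSI = Gauge("drift_psi", "Population Stability Index per feature", ["feature"])

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for scraping
    INFERENCE_REQUESTS.labels("recommender", "v23", "200").inc()
    INFERENCE_LATENCY.labels("recommender").observe(0.142)
    GPU_UTILIZATION.set(0.25)
    DRIFT_PSI.labels("session_length").set(0.08)

The histogram automatically exposes the _bucket series, so percentiles are computed from bucket counts rather than from averages, which matters for the self-check list further down.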
Trace context basics
- Propagate the W3C Trace Context header "traceparent" between services.
- Span naming: service.operation, e.g., ml-infer.encode_features, ml-infer.predict, ml-infer.postprocess.
- Attach attributes: model_name, model_version, cache_hit, batch_id.
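A minimal OpenTelemetry sketch of the span naming and attributes above (the console exporter, the stub logic, and the parent span name ml-infer.handle_request are assumptions for local experimentation; a real service would configure an OTLP exporter and let its HTTP instrumentation propagate traceparent):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console export is only for local experimentation.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ml-infer")

def handle_request(payload):
    # Hypothetical parent span; the three stage spans follow the naming above.
    with tracer.start_as_current_span("ml-infer.handle_request") as root:
        root.set_attribute("model_name", "recommender")
        root.set_attribute("model_version", "v23")
        with tracer.start_as_current_span("ml-infer.encode_features") as span:
            span.set_attribute("cache_hit", False)
            features = [len(str(payload))]        # stand-in for real feature encoding
        with tracer.start_as_current_span("ml-infer.predict") as span:
            span.set_attribute("batch_id", "b-001")
            score = 0.87                          # stand-in for the model call
        with tracer.start_as_current_span("ml-infer.postprocess"):
            return {"prediction": score, "n_features": len(features)}

handle_request({"user_region": "eu-west"})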
SLI/SLO patterns
- Latency SLI: p95_inference_latency_seconds
- Error rate SLI: 5xx / all requests
- Availability SLI: successful requests / all requests
- Example SLOs: p95 <= 200ms; error_rate < 1% during business hours
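To make the arithmetic concrete, here is a small Python sketch that computes these SLIs from a batch of request records (the sample data is made up, the targets mirror the example SLOs above, and "successful" is taken to mean non-5xx):

# Each record: (latency_seconds, http_status). Made-up sample data.
requests = [(0.12, 200)] * 950 + [(0.28, 200)] * 45 + [(0.09, 500)] * 5

latencies = sorted(lat for lat, _ in requests)
p95_latency = latencies[int(0.95 * len(latencies)) - 1]   # simple empirical p95

total = len(requests)
errors = sum(1 for _, status in requests if status >= 500)
error_rate = errors / total
availability = 1 - error_rate   # successful (non-5xx) requests / all requests

print(f"p95 latency:  {p95_latency:.3f}s -> {'OK' if p95_latency <= 0.200 else 'SLO breach'}")
print(f"error rate:   {error_rate:.2%} -> {'OK' if error_rate < 0.01 else 'SLO breach'}")
print(f"availability: {availability:.2%}")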
Alerting sanity rules
- Page on symptoms (latency, error rate), ticket on causes (disk near-full).
- Evaluate both a short and a long window to avoid flapping (e.g., require the 5m and 1h checks to agree).
- Add runbook, owner, and severity.
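A toy Python sketch of the short-plus-long-window idea (the class, thresholds, and window lengths are illustrative; in practice this logic usually lives in your alerting system as multi-window rules):

from collections import deque
import time

class MultiWindowAlert:
    # Fire only when BOTH a short and a long window breach the threshold.
    # The short window reacts quickly; requiring the long window as well
    # filters out brief blips so the alert does not flap.
    def __init__(self, threshold, short_s=300, long_s=3600):
        self.threshold = threshold
        self.short_s = short_s
        self.long_s = long_s
        self.samples = deque()  # (timestamp, value) pairs, e.g. per-minute error rates

    def record(self, value, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, value))
        # Drop samples that fell out of the long window.
        while self.samples and self.samples[0][0] < now - self.long_s:
            self.samples.popleft()

    def _avg(self, window_s, now):
        vals = [v for t, v in self.samples if t >= now - window_s]
        return sum(vals) / len(vals) if vals else 0.0

    def should_page(self, now=None):
        now = time.time() if now is None else now
        return (self._avg(self.short_s, now) > self.threshold
                and self._avg(self.long_s, now) > self.threshold)

# Example: one brief spike does not page, because the 1h view stays healthy.
alert = MultiWindowAlert(threshold=0.01)
for minute in range(60):
    alert.record(0.001, now=minute * 60)   # an hour of healthy 0.1% error rate
alert.record(0.10, now=3600)               # a single 10% spike
print(alert.should_page(now=3600))         # False: 5m window breached, 1h window did not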
Worked examples
Example 1 — Latency spike during traffic surge
- Observe: p95_inference_latency_seconds climbs from 0.12s to 0.35s; error rate steady.
- Trace: p95 dominated by ml-infer.encode_features span.
- Logs: structured logs show feature store cache_miss=true for many requests; region=us-east.
- Hypothesis: cache capacity and warmup insufficient.
- Fix: double cache size in us-east; pre-warm top features.
- Verify: latency returns to 0.13s; add an alert on cache_miss_ratio > 0.3.
Example 2 — Silent model degradation after new release
- Metrics: an accuracy proxy (e.g., accept rate or a calibration metric) drifts 10% week-over-week; latency unchanged.
- Logs: increase in feature_missing_total for session_length.
- Trace attributes: model_version changed to v23.1 at same time.
- Root cause: pipeline change dropped session_length for a subset of regions.
- Fix: restore feature; add pre-deploy check metric feature_presence_ratio by region.
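A tiny sketch of what that pre-deploy check might compute (the row format and the 0.99 threshold are assumptions):

def feature_presence_ratio(rows, feature, region):
    # Share of a region's rows where the feature is present (non-null).
    in_region = [r for r in rows if r.get("user_region") == region]
    if not in_region:
        return 1.0
    present = sum(1 for r in in_region if r.get(feature) is not None)
    return present / len(in_region)

# Made-up sample rows; a real check would run over a pre-deploy validation batch.
rows = [
    {"user_region": "eu-west", "session_length": 12.5},
    {"user_region": "us-east", "session_length": None},
    {"user_region": "us-east", "session_length": 8.0},
]
for region in ("eu-west", "us-east"):
    ratio = feature_presence_ratio(rows, "session_length", region)
    status = "OK" if ratio >= 0.99 else "BLOCK DEPLOY"
    print(f"{region}: session_length presence {ratio:.0%} -> {status}")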
Example 3 — Cost spike from autoscaling
- Metrics: gpu_utilization_ratio averages 25%, but replicas have doubled; request volume is steady.
- Trace: ml-infer.predict shows many short spans; logs show batch_size=1 on the offline path due to a misconfiguration.
- Action: set batch_size=8 for batchable endpoints; add an alert for gpu_utilization_ratio < 0.3 over 30m while replicas >= N (see the predicate sketch below).
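The closing alert condition, written out as a simple Python predicate (the replica threshold and the way the sustained window is passed in are assumptions):

def gpu_waste_alert(gpu_utilization_ratio, replica_count, low_util_minutes,
                    min_replicas=4, util_threshold=0.3, sustained_minutes=30):
    # Flag sustained low GPU utilization while the deployment is scaled out.
    # Catches the "many idle replicas" cost pattern: utilization stays under
    # the threshold for a sustained period even though replicas >= min_replicas.
    return (gpu_utilization_ratio < util_threshold
            and replica_count >= min_replicas
            and low_util_minutes >= sustained_minutes)

# Example: 25% utilization for 45 minutes on 8 replicas -> alert fires.
print(gpu_waste_alert(0.25, 8, 45))  # True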
Hands-on exercises
Do these now. They mirror the exercises listed below and include solutions.
Exercise ex1 — Design SLIs/SLOs and alerts for an ML API
- Pick 3 SLIs: latency, error rate, availability.
- Set SLO targets (realistic for your use case).
- Draft alert rules for p95 latency and error rate with severity levels (see the sketch after the checklist).
Checklist
- SLIs have clear definitions and units
- SLO targets are measurable and time-bound
- Alerts page on user-impacting symptoms
- Each alert includes runbook and owner
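One way the exercise's alert rules could be sketched as data (the names, expressions, owners, and runbook URLs are placeholders to adapt to your own stack):

# Placeholder alert definitions; adapt thresholds, owners, and runbook links.
ALERT_RULES = [
    {
        "name": "HighInferenceLatencyP95",
        "expr": "p95(inference_latency_seconds) > 0.200 over 5m and 1h",
        "severity": "page",
        "owner": "ml-platform-oncall",
        "runbook": "https://example.internal/runbooks/inference-latency",
    },
    {
        "name": "HighErrorRate",
        "expr": "5xx / all_requests > 0.01 over 5m and 1h",
        "severity": "page",
        "owner": "ml-platform-oncall",
        "runbook": "https://example.internal/runbooks/error-rate",
    },
    {
        "name": "ErrorBudgetBurnSlow",
        "expr": "availability below SLO target over 24h",
        "severity": "ticket",
        "owner": "ml-platform-team",
        "runbook": "https://example.internal/runbooks/error-budget",
    },
]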
Exercise ex2 — Create a structured log schema with correlation
- List required fields (service, request_id, trace_id, model_version, latency_ms, status_code).
- Add optional ML fields (feature flags, drift, batch_id).
- Produce one example success log and one error log (a sketch follows the checklist).
Checklist
- Log lines are valid JSON
- No personal data in fields
- trace_id/request_id present
- Timestamps in ISO 8601
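A possible shape for the two example log lines, emitted from Python (the field values are illustrative and error_type is an assumed optional field):

import json

# Illustrative example log lines for exercise ex2: one success, one error.
success_log = {
    "timestamp": "2026-01-01T12:00:00Z",
    "level": "INFO",
    "service": "ml-infer",
    "request_id": "98f7-21ab",
    "trace_id": "4fdc1a3c9d8be321",
    "model_name": "recommender",
    "model_version": "v23",
    "latency_ms": 142,
    "status_code": 200,
}
error_log = {
    "timestamp": "2026-01-01T12:00:01Z",
    "level": "ERROR",
    "service": "ml-infer",
    "request_id": "98f7-21ac",
    "trace_id": "5aee2b4d0c9fa432",
    "model_name": "recommender",
    "model_version": "v23",
    "latency_ms": 503,
    "status_code": 500,
    "error_type": "FeatureStoreTimeout",   # assumed optional field
}
print(json.dumps(success_log))
print(json.dumps(error_log))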
Common mistakes and self-check
- High-cardinality labels (e.g., user_id) explode costs and slow queries. Self-check: ensure labels are coarse (region, model_version), not unique per user.
- No correlation IDs. Self-check: every request log includes request_id and trace_id.
- Unstructured logs. Self-check: JSON with consistent keys and types.
- Noisy alerts (false positives). Self-check: include time windows and thresholds tied to user impact; use multiple windows for burn-rate where applicable.
- Unlimited retention. Self-check: set retention by data type (e.g., metrics 15–30 days, logs 7–14 days, traces sampled).
- Percentiles from averages. Self-check: use histograms for p95/p99; avoid averaging percentiles.
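A quick numerical illustration of the last point, using made-up latency samples from two replicas; averaging their p95s understates what users actually experience:

def p95(samples):
    s = sorted(samples)
    return s[int(0.95 * len(s)) - 1]   # simple empirical percentile

replica_a = [0.10] * 95 + [0.90] * 5    # mostly fast replica
replica_b = [0.10] * 50 + [0.90] * 50   # half its requests are slow

avg_of_p95s = (p95(replica_a) + p95(replica_b)) / 2   # 0.50s: looks acceptable
true_p95 = p95(replica_a + replica_b)                 # 0.90s: what users feel

print(f"average of per-replica p95s: {avg_of_p95s:.2f}s")
print(f"p95 over all requests:       {true_p95:.2f}s")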
Mini challenge
Scenario: Your recommendation API sees occasional timeouts from one region. In 10 minutes, outline what to check and in what order using metrics, logs, and traces. Keep it to 6 bullet points. Bonus: add one alert you would create to catch this earlier.
Hint
- Start with service-level p95 latency by region
- Trace a few slow requests and compare span timings
- Look for cache_miss_ratio or upstream timeouts
Who this is for
- Machine Learning Engineers deploying models to cloud services
- Data/Platform Engineers supporting ML inference pipelines
- SREs collaborating with ML teams
Prerequisites
- Basic cloud deployment knowledge (containers or functions)
- Familiarity with HTTP APIs and status codes
- Basic understanding of time-series metrics and JSON
Learning path
- This subskill: Observability Stack Basics
- Next: Deployment and CI/CD for ML services
- Then: Model monitoring (drift, data quality) and A/B rollout strategies
- Later: Cost optimization and capacity planning
Practical projects
- Instrument a toy ML inference service with metrics (requests, latency) and add one SLO dashboard.
- Create a structured logging middleware that injects trace_id and model_version.
- Build a demo trace with three spans: feature_encode, predict, postprocess; include attributes for cache_hit and model_version.
Next steps
- Implement at least one SLO in your current service and wire an alert with a runbook.
- Add correlation IDs across your gateway and inference service.
- Review metric labels and remove high-cardinality offenders.
Saving your progress
The quick test below is available to everyone. If you log in, your answers and progress will be saved.