Why this matters
As a Machine Learning Engineer, you ship models that must be fast, reliable, and cost-effective in the cloud. Observability lets you answer: Is my service healthy? Why is it slow? Is the model drifting? Without it, debugging becomes guesswork.
- Real tasks you will face: define SLIs/SLOs for inference latency and error rate; investigate spikes in 5xx errors; track GPU/CPU saturation; correlate a bad release with a performance drop; prove that a fix worked.
- Outcome: faster incident resolution, safer deploys, data-driven capacity planning, and trustworthy ML systems.
Concept explained simply
Observability is your system's "flight instruments." It uses telemetry to show what is happening and why.
- Metrics: numeric time series (e.g., requests per second, 95th percentile latency). Great for trends, SLOs, and alerts.
- Logs: structured records of events (e.g., inference result, error details). Great for investigation and audit.
- Traces: a request's path across services, with spans and timing. Great for finding slow hops and bottlenecks.
- Dashboards: curated views of key metrics for humans.
- Alerting: rules that notify on user-facing symptoms (latency, error rate) rather than underlying causes.
Mental model
Think of three pillars supporting a single story about a request: metrics show the crowd behavior, logs tell individual stories, and traces connect stories into a journey. Correlate everything with IDs (trace_id/request_id) so you can jump between them during an incident.
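As a sketch of that correlation in practice, here is a minimal Python logging setup (the formatter and field names are illustrative assumptions, not a prescribed implementation) that stamps request_id and trace_id onto every log line:

import json
import logging
import uuid
from datetime import datetime, timezone

# Minimal JSON formatter: every record carries the correlation IDs,
# so you can pivot from a log line to its trace during an incident.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "ml-infer",
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("ml-infer")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In a real service the IDs come from incoming headers; here they are faked.
logger.info("prediction served",
            extra={"request_id": str(uuid.uuid4()), "trace_id": "4fdc1a3c9d8be321"})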
Core building blocks (ML-flavored)
Structured log example (inference)
{
"timestamp": "2026-01-01T12:00:00Z",
"level": "INFO",
"service": "ml-infer",
"model_name": "recommender",
"model_version": "v23",
"request_id": "98f7-21ab",
"trace_id": "4fdc1a3c9d8be321",
"latency_ms": 142,
"status_code": 200,
"user_region": "eu-west",
"input_schema_version": "1.3",
"features_used": ["age", "country", "session_length"],
"prediction": 0.87
}
Useful metric names and types
- inference_requests_total (counter)
- inference_latency_seconds (histogram; its _bucket series feed p95/p99 queries)
- gpu_utilization_ratio (gauge)
- feature_missing_total (counter)
- drift_psi (gauge) — Population Stability Index
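A minimal sketch of how these metrics could be declared with the prometheus_client Python library (label names, bucket boundaries, and the sample values are illustrative assumptions):

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counters only go up; good for request and error totals.
INFERENCE_REQUESTS = Counter(
    "inference_requests_total", "Total inference requests",
    ["model_name", "model_version", "status_code"])

# Histograms record observations into buckets, which is what enables p95/p99 later.
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency in seconds",
    ["model_name"], buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0))

# Gauges can go up and down; good for utilization and drift scores.
GPU_UTILIZATION = Gauge("gpu_utilization_ratio", "GPU utilization, 0 to 1")
DRIFT_PSI = Gauge("drift_psi", "Population Stability Index per feature", ["feature"])

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for scraping
    INFERENCE_REQUESTS.labels("recommender", "v23", "200").inc()
    INFERENCE_LATENCY.labels("recommender").observe(0.142)
    GPU_UTILIZATION.set(0.25)
    DRIFT_PSI.labels("session_length").set(0.08)

The histogram automatically exposes the _bucket series, so percentiles are computed from bucket counts rather than from averages, which matters for the self-check list further down.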
Trace context basics
- Propagate the W3C Trace Context header "traceparent" between services.
- Span naming: service.operation, e.g., ml-infer.encode_features, ml-infer.predict, ml-infer.postprocess.
- Attach attributes: model_name, model_version, cache_hit, batch_id.
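A minimal OpenTelemetry sketch of the span naming and attributes above (the console exporter, the stub logic, and the parent span name ml-infer.handle_request are assumptions for local experimentation; a real service would configure an OTLP exporter and let its HTTP instrumentation propagate traceparent):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console export is only for local experimentation.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ml-infer")

def handle_request(payload):
    # Hypothetical parent span; the three stage spans follow the naming above.
    with tracer.start_as_current_span("ml-infer.handle_request") as root:
        root.set_attribute("model_name", "recommender")
        root.set_attribute("model_version", "v23")
        with tracer.start_as_current_span("ml-infer.encode_features") as span:
            span.set_attribute("cache_hit", False)
            features = [len(str(payload))]        # stand-in for real feature encoding
        with tracer.start_as_current_span("ml-infer.predict") as span:
            span.set_attribute("batch_id", "b-001")
            score = 0.87                          # stand-in for the model call
        with tracer.start_as_current_span("ml-infer.postprocess"):
            return {"prediction": score, "n_features": len(features)}

handle_request({"user_region": "eu-west"})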
SLI/SLO patterns
- Latency SLI: p95_inference_latency_seconds
- Error rate SLI: 5xx / all requests
- Availability SLI: successful requests / all requests
- Example SLOs: p95 <= 200ms; error_rate < 1% during business hours
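To make the arithmetic concrete, here is a small Python sketch that computes these SLIs from a batch of request records (the sample data is made up, the targets mirror the example SLOs above, and "successful" is taken to mean non-5xx):

# Each record: (latency_seconds, http_status). Made-up sample data.
requests = [(0.12, 200)] * 950 + [(0.28, 200)] * 45 + [(0.09, 500)] * 5

latencies = sorted(lat for lat, _ in requests)
p95_latency = latencies[int(0.95 * len(latencies)) - 1]   # simple empirical p95

total = len(requests)
errors = sum(1 for _, status in requests if status >= 500)
error_rate = errors / total
availability = 1 - error_rate   # successful (non-5xx) requests / all requests

print(f"p95 latency:  {p95_latency:.3f}s -> {'OK' if p95_latency <= 0.200 else 'SLO breach'}")
print(f"error rate:   {error_rate:.2%} -> {'OK' if error_rate < 0.01 else 'SLO breach'}")
print(f"availability: {availability:.2%}")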
Alerting sanity rules
- Page on symptoms (latency, error rate), ticket on causes (disk near-full).
- Evaluate both a short and a long window to avoid flapping (e.g., require the 5m and 1h checks to agree).
- Add runbook, owner, and severity.
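A toy Python sketch of the short-plus-long-window idea (the class, thresholds, and window lengths are illustrative; in practice this logic usually lives in your alerting system as multi-window rules):

from collections import deque
import time

class MultiWindowAlert:
    # Fire only when BOTH a short and a long window breach the threshold.
    # The short window reacts quickly; requiring the long window as well
    # filters out brief blips so the alert does not flap.
    def __init__(self, threshold, short_s=300, long_s=3600):
        self.threshold = threshold
        self.short_s = short_s
        self.long_s = long_s
        self.samples = deque()  # (timestamp, value) pairs, e.g. per-minute error rates

    def record(self, value, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, value))
        # Drop samples that fell out of the long window.
        while self.samples and self.samples[0][0] < now - self.long_s:
            self.samples.popleft()

    def _avg(self, window_s, now):
        vals = [v for t, v in self.samples if t >= now - window_s]
        return sum(vals) / len(vals) if vals else 0.0

    def should_page(self, now=None):
        now = time.time() if now is None else now
        return (self._avg(self.short_s, now) > self.threshold
                and self._avg(self.long_s, now) > self.threshold)

# Example: one brief spike does not page, because the 1h view stays healthy.
alert = MultiWindowAlert(threshold=0.01)
for minute in range(60):
    alert.record(0.001, now=minute * 60)   # an hour of healthy 0.1% error rate
alert.record(0.10, now=3600)               # a single 10% spike
print(alert.should_page(now=3600))         # False: 5m window breached, 1h window did not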
Worked examples
Example 1 — Latency spike during traffic surge
- Observe: p95_inference_latency_seconds climbs from 0.12s to 0.35s; error rate steady.
- Trace: p95 dominated by ml-infer.encode_features span.
- Logs: structured logs show feature store cache_miss=true for many requests; region=us-east.
- Hypothesis: cache capacity and warmup insufficient.
- Fix: double cache size in us-east; pre-warm top features.
- Verify: latency returns to 0.13s; add an alert on cache_miss_ratio > 0.3.
Example 2 — Silent model degradation after new release
- Metrics: an accuracy proxy (e.g., accept rate or a calibration metric) drifts 10% week-over-week; latency unchanged.
- Logs: increase in feature_missing_total for session_length.
- Trace attributes: model_version changed to v23.1 at same time.
- Root cause: pipeline change dropped session_length for a subset of regions.
- Fix: restore feature; add pre-deploy check metric feature_presence_ratio by region.
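A tiny sketch of what that pre-deploy check might compute (the row format and the 0.99 threshold are assumptions):

def feature_presence_ratio(rows, feature, region):
    # Share of a region's rows where the feature is present (non-null).
    in_region = [r for r in rows if r.get("user_region") == region]
    if not in_region:
        return 1.0
    present = sum(1 for r in in_region if r.get(feature) is not None)
    return present / len(in_region)

# Made-up sample rows; a real check would run over a pre-deploy validation batch.
rows = [
    {"user_region": "eu-west", "session_length": 12.5},
    {"user_region": "us-east", "session_length": None},
    {"user_region": "us-east", "session_length": 8.0},
]
for region in ("eu-west", "us-east"):
    ratio = feature_presence_ratio(rows, "session_length", region)
    status = "OK" if ratio >= 0.99 else "BLOCK DEPLOY"
    print(f"{region}: session_length presence {ratio:.0%} -> {status}")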
Example 3 — Cost spike from autoscaling
- Metrics: gpu_utilization_ratio averages 25%, but replicas have doubled; request volume is steady.
- Trace: ml-infer.predict shows many short spans; logs show batch_size=1 on the offline path due to a misconfiguration.
- Action: set batch_size=8 for batchable endpoints; add an alert for gpu_utilization_ratio < 0.3 over 30m while replicas >= N (see the predicate sketch below).
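The closing alert condition, written out as a simple Python predicate (the replica threshold and the way the sustained window is passed in are assumptions):

def gpu_waste_alert(gpu_utilization_ratio, replica_count, low_util_minutes,
                    min_replicas=4, util_threshold=0.3, sustained_minutes=30):
    # Flag sustained low GPU utilization while the deployment is scaled out.
    # Catches the "many idle replicas" cost pattern: utilization stays under
    # the threshold for a sustained period even though replicas >= min_replicas.
    return (gpu_utilization_ratio < util_threshold
            and replica_count >= min_replicas
            and low_util_minutes >= sustained_minutes)

# Example: 25% utilization for 45 minutes on 8 replicas -> alert fires.
print(gpu_waste_alert(0.25, 8, 45))  # True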
Hands-on exercises
Do these now. They mirror the exercises listed below and include solutions.
Exercise ex1 — Design SLIs/SLOs and alerts for an ML API
- Pick 3 SLIs: latency, error rate, availability.
- Set SLO targets (realistic for your use case).
- Draft alert rules for p95 latency and error rate with severity levels (see the sketch after the checklist).
Checklist
- SLIs have clear definitions and units
- SLO targets are measurable and time-bound
- Alerts page on user-impacting symptoms
- Each alert includes runbook and owner
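One way the exercise's alert rules could be sketched as data (the names, expressions, owners, and runbook URLs are placeholders to adapt to your own stack):

# Placeholder alert definitions; adapt thresholds, owners, and runbook links.
ALERT_RULES = [
    {
        "name": "HighInferenceLatencyP95",
        "expr": "p95(inference_latency_seconds) > 0.200 over 5m and 1h",
        "severity": "page",
        "owner": "ml-platform-oncall",
        "runbook": "https://example.internal/runbooks/inference-latency",
    },
    {
        "name": "HighErrorRate",
        "expr": "5xx / all_requests > 0.01 over 5m and 1h",
        "severity": "page",
        "owner": "ml-platform-oncall",
        "runbook": "https://example.internal/runbooks/error-rate",
    },
    {
        "name": "ErrorBudgetBurnSlow",
        "expr": "availability below SLO target over 24h",
        "severity": "ticket",
        "owner": "ml-platform-team",
        "runbook": "https://example.internal/runbooks/error-budget",
    },
]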
Exercise ex2 — Create a structured log schema with correlation
- List required fields (service, request_id, trace_id, model_version, latency_ms, status_code).
- Add optional ML fields (feature flags, drift, batch_id).
- Produce one example success log and one error log (a sketch follows the checklist).
Checklist
- Log lines are valid JSON
- No personal data in fields
- trace_id/request_id present
- Timestamps in ISO 8601
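A possible shape for the two example log lines, emitted from Python (the field values are illustrative and error_type is an assumed optional field):

import json

# Illustrative example log lines for exercise ex2: one success, one error.
success_log = {
    "timestamp": "2026-01-01T12:00:00Z",
    "level": "INFO",
    "service": "ml-infer",
    "request_id": "98f7-21ab",
    "trace_id": "4fdc1a3c9d8be321",
    "model_name": "recommender",
    "model_version": "v23",
    "latency_ms": 142,
    "status_code": 200,
}
error_log = {
    "timestamp": "2026-01-01T12:00:01Z",
    "level": "ERROR",
    "service": "ml-infer",
    "request_id": "98f7-21ac",
    "trace_id": "5aee2b4d0c9fa432",
    "model_name": "recommender",
    "model_version": "v23",
    "latency_ms": 503,
    "status_code": 500,
    "error_type": "FeatureStoreTimeout",   # assumed optional field
}
print(json.dumps(success_log))
print(json.dumps(error_log))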
Common mistakes and self-check
- High-cardinality labels (e.g., user_id) explode costs and slow queries. Self-check: ensure labels are coarse (region, model_version), not unique per user.
- No correlation IDs. Self-check: every request log includes request_id and trace_id.
- Unstructured logs. Self-check: JSON with consistent keys and types.
- Noisy alerts (false positives). Self-check: include time windows and thresholds tied to user impact; use multiple windows for burn-rate where applicable.
- Unlimited retention. Self-check: set retention by data type (e.g., metrics 15–30 days, logs 7–14 days, traces sampled).
- Percentiles from averages. Self-check: use histograms for p95/p99; avoid averaging percentiles.
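A quick numerical illustration of the last point, using made-up latency samples from two replicas; averaging their p95s understates what users actually experience:

def p95(samples):
    s = sorted(samples)
    return s[int(0.95 * len(s)) - 1]   # simple empirical percentile

replica_a = [0.10] * 95 + [0.90] * 5    # mostly fast replica
replica_b = [0.10] * 50 + [0.90] * 50   # half its requests are slow

avg_of_p95s = (p95(replica_a) + p95(replica_b)) / 2   # 0.50s: looks acceptable
true_p95 = p95(replica_a + replica_b)                 # 0.90s: what users feel

print(f"average of per-replica p95s: {avg_of_p95s:.2f}s")
print(f"p95 over all requests:       {true_p95:.2f}s")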
Mini challenge
Scenario: Your recommendation API sees occasional timeouts from one region. In 10 minutes, outline what to check and in what order using metrics, logs, and traces. Keep it to 6 bullet points. Bonus: add one alert you would create to catch this earlier.
Hint
- Start with service-level p95 latency by region
- Trace a few slow requests and compare span timings
- Look for cache_miss_ratio or upstream timeouts
Who this is for
- Machine Learning Engineers deploying models to cloud services
- Data/Platform Engineers supporting ML inference pipelines
- SREs collaborating with ML teams
Prerequisites
- Basic cloud deployment knowledge (containers or functions)
- Familiarity with HTTP APIs and status codes
- Basic understanding of time-series metrics and JSON
Learning path
- This subskill: Observability Stack Basics
- Next: Deployment and CI/CD for ML services
- Then: Model monitoring (drift, data quality) and A/B rollout strategies
- Later: Cost optimization and capacity planning
Practical projects
- Instrument a toy ML inference service with metrics (requests, latency) and add one SLO dashboard.
- Create a structured logging middleware that injects trace_id and model_version.
- Build a demo trace with three spans: feature_encode, predict, postprocess; include attributes for cache_hit and model_version.
Next steps
- Implement at least one SLO in your current service and wire an alert with a runbook.
- Add correlation IDs across your gateway and inference service.
- Review metric labels and remove high-cardinality offenders.
Saving your progress
The quick test below is available to everyone. If you log in, your answers and progress will be saved.