
Observability And Monitoring

Learn Observability And Monitoring for MLOps Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: January 4, 2026 | Updated: January 4, 2026

Why observability and monitoring matter for MLOps Engineers

Models that perform in a notebook can fail in production for many reasons: data drift, slow dependencies, bad rollouts, and infrastructure limits. Observability and monitoring give you the signals to see, understand, and fix issues before users feel them. As an MLOps Engineer, you will instrument pipelines and services, define SLOs, create meaningful alerts, control costs, and lead incident response with data, not guesses.

What you will be able to do

  • Instrument ML services and pipelines with logs, metrics, and traces.
  • Define SLIs/SLOs and manage error budgets.
  • Detect failures, delays, and data issues quickly with actionable alerts.
  • Monitor cost and plan capacity for predictable performance.
  • Run effective incident triage and blameless postmortems.

Who this is for

  • MLOps Engineers implementing and running ML in production.
  • Data/ML Engineers who own batch or real-time ML pipelines.
  • Backend Engineers adding model inference to services.

Prerequisites

  • Comfortable with Python and basic CLI.
  • Familiar with containers and services (e.g., Docker, basic deployment flow).
  • Basic understanding of ML lifecycle (training, validation, deployment).

Learning path (practical roadmap)

Milestone 1 — Instrumentation basics (logs, metrics, traces)
  • Add structured logs with request IDs, model version, and status.
  • Expose key metrics: request count, latency, errors, and batch durations.
  • Propagate trace context across components and record spans for inference.
Milestone 2 — SLOs and error budgets
  • Define SLIs for availability, latency, freshness, and data quality.
  • Set SLO targets and compute error budgets.
  • Create burn-rate alerts for fast and slow detection.
Milestone 3 — Service and pipeline health
  • Build dashboards for real-time inference and batch pipelines.
  • Track feature freshness, queue length, retries, and drift signals.
  • Establish runbooks for top failure scenarios.
Milestone 4 — Alerting and on-call readiness
  • Write alerts that are SLO-aligned, deduplicated, and routed correctly.
  • Test alerts with synthetic failures and delayed jobs.
  • Set escalation paths and quiet hours policies.
Milestone 5 — Cost and capacity
  • Measure cost per prediction and per batch run.
  • Project capacity using traffic and latency targets.
  • Automate scaling and budget alarms.
Milestone 6 — Incidents and postmortems
  • Standardize triage steps and data you collect during incidents.
  • Run blameless postmortems with clear action items and owners.
  • Track follow-ups to completion.

Core concepts you will use

  • Logs: Structured, contextual, queryable lines for debugging. Include keys like request_id, model_version, and status (see the logging sketch after this list).
  • Metrics: Numeric time series (counters, gauges, histograms) for fast, aggregated signals.
  • Traces: End-to-end request flows showing spans across services; crucial for pinpointing latency and failures.
  • SLI/SLO/SLA: SLI is the measurement, SLO the target, and SLA the external commitment.
  • Error budget: 1 − SLO. Budgets your allowable unreliability and focuses work.
  • Burn rate: How quickly you consume error budget relative to the period.
  • Health signals: The four golden signals — latency, traffic, errors, saturation — plus ML-specific freshness and data drift.
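
Logs are the one signal above without a worked example later on this page, so here is a minimal structured-logging sketch using only the Python standard library. The field names mirror the keys listed above; the event name and values are illustrative.

import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("inference")

def log_event(event: str, **fields) -> None:
    # One JSON object per line keeps logs parseable and queryable by key in any log backend.
    logger.info(json.dumps({"event": event, **fields}))

log_event(
    "prediction_served",
    request_id=str(uuid.uuid4()),  # high-cardinality IDs belong in logs and traces, not metrics
    model_version="v1.3.0",
    status="ok",
    latency_ms=42,
)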

Worked examples (copy and adapt)

1) Python inference service — metrics for latency and errors

Expose Prometheus metrics to track throughput, errors, and latency buckets.

from time import time, sleep
from random import random
from prometheus_client import start_http_server, Counter, Histogram

REQUESTS = Counter('inference_requests_total', 'Total inference requests', ['model_version'])
ERRORS = Counter('inference_errors_total', 'Total inference errors', ['model_version', 'reason'])
LATENCY = Histogram('inference_latency_seconds', 'Inference latency', ['model_version'], buckets=[0.01,0.02,0.05,0.1,0.2,0.5,1,2,5])

MODEL_VERSION = 'v1.3.0'

def predict(payload):
    start = time()
    try:
        # simulate work
        sleep(random()*0.05)
        if random() < 0.02:
            raise ValueError('model_timeout')
        return {'ok': True}
    except Exception as e:
        ERRORS.labels(model_version=MODEL_VERSION, reason=str(e)).inc()
        raise
    finally:
        LATENCY.labels(model_version=MODEL_VERSION).observe(time() - start)
        REQUESTS.labels(model_version=MODEL_VERSION).inc()

if __name__ == '__main__':
    start_http_server(8000)  # exposes /metrics
    while True:
        try:
            predict({'x': 1})
        except Exception:
            pass
        sleep(0.01)

Dashboards: p50/p95 latency by model_version, errors_total by reason, requests_total rate, and saturation (CPU/memory) from your runtime.
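
To confirm the exposition endpoint works, a quick check might look like this, assuming the service above is running locally on port 8000:

import urllib.request

# Fetch the Prometheus exposition text and print a few latency histogram bucket lines.
body = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
buckets = [line for line in body.splitlines()
           if line.startswith("inference_latency_seconds_bucket")]
print("\n".join(buckets[:3]))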

2) Add tracing with OpenTelemetry-style spans

Trace inference to find slow dependencies. Ensure you propagate context across services.

from time import sleep
from random import random
from opentelemetry import trace

tracer = trace.get_tracer("inference")

def featurize(x):
    with tracer.start_as_current_span("featurize") as span:
        sleep(random()*0.01)
        span.set_attribute("feature_count", 12)
        return [x]

def score(features):
    with tracer.start_as_current_span("score") as span:
        sleep(random()*0.02)
        span.set_attribute("model_version", "v1.3.0")
        return 0.42

def predict(x):
    with tracer.start_as_current_span("predict") as span:
        f = featurize(x)
        y = score(f)
        span.set_attribute("success", True)
        return y

Useful span attributes: model_version, endpoint, caller, cache_hit, success, and data_source. Avoid high-cardinality values like user_email.
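
Note that the snippet above uses only the OpenTelemetry API; without an SDK tracer provider configured, the spans are no-ops. A minimal setup sketch, assuming the opentelemetry-sdk package is installed (the console exporter is just for local experimentation):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Register a real tracer provider so spans are recorded and exported.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# With this in place, tracer = trace.get_tracer("inference") above records real spans;
# swap ConsoleSpanExporter for an OTLP exporter to send spans to your tracing backend.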

3) Compute SLO and error budget burn alerts

Suppose the monthly availability SLO is 99.5%; the error budget is then 0.5% of the period.

days = 30
minutes_in_month = days * 24 * 60  # 43200
slo = 0.995
error_budget_minutes = (1 - slo) * minutes_in_month  # 216

# Fast-burn alert (2h window) if burn_rate >= 14
# Slow-burn alert (24h window) if burn_rate >= 7
# burn_rate = (budget_consumed_fraction) / (window_fraction_of_period)

Use two windows to catch both rapid and gradual degradation without paging for noise.
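
To make the burn-rate arithmetic concrete, here is a small sketch of the formula from the comments above; the 9-bad-minutes scenario is an illustrative value, not a recommendation.

PERIOD_MINUTES = 30 * 24 * 60  # 43200, matching the monthly window above
SLO = 0.995

def burn_rate(bad_minutes_in_window: float, window_minutes: float) -> float:
    # burn_rate = budget_consumed_fraction / window_fraction_of_period
    budget_minutes = (1 - SLO) * PERIOD_MINUTES           # ~216 minutes of error budget
    budget_consumed_fraction = bad_minutes_in_window / budget_minutes
    window_fraction_of_period = window_minutes / PERIOD_MINUTES
    return budget_consumed_fraction / window_fraction_of_period

# Illustrative: 9 bad minutes inside a 2-hour window burns budget ~15x faster than allowed,
# which would trip the fast-burn alert (threshold 14).
print(burn_rate(bad_minutes_in_window=9, window_minutes=120))  # ~15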

4) Batch pipeline delay detection

Detect when a daily run is late (freshness SLI).

from datetime import datetime, timedelta

SCHEDULE = {"hour": 1, "minute": 0}
ALLOWED_LATENESS_MIN = 30

last_run_started_at = None    # set when the daily job begins
last_run_completed_at = None  # set when the daily job finishes successfully

def is_late(now: datetime) -> bool:
    """True when the most recent scheduled run has not completed within the allowed lateness."""
    scheduled = now.replace(hour=SCHEDULE["hour"], minute=SCHEDULE["minute"], second=0, microsecond=0)
    if now < scheduled:
        scheduled -= timedelta(days=1)  # the relevant run is yesterday's schedule
    deadline = scheduled + timedelta(minutes=ALLOWED_LATENESS_MIN)
    if now <= deadline:
        return False  # still inside the grace window
    # Past the deadline: late unless a successful completion exists for this scheduled run.
    return last_run_completed_at is None or last_run_completed_at < scheduled

Alert if current time is beyond schedule + allowed lateness and no successful completion recorded.
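
A quick check of the helper above, run in the same module with the completion state still unset and an illustrative date:

from datetime import datetime

# 01:45 is past the 01:00 schedule plus the 30-minute grace window, and no
# completion has been recorded, so the freshness alert should fire.
print(is_late(datetime(2026, 1, 4, 1, 45)))  # True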

5) Cost monitoring and capacity planning

Track cost per 1000 predictions and plan capacity.

import math

# Example math for capacity planning
qps = 200            # peak requests per second
p95_latency = 0.05   # 50 ms
concurrency_needed = qps * p95_latency  # Little's Law approximation (~10 in-flight requests)

# If each instance handles 10 concurrent requests at target utilization,
# round up after adding 20% headroom so the headroom is not lost to truncation.
instances = math.ceil((concurrency_needed / 10) * 1.2)  # 2

# Unit cost over one hour of traffic: instance_hours = instances * 1
instance_hourly_cost = 0.50          # illustrative price
predictions = qps * 3600             # predictions served in that hour
cost_per_1k_preds = instance_hourly_cost * instances / (predictions / 1000)

Expose metrics like cost_per_1k_predictions and instance_utilization to guide autoscaling and budgets.
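
One way to publish these, continuing the prometheus_client pattern from the first worked example; the gauge names and values here are illustrative, not a convention:

from prometheus_client import Gauge

# Illustrative gauge names; pick names that match your own conventions.
COST_PER_1K = Gauge('cost_per_1k_predictions_usd', 'Estimated cost per 1000 predictions',
                    ['model_version'])
UTILIZATION = Gauge('instance_utilization_ratio', 'Fraction of provisioned capacity in use')

# Update periodically from billing exports or an internal estimate.
COST_PER_1K.labels(model_version='v1.3.0').set(0.0014)  # illustrative value
UTILIZATION.set(0.55)                                   # illustrative value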

Drills and exercises

  • [ ] Add a histogram latency metric with model_version label and confirm buckets aggregate correctly.
  • [ ] Create two burn-rate alerts (fast 2h, slow 24h) for your primary SLO.
  • [ ] Build a dashboard panel showing batch freshness vs. target freshness and color it by SLO compliance.
  • [ ] Tag 3 spans with attributes that help debug latency (endpoint, cache_hit, downstream_service).
  • [ ] Record a unit cost metric (e.g., cost per 1000 predictions) and graph it by model_version.
  • [ ] Write a one-page runbook for “pipeline delayed > 30 minutes”.

Common mistakes and debugging tips

  • Too many high-cardinality labels: Avoid user_id, email, or request_id on metrics; keep those in logs or traces.
  • Alerting on every spike: Tie alerts to SLOs and burn rate, not single-sample thresholds.
  • Missing freshness SLIs: Batch systems often fail quietly; add explicit schedules and lateness checks.
  • Ignoring model version: Always label metrics and traces with model_version to correlate rollouts with regressions.
  • No postmortem follow-up: Track action items with owners and due dates; revisit until closed.
  • Silent costs: Ship cost and utilization metrics; include them in release checklists.

Quick debugging checklist
  • Check golden signals: errors, latency, traffic, saturation.
  • Compare before/after deployment by model_version.
  • Follow one slow trace end-to-end to find the slowest span.
  • Inspect last successful batch run vs. failed one: duration, retries, input size, freshness.
  • Validate data contract changes in upstream sources.

Mini project: End-to-end observability for an ML inference + batch pipeline

Goal: Implement observability for a toy inference service and a daily batch pipeline, including SLOs, alerts, cost metrics, and an incident runbook.

  1. Service: Add request, error, and latency metrics with model_version; add spans for featurize/score.
  2. Batch: Emit metrics for records processed, duration, success flag, and output freshness.
  3. SLOs: Define availability and latency SLOs for the service; freshness SLO for the batch.
  4. Alerts: Burn-rate alerts for availability; freshness alert for batch lateness > 30 minutes.
  5. Cost: Track cost_per_1k_predictions and batch_cost_per_run (mock values are fine).
  6. Runbook: Write triage steps for “service latency regression” and “batch run delayed”.

Success criteria
  • Dashboards show p95 latency, errors, throughput, freshness, and cost metrics.
  • Alerts fire when you simulate latency spikes or delayed runs.
  • You can trace a single request and see model_version on spans.

Subskills

  • Logs Metrics Traces Setup — Instrument apps with structured logs, key metrics, and distributed traces using standard labels.
  • SLO SLA Error Budget Thinking — Define SLIs/SLOs, compute budgets, and design burn-rate alerts.
  • Service Health Monitoring — Build dashboards for latency, availability, errors, and saturation.
  • Pipeline Health Monitoring — Monitor batch duration, success, freshness, and data drift indicators.
  • Alerting On Failures And Delays — Create actionable, deduplicated alerts for failures and lateness.
  • Cost Monitoring And Capacity Planning — Track unit costs and forecast capacity with headroom.
  • Incident Triage And Postmortems — Standardize triage, run blameless postmortems, and close action items.

Next steps

  • Complete the drills, then build the mini project to integrate multiple subskills.
  • When ready, take the skill exam below. Anyone can take it for free; logged-in users get progress saved.
  • Apply the same patterns to your current ML services and pipelines incrementally.

Skill exam

This exam checks practical understanding of instrumentation, SLOs, alerting, costs, and incident handling. It is free for everyone; sign in to save your progress.
