Why this matters
MLOps teams run real-time model APIs and scheduled batch pipelines. When things break or drift, you need to know what, where, and why. A solid setup for logs, metrics, and traces gives you fast detection, clear root-cause signals, and confidence to ship changes.
- Real tasks you will face: investigate latency spikes, correlate model version with error rate, track feature retrieval time, prove SLOs to stakeholders, and debug a single user’s failing request.
- Outcome of this lesson: you will be able to instrument an ML service and pipeline with structured logs, consistent metrics, and distributed traces.
Concept explained simply
Logs, metrics, and traces each answer a different question:
- Logs: exact event details (what happened). Free-form text or structured JSON lines.
- Metrics: numeric time series (how it trends). Low cost, power dashboards and alerts.
- Traces: request journey (where time is spent). Spans show call graph and timing.
Mental model
Imagine an airport:
- Logs are incident reports for specific flights.
- Metrics are the dashboard: flights per hour, delays, cancellations.
- Traces are a passenger’s path through check-in, security, boarding.
Together, you get both the big picture and the exact steps when you need to zoom in.
Minimal stack blueprint
- Emit JSON logs to stdout from your service. Collect with an agent (e.g., a log shipper or OpenTelemetry Collector).
- Expose Prometheus/OpenMetrics at /metrics with key counters, gauges, and histograms.
- Instrument distributed traces with the OpenTelemetry SDK. Export to a collector and a backend that supports traces.
Choose a practical baseline
- Collector/agent: OpenTelemetry Collector for logs, metrics, and traces.
- Metrics scraping: Prometheus-compatible scraper.
- Visualization: any dashboarding tool compatible with your metrics and traces backend.
Worked examples
Example 1 — Logs (structured, JSON, contextual)
Goal: structured JSON logs for inference requests with key fields.
{"ts":"2025-06-01T12:34:56Z","level":"INFO","service":"recommender","route":"/predict","model_name":"reco_v2","model_version":"2.3.1","request_id":"9f2c...","latency_ms":42,"status":"ok","features_count":128}
Tips:
- Include model_name, model_version, request_id, user_segment (coarse), status, and latency_ms.
- Avoid high-cardinality PII such as user_id; prefer hashed or segmented values (see the sketch below).
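Since user IDs should never appear verbatim, a common pattern is to log a salted hash plus a coarse segment. A minimal sketch (the helper name, salt handling, and plan values are illustrative assumptions, not part of the service code later in this lesson):
# Sketch: derive log-safe user fields from a raw user_id
import hashlib

SALT = "rotate-me"  # assumption: loaded from your secrets store in practice

def log_safe_user_fields(user_id: str, plan: str) -> dict:
    hashed = hashlib.sha256((SALT + user_id).encode()).hexdigest()[:16]  # pseudonymous, joinable within one salt rotation
    segment = "premium" if plan == "premium" else "standard"             # coarse, low-cardinality bucket
    return {"user_hash": hashed, "user_segment": segment}

# Example: log.update(log_safe_user_fields("u-123", "premium")) before emitting the JSON line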
Example 2 — Metrics (SLO-ready)
Expose these metrics at /metrics:
# HELP inference_requests_total Count of inference requests
# TYPE inference_requests_total counter
inference_requests_total{model_name="reco",model_version="2.3.1",status="ok"} 102934
inference_requests_total{model_name="reco",model_version="2.3.1",status="error"} 142
# HELP inference_latency_seconds Inference latency
# TYPE inference_latency_seconds histogram
inference_latency_seconds_bucket{model_name="reco",model_version="2.3.1",le="0.05"} 50231
inference_latency_seconds_bucket{model_name="reco",model_version="2.3.1",le="0.1"} 78900
inference_latency_seconds_bucket{model_name="reco",model_version="2.3.1",le="+Inf"} 104000
inference_latency_seconds_sum{model_name="reco",model_version="2.3.1"} 6100
inference_latency_seconds_count{model_name="reco",model_version="2.3.1"} 104000
# HELP feature_fetch_seconds Time to fetch features
# TYPE feature_fetch_seconds summary
feature_fetch_seconds{source="redis",quantile="0.95"} 0.012
# HELP in_flight_requests Current in-flight requests
# TYPE in_flight_requests gauge
in_flight_requests 3
Labels to include: model_name, model_version, status. Keep labels low-cardinality.
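The cumulative buckets above are what later P95 alerts are computed from. As a rough illustration of what histogram_quantile() does server-side (find the bucket containing the target rank, then interpolate linearly), here is a small sketch using the truncated example values:
# Sketch: estimate a quantile from cumulative (le, count) histogram buckets
buckets = [(0.05, 50231), (0.1, 78900), (float("inf"), 104000)]

def estimate_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile fell past the last finite bucket
            # linear interpolation inside the matching bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

print(estimate_quantile(0.95, buckets))  # 0.1 here: with only three buckets, P95 lands beyond the 0.1s bucket
This is also why bucket boundaries should bracket your latency SLO (for a 150 ms target, keep buckets around 0.1 and 0.25).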
Example 3 — Traces (end-to-end)
Instrument spans across the path: HTTP request → feature store → model inference → post-processing.
Trace 7e1c... (root span: POST /predict, 120ms)
├─ Span: feature_store.get_features (Redis) 40ms
├─ Span: model.infer (ONNXRuntime) 55ms
└─ Span: postprocess.serialize 10ms
Attributes: model_name=reco, model_version=2.3.1, request_id=9f2c..., user_segment=premium
Propagate context via headers such as traceparent so downstream services continue the trace.
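If your service calls other services itself (for example a feature store over HTTP), inject the current context into outgoing headers. A minimal sketch, assuming the opentelemetry-api package and httpx as the client; the feature-store URL is a placeholder:
# Sketch: propagate traceparent to a downstream HTTP call
import httpx
from opentelemetry.propagate import inject

def call_feature_store(entity_id: str) -> dict:
    headers = {}
    inject(headers)  # writes traceparent/tracestate for the active span into the dict
    resp = httpx.get("http://feature-store:8080/features", params={"id": entity_id}, headers=headers)
    return resp.json()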
Step-by-step setup (works locally, on VMs, or in Kubernetes)
Step 1 — Emit structured logs
# Python example (FastAPI): structured JSON to stdout
import json, time
from fastapi import FastAPI, Request
app = FastAPI()
@app.post("/predict")
async def predict(request: Request):
    start = time.time()
    # ... run inference ...
    latency_ms = int((time.time() - start) * 1000)
    log = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": "INFO",
        "service": "recommender",
        "route": "/predict",
        "model_name": "reco",
        "model_version": "2.3.1",
        "request_id": request.headers.get("x-request-id", "-"),
        "latency_ms": latency_ms,
        "status": "ok",
    }
    print(json.dumps(log))
    return {"ok": True, "latency_ms": latency_ms}
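In longer-lived services you will usually route these records through the standard logging module instead of print(). A minimal JSON formatter sketch (stdlib only; the "fields" extra key is a convention assumed here, not a logging built-in):
# Sketch: stdlib logging with one JSON object per line on stdout
import json, logging, sys, time

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "recommender",
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))  # merge structured extras
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("recommender")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: logger.info("prediction served", extra={"fields": {"request_id": "9f2c...", "latency_ms": 42}})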
Step 2 — Expose metrics
# Python Prometheus client
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from fastapi import Response
REQS = Counter('inference_requests_total','requests', ['model_name','model_version','status'])
LAT = Histogram('inference_latency_seconds','latency',['model_name','model_version'], buckets=[0.01,0.025,0.05,0.1,0.25,0.5,1,2,5])
INFLIGHT = Gauge('in_flight_requests','concurrent requests')
@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type="text/plain; version=0.0.4")

@app.post("/predict")
async def predict(request: Request):
    INFLIGHT.inc()
    status = "ok"
    try:
        with LAT.labels('reco', '2.3.1').time():
            ...  # run inference here
    except Exception:
        status = "error"
        raise
    finally:
        REQS.labels('reco', '2.3.1', status).inc()
        INFLIGHT.dec()
    return {"ok": True}
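A quick way to confirm the endpoint exports what you expect, assuming the Step 2 handler above is the one registered for /predict (FastAPI's TestClient needs httpx installed):
# Sketch: smoke-test the /metrics output locally
from fastapi.testclient import TestClient

client = TestClient(app)
client.post("/predict")                    # generate one observation
body = client.get("/metrics").text
assert "inference_requests_total" in body
assert "inference_latency_seconds_bucket" in body
assert "in_flight_requests" in body
print("metrics endpoint exports counters, histogram buckets, and the gauge")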
Step 3 — Add traces
# Python OpenTelemetry minimal setup
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Configure exporter to your collector endpoint
provider = TracerProvider()
trace.set_tracer_provider(provider)
# provider.add_span_processor(BatchSpanProcessor(YourExporter(...)))
FastAPIInstrumentor.instrument_app(app)
@app.post("/predict")
async def predict(request: Request):
    tracer = trace.get_tracer("recommender")
    with tracer.start_as_current_span("model.infer") as span:
        span.set_attribute("model_name", "reco")
        span.set_attribute("model_version", "2.3.1")
        # ... inference ...
    return {"ok": True}
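To actually ship spans, wire an exporter into the provider. A minimal sketch assuming the opentelemetry-exporter-otlp package and a collector reachable at otel-collector:4317 (both assumptions; substitute your own endpoint):
# Sketch: export spans over OTLP/gRPC with a service.name resource
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "recommender", "service.version": "2.3.1"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)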
Step 4 — Collector config (single pipeline for all signals)
# OpenTelemetry Collector (conceptual snippet)
receivers:
  otlp:
    protocols:
      http:
      grpc:
  prometheus:
    config:
      scrape_configs:
        - job_name: ml-service
          static_configs:
            - targets: ['ml-service:8000']
  filelog:
    include: [/var/log/containers/*.log]

processors:
  batch: {}
  attributes:
    actions:
      - key: service.name
        value: recommender
        action: insert

exporters:
  # Replace with your chosen backends for logs/metrics/traces
  otlp:
    endpoint: your-trace-metrics-endpoint:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp]
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [filelog]
      processors: [batch, attributes]
      exporters: [otlp]
Step 5 — Sampling strategy
- Start with head sampling (e.g., 5–10%) of traces to control cost; a sketch follows this list.
- Add tail-based sampling rules to keep slow or error traces.
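Head sampling can be set directly on the SDK's tracer provider. A minimal sketch at a 10% ratio (tail-based rules for errors and slow traces live in the collector and are not shown here):
# Sketch: 10% head sampling; child spans follow the parent's decision
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))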
Step 6 — SLOs and alerts
- Availability: error rate ≤ 1% over 30d.
- Latency: 95% of requests under 150 ms.
# Example alert rules (conceptual)
# Error burn-alert (fast)
- alert: HighErrorRateFastBurn
  expr: |
    sum by (model_name) (rate(inference_requests_total{status="error"}[5m]))
      / sum by (model_name) (rate(inference_requests_total[5m])) > 0.05
  for: 5m
  labels: {severity: critical}
  annotations: {summary: "High error rate (5m)"}
# Latency: P95 beyond target
- alert: LatencyP95High
  expr: histogram_quantile(0.95, sum by (le, model_name) (rate(inference_latency_seconds_bucket[5m]))) > 0.15
  for: 10m
  labels: {severity: warning}
  annotations: {summary: "P95 latency > 150ms (10m)"}
Step 7 — Dashboards that answer questions
- Overview: requests, error %, P95 latency by model_version.
- Deep dive: feature_fetch_seconds vs model.infer duration (traces).
- Release view: compare last 2 model versions side-by-side.
Common mistakes and self-check
- Mistake: unstructured logs. Fix: emit JSON with consistent keys.
- Mistake: too many labels (high cardinality). Fix: limit to model_name, model_version, status; aggregate user attributes.
- Mistake: only averages. Fix: use histograms and quantiles (P95/P99).
- Mistake: missing trace propagation. Fix: ensure traceparent is forwarded across hops.
- Mistake: alert fatigue. Fix: SLO-based, multi-window burn-rate alerts.
Self-check:
- [ ] Can you search a request by request_id across logs and traces? (See the correlation sketch after this checklist.)
- [ ] Can you show P95 latency and error rate per model_version?
- [ ] Do slow or error requests always produce a trace sample?
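One way to make the first check pass is to stamp the active trace and span IDs into every log line, so a request_id found in a log record leads straight to its trace. A minimal sketch, assuming the tracing setup from Step 3 is already in place:
# Sketch: add trace/span IDs to the structured log record
from opentelemetry import trace

def trace_fields() -> dict:
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {"trace_id": format(ctx.trace_id, "032x"), "span_id": format(ctx.span_id, "016x")}

# Example: log.update(trace_fields()) before json.dumps(log) in Step 1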
Exercises
Work through these hands-on tasks, then use the checklist below to confirm them. You can validate locally using any HTTP client and your metrics endpoint.
- Exercise 1: Instrument a toy FastAPI model API with JSON logs, Prometheus metrics, and OpenTelemetry traces. Prove that you can search logs by request_id, view inference_requests_total, and see spans for model.infer.
- Exercise 2: Create two alert rules: (a) a fast-burn error alert using 5m rates; (b) a P95 latency alert based on a histogram quantile. Test by injecting errors and artificial latency (see the fault-injection sketch after the checklist).
- [ ] Logs show model_name and model_version for every request
- [ ] Metrics endpoint exports counters and histograms
- [ ] Traces span across feature fetch and model inference
- [ ] Alerts fire on induced errors/latency and then resolve
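For Exercise 2 the service needs a way to misbehave on demand. A hedged fault-injection sketch you can drop into the toy /predict handler (the environment variable names are made up for this exercise):
# Sketch: opt-in fault injection for alert testing, e.g.
#   CHAOS_ERROR_RATE=0.2 CHAOS_EXTRA_LATENCY_MS=300 uvicorn app:app
import os, random, time

CHAOS_ERROR_RATE = float(os.getenv("CHAOS_ERROR_RATE", "0"))            # probability 0.0-1.0
CHAOS_EXTRA_LATENCY_MS = int(os.getenv("CHAOS_EXTRA_LATENCY_MS", "0"))

def maybe_inject_fault():
    if CHAOS_EXTRA_LATENCY_MS:
        time.sleep(CHAOS_EXTRA_LATENCY_MS / 1000)
    if random.random() < CHAOS_ERROR_RATE:
        raise RuntimeError("injected failure for alert testing")

# Call maybe_inject_fault() at the top of /predict, watch both alerts fire, then unset the variables and confirm they resolve.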
Practical projects
- Blue/Green model rollout: compare metrics and traces between v1 and v2, auto-rollback if P95 latency regresses by 20%.
- Data pipeline observability: instrument a batch job with logs for row counts and metrics for job duration; trace sub-steps (extract, transform, load). A starting sketch follows this list.
- Cost-aware sampling: implement head sampling at 10% plus tail rules for errors and spans > 500 ms; verify coverage.
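For the pipeline project, batch jobs are usually not scraped; they push their metrics once per run. A minimal sketch, assuming a Prometheus Pushgateway is reachable at pushgateway:9091 (not part of the baseline stack above):
# Sketch: push job duration and row count at the end of a batch run
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
DURATION = Gauge("batch_job_duration_seconds", "Wall-clock duration of the job", registry=registry)
ROWS = Gauge("batch_job_rows_processed", "Rows processed in this run", registry=registry)

start = time.time()
rows = 0
# ... extract, transform, load; update `rows` as you go ...
ROWS.set(rows)
DURATION.set(time.time() - start)
push_to_gateway("pushgateway:9091", job="feature_backfill", registry=registry)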
Who this is for
- MLOps engineers, platform engineers, and ML engineers operating models in production.
- Data engineers adding observability to ML pipelines.
Prerequisites
- Comfort with a service framework (e.g., FastAPI/Flask) and Python.
- Basic understanding of containers or Linux services.
- Familiarity with Prometheus-style metrics and OpenTelemetry basics helps.
Learning path
- Start: structured logging and consistent fields.
- Add: Prometheus/OpenMetrics with counters, gauges, histograms.
- Introduce: tracing with OpenTelemetry and context propagation.
- Define: SLOs and alerts (burn-rate, latency quantiles).
- Harden: dashboards, sampling strategy, cost control.
Mini challenge
Ship a new model version and prove with metrics and traces that it improves P95 latency by 10% and does not increase error rate. Provide one screenshot-equivalent description: metric query used, quantile chosen, and example trace showing the change in model.infer span duration.
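One way to gather that evidence is to query the Prometheus HTTP API for P95 per model_version and compare the two releases. A sketch, assuming Prometheus is reachable at localhost:9090 and the metrics from Example 2 are being scraped:
# Sketch: compare P95 latency across model versions via the Prometheus HTTP API
import requests

QUERY = (
    'histogram_quantile(0.95, sum by (le, model_version) '
    '(rate(inference_latency_seconds_bucket[15m])))'
)
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": QUERY})
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("model_version"), "p95:", series["value"][1], "s")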
Next steps
- Automate: add instrumentation to your service template so every new model starts observable by default.
- Governance: document required fields for logs/metrics/traces across all ML services.
- Reliability: add synthetic checks that call your API and validate end-to-end spans.