Why this matters
MLOps teams run real-time model APIs and scheduled batch pipelines. When things break or drift, you need to know what, where, and why. A solid setup for logs, metrics, and traces gives you fast detection, clear root-cause signals, and confidence to ship changes.
- Real tasks you will face: investigate latency spikes, correlate model version with error rate, track feature retrieval time, prove SLOs to stakeholders, and debug a single user’s failing request.
- Outcome of this lesson: you will be able to instrument an ML service and pipeline with structured logs, consistent metrics, and distributed traces.
Concept explained simply
Logs, metrics, and traces each answer a different question:
- Logs: exact event details (what happened). Free-form text or structured JSON lines.
- Metrics: numeric time series (how it trends). Low cost, power dashboards and alerts.
- Traces: request journey (where time is spent). Spans show call graph and timing.
Mental model
Imagine an airport:
- Logs are incident reports for specific flights.
- Metrics are the dashboard: flights per hour, delays, cancellations.
- Traces are a passenger’s path through check-in, security, boarding.
Together, you get both the big picture and the exact steps when you need to zoom in.
Minimal stack blueprint
- Emit JSON logs to stdout from your service. Collect with an agent (e.g., a log shipper or OpenTelemetry Collector).
- Expose Prometheus/OpenMetrics at /metrics with key counters, gauges, and histograms.
- Instrument distributed traces with the OpenTelemetry SDK. Export to a collector and a backend that supports traces.
Choose a practical baseline
- Collector/agent: OpenTelemetry Collector for logs, metrics, and traces.
- Metrics scraping: Prometheus-compatible scraper.
- Visualization: any dashboarding tool compatible with your metrics and traces backend.
Worked examples
Example 1 — Logs (structured, JSON, contextual)
Goal: structured JSON logs for inference requests with key fields.
{"ts":"2025-06-01T12:34:56Z","level":"INFO","service":"recommender","route":"/predict","model_name":"reco_v2","model_version":"2.3.1","request_id":"9f2c...","latency_ms":42,"status":"ok","features_count":128}
Tips:
- Include model_name, model_version, request_id, user_segment (coarse), status, and latency_ms.
- Avoid high-cardinality PII such as user_id; prefer hashed or segmented values (see the sketch below).
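Since user IDs should never appear verbatim, a common pattern is to log a salted hash plus a coarse segment. A minimal sketch (the helper name, salt handling, and plan values are illustrative assumptions, not part of the service code later in this lesson):
# Sketch: derive log-safe user fields from a raw user_id
import hashlib

SALT = "rotate-me"  # assumption: loaded from your secrets store in practice

def log_safe_user_fields(user_id: str, plan: str) -> dict:
    hashed = hashlib.sha256((SALT + user_id).encode()).hexdigest()[:16]  # pseudonymous, joinable within one salt rotation
    segment = "premium" if plan == "premium" else "standard"             # coarse, low-cardinality bucket
    return {"user_hash": hashed, "user_segment": segment}

# Example: log.update(log_safe_user_fields("u-123", "premium")) before emitting the JSON line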
Example 2 — Metrics (SLO-ready)
Expose these metrics at /metrics:
# HELP inference_requests_total Count of inference requests
# TYPE inference_requests_total counter
inference_requests_total{model_name="reco",model_version="2.3.1",status="ok"} 102934
inference_requests_total{model_name="reco",model_version="2.3.1",status="error"} 142
# HELP inference_latency_seconds Inference latency
# TYPE inference_latency_seconds histogram
inference_latency_seconds_bucket{model_name="reco",model_version="2.3.1",le="0.05"} 50231
inference_latency_seconds_bucket{model_name="reco",model_version="2.3.1",le="0.1"} 78900
inference_latency_seconds_bucket{model_name="reco",model_version="2.3.1",le="+Inf"} 104000
inference_latency_seconds_sum{model_name="reco",model_version="2.3.1"} 6100
inference_latency_seconds_count{model_name="reco",model_version="2.3.1"} 104000
# HELP feature_fetch_seconds Time to fetch features
# TYPE feature_fetch_seconds summary
feature_fetch_seconds{source="redis",quantile="0.95"} 0.012
# HELP in_flight_requests Current in-flight requests
# TYPE in_flight_requests gauge
in_flight_requests 3
Labels to include: model_name, model_version, status. Keep labels low-cardinality.
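The cumulative buckets above are what later P95 alerts are computed from. As a rough illustration of what histogram_quantile() does server-side (find the bucket containing the target rank, then interpolate linearly), here is a small sketch using the truncated example values:
# Sketch: estimate a quantile from cumulative (le, count) histogram buckets
buckets = [(0.05, 50231), (0.1, 78900), (float("inf"), 104000)]

def estimate_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile fell past the last finite bucket
            # linear interpolation inside the matching bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

print(estimate_quantile(0.95, buckets))  # 0.1 here: with only three buckets, P95 lands beyond the 0.1s bucket
This is also why bucket boundaries should bracket your latency SLO (for a 150 ms target, keep buckets around 0.1 and 0.25).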
Example 3 — Traces (end-to-end)
Instrument spans across the path: HTTP request → feature store → model inference → post-processing.
Trace 7e1c... (root span: POST /predict, 120ms)
├─ Span: feature_store.get_features (Redis) 40ms
├─ Span: model.infer (ONNXRuntime) 55ms
└─ Span: postprocess.serialize 10ms
Attributes: model_name=reco, model_version=2.3.1, request_id=9f2c..., user_segment=premium
Propagate context via headers such as traceparent so downstream services continue the trace.
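If your service calls other services itself (for example a feature store over HTTP), inject the current context into outgoing headers. A minimal sketch, assuming the opentelemetry-api package and httpx as the client; the feature-store URL is a placeholder:
# Sketch: propagate traceparent to a downstream HTTP call
import httpx
from opentelemetry.propagate import inject

def call_feature_store(entity_id: str) -> dict:
    headers = {}
    inject(headers)  # writes traceparent/tracestate for the active span into the dict
    resp = httpx.get("http://feature-store:8080/features", params={"id": entity_id}, headers=headers)
    return resp.json()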
Step-by-step setup (works locally, on VMs, or in Kubernetes)
Step 1 — Emit structured logs
# Python example (FastAPI): structured JSON to stdout
import json, time
from fastapi import FastAPI, Request
app = FastAPI()
@app.post("/predict")
async def predict(request: Request):
    start = time.time()
    # ... run inference ...
    latency_ms = int((time.time() - start) * 1000)
    log = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": "INFO",
        "service": "recommender",
        "route": "/predict",
        "model_name": "reco",
        "model_version": "2.3.1",
        "request_id": request.headers.get("x-request-id", "-"),
        "latency_ms": latency_ms,
        "status": "ok",
    }
    print(json.dumps(log))
    return {"ok": True, "latency_ms": latency_ms}
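In longer-lived services you will usually route these records through the standard logging module instead of print(). A minimal JSON formatter sketch (stdlib only; the "fields" extra key is a convention assumed here, not a logging built-in):
# Sketch: stdlib logging with one JSON object per line on stdout
import json, logging, sys, time

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "recommender",
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))  # merge structured extras
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("recommender")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: logger.info("prediction served", extra={"fields": {"request_id": "9f2c...", "latency_ms": 42}})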
Step 2 — Expose metrics
# Python Prometheus client
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from fastapi import Response
REQS = Counter('inference_requests_total','requests', ['model_name','model_version','status'])
LAT = Histogram('inference_latency_seconds','latency',['model_name','model_version'], buckets=[0.01,0.025,0.05,0.1,0.25,0.5,1,2,5])
INFLIGHT = Gauge('in_flight_requests','concurrent requests')
@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type="text/plain; version=0.0.4")

@app.post("/predict")
async def predict(request: Request):
    INFLIGHT.inc()
    status = "ok"
    try:
        with LAT.labels('reco', '2.3.1').time():
            ...  # run inference here
    except Exception:
        status = "error"
        raise
    finally:
        REQS.labels('reco', '2.3.1', status).inc()
        INFLIGHT.dec()
    return {"ok": True}
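A quick way to confirm the endpoint exports what you expect, assuming the Step 2 handler above is the one registered for /predict (FastAPI's TestClient needs httpx installed):
# Sketch: smoke-test the /metrics output locally
from fastapi.testclient import TestClient

client = TestClient(app)
client.post("/predict")                    # generate one observation
body = client.get("/metrics").text
assert "inference_requests_total" in body
assert "inference_latency_seconds_bucket" in body
assert "in_flight_requests" in body
print("metrics endpoint exports counters, histogram buckets, and the gauge")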
Step 3 — Add traces
# Python OpenTelemetry minimal setup
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Configure exporter to your collector endpoint
provider = TracerProvider()
trace.set_tracer_provider(provider)
# provider.add_span_processor(BatchSpanProcessor(YourExporter(...)))
FastAPIInstrumentor.instrument_app(app)
@app.post("/predict")
async def predict(request: Request):
    tracer = trace.get_tracer("recommender")
    with tracer.start_as_current_span("model.infer") as span:
        span.set_attribute("model_name", "reco")
        span.set_attribute("model_version", "2.3.1")
        # ... inference ...
    return {"ok": True}
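To actually ship spans, wire an exporter into the provider. A minimal sketch assuming the opentelemetry-exporter-otlp package and a collector reachable at otel-collector:4317 (both assumptions; substitute your own endpoint):
# Sketch: export spans over OTLP/gRPC with a service.name resource
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "recommender", "service.version": "2.3.1"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)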
Step 4 — Collector config (single pipeline for all signals)
# OpenTelemetry Collector (conceptual snippet)
receivers:
  otlp:
    protocols:
      http:
      grpc:
  prometheus:
    config:
      scrape_configs:
        - job_name: ml-service
          static_configs:
            - targets: ['ml-service:8000']
  filelog:
    include: [/var/log/containers/*.log]

processors:
  batch: {}
  attributes:
    actions:
      - key: service.name
        value: recommender
        action: insert

exporters:
  # Replace with your chosen backends for logs/metrics/traces
  otlp:
    endpoint: your-trace-metrics-endpoint:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp]
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [filelog]
      processors: [batch, attributes]
      exporters: [otlp]
Step 5 — Sampling strategy
- Start with head sampling (e.g., 5–10%) of traces to control cost; a sketch follows this list.
- Add tail-based sampling rules to keep slow or error traces.
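Head sampling can be set directly on the SDK's tracer provider. A minimal sketch at a 10% ratio (tail-based rules for errors and slow traces live in the collector and are not shown here):
# Sketch: 10% head sampling; child spans follow the parent's decision
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))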
Step 6 — SLOs and alerts
- Availability: error rate ≤ 1% over 30d.
- Latency: 95% of requests under 150 ms.
# Example alert rules (conceptual)
# Error burn-alert (fast)
- alert: HighErrorRateFastBurn
  expr: |
    sum by (model_name) (rate(inference_requests_total{status="error"}[5m]))
      / sum by (model_name) (rate(inference_requests_total[5m])) > 0.05
  for: 5m
  labels: {severity: critical}
  annotations: {summary: "High error rate (5m)"}
# Latency: P95 beyond target
- alert: LatencyP95High
  expr: histogram_quantile(0.95, sum by (le, model_name) (rate(inference_latency_seconds_bucket[5m]))) > 0.15
  for: 10m
  labels: {severity: warning}
  annotations: {summary: "P95 latency > 150ms (10m)"}
Step 7 — Dashboards that answer questions
- Overview: requests, error %, P95 latency by model_version.
- Deep dive: feature_fetch_seconds vs model.infer duration (traces).
- Release view: compare last 2 model versions side-by-side.
Common mistakes and self-check
- Mistake: unstructured logs. Fix: emit JSON with consistent keys.
- Mistake: too many labels (high cardinality). Fix: limit to model_name, model_version, status; aggregate user attributes.
- Mistake: only averages. Fix: use histograms and quantiles (P95/P99).
- Mistake: missing trace propagation. Fix: ensure traceparent is forwarded across hops.
- Mistake: alert fatigue. Fix: SLO-based, multi-window burn-rate alerts.
Self-check:
- [ ] Can you search a request by request_id across logs and traces? (See the correlation sketch after this checklist.)
- [ ] Can you show P95 latency and error rate per model_version?
- [ ] Do slow or error requests always produce a trace sample?
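One way to make the first check pass is to stamp the active trace and span IDs into every log line, so a request_id found in a log record leads straight to its trace. A minimal sketch, assuming the tracing setup from Step 3 is already in place:
# Sketch: add trace/span IDs to the structured log record
from opentelemetry import trace

def trace_fields() -> dict:
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {"trace_id": format(ctx.trace_id, "032x"), "span_id": format(ctx.span_id, "016x")}

# Example: log.update(trace_fields()) before json.dumps(log) in Step 1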
Exercises
Work through these hands-on tasks, then use the checklist below to confirm them. You can validate locally using any HTTP client and your metrics endpoint.
- Exercise 1: Instrument a toy FastAPI model API with JSON logs, Prometheus metrics, and OpenTelemetry traces. Prove that you can search logs by request_id, view inference_requests_total, and see spans for model.infer.
- Exercise 2: Create two alert rules: (a) a fast-burn error alert using 5m rates; (b) a P95 latency alert based on a histogram quantile. Test by injecting errors and artificial latency (see the fault-injection sketch after the checklist).
- [ ] Logs show model_name and model_version for every request
- [ ] Metrics endpoint exports counters and histograms
- [ ] Traces span across feature fetch and model inference
- [ ] Alerts fire on induced errors/latency and then resolve
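For Exercise 2 the service needs a way to misbehave on demand. A hedged fault-injection sketch you can drop into the toy /predict handler (the environment variable names are made up for this exercise):
# Sketch: opt-in fault injection for alert testing, e.g.
#   CHAOS_ERROR_RATE=0.2 CHAOS_EXTRA_LATENCY_MS=300 uvicorn app:app
import os, random, time

CHAOS_ERROR_RATE = float(os.getenv("CHAOS_ERROR_RATE", "0"))            # probability 0.0-1.0
CHAOS_EXTRA_LATENCY_MS = int(os.getenv("CHAOS_EXTRA_LATENCY_MS", "0"))

def maybe_inject_fault():
    if CHAOS_EXTRA_LATENCY_MS:
        time.sleep(CHAOS_EXTRA_LATENCY_MS / 1000)
    if random.random() < CHAOS_ERROR_RATE:
        raise RuntimeError("injected failure for alert testing")

# Call maybe_inject_fault() at the top of /predict, watch both alerts fire, then unset the variables and confirm they resolve.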
Practical projects
- Blue/Green model rollout: compare metrics and traces between v1 and v2, auto-rollback if P95 latency regresses by 20%.
- Data pipeline observability: instrument a batch job with logs for row counts and metrics for job duration; trace sub-steps (extract, transform, load). A starting sketch follows this list.
- Cost-aware sampling: implement head sampling at 10% plus tail rules for errors and spans > 500 ms; verify coverage.
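For the pipeline project, batch jobs are usually not scraped; they push their metrics once per run. A minimal sketch, assuming a Prometheus Pushgateway is reachable at pushgateway:9091 (not part of the baseline stack above):
# Sketch: push job duration and row count at the end of a batch run
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
DURATION = Gauge("batch_job_duration_seconds", "Wall-clock duration of the job", registry=registry)
ROWS = Gauge("batch_job_rows_processed", "Rows processed in this run", registry=registry)

start = time.time()
rows = 0
# ... extract, transform, load; update `rows` as you go ...
ROWS.set(rows)
DURATION.set(time.time() - start)
push_to_gateway("pushgateway:9091", job="feature_backfill", registry=registry)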
Who this is for
- MLOps engineers, platform engineers, and ML engineers operating models in production.
- Data engineers adding observability to ML pipelines.
Prerequisites
- Comfort with a service framework (e.g., FastAPI/Flask) and Python.
- Basic understanding of containers or Linux services.
- Familiarity with Prometheus-style metrics and OpenTelemetry basics helps.
Learning path
- Start: structured logging and consistent fields.
- Add: Prometheus/OpenMetrics with counters, gauges, histograms.
- Introduce: tracing with OpenTelemetry and context propagation.
- Define: SLOs and alerts (burn-rate, latency quantiles).
- Harden: dashboards, sampling strategy, cost control.
Mini challenge
Ship a new model version and prove with metrics and traces that it improves P95 latency by 10% and does not increase error rate. Provide one screenshot-equivalent description: metric query used, quantile chosen, and example trace showing the change in model.infer span duration.
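One way to gather that evidence is to query the Prometheus HTTP API for P95 per model_version and compare the two releases. A sketch, assuming Prometheus is reachable at localhost:9090 and the metrics from Example 2 are being scraped:
# Sketch: compare P95 latency across model versions via the Prometheus HTTP API
import requests

QUERY = (
    'histogram_quantile(0.95, sum by (le, model_version) '
    '(rate(inference_latency_seconds_bucket[15m])))'
)
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": QUERY})
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("model_version"), "p95:", series["value"][1], "s")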
Next steps
- Automate: add instrumentation to your service template so every new model starts observable by default.
- Governance: document required fields for logs/metrics/traces across all ML services.
- Reliability: add synthetic checks that call your API and validate end-to-end spans.