Debugging Model Issues With Logs

Learn to debug model issues with logs: free explanations, exercises, and a quick test, aimed at Applied Scientists.

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

Prevention: contract tests on schemas, canary validation, and alerts on null-rate spikes.

Exercise 2 solution

Bottleneck: feature_ms (180–205 ms vs. others < 12 ms).
Short-term: add a read-through cache, increase timeouts and the connection pool, and batch or parallelize fetches.
Long-term: optimize feature store queries and index hot keys.
Alert: P95 feature_ms >= 120 ms for 5 consecutive minutes, or feature_ms/total_ms ratio > 0.6.
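The alert rule can be sketched in a few lines of Python. This is a minimal illustration, not a production monitor: the function names, the per-minute input shape, the nearest-rank percentile method, and the choice to compute the ratio over each minute's summed timings are all assumptions made for the example.

```python
import math

P95_LIMIT_MS = 120        # threshold from the alert rule
CONSECUTIVE_MINUTES = 5   # window from the alert rule
RATIO_LIMIT = 0.6         # feature_ms / total_ms limit from the alert rule


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def should_alert(minutes):
    """minutes: per-minute lists of (feature_ms, total_ms) pairs, oldest first."""
    streak = 0
    for samples in minutes:
        if not samples:
            streak = 0  # assumption: a minute with no traffic breaks the streak
            continue
        feature_sum = sum(f for f, _ in samples)
        total_sum = sum(t for _, t in samples)
        # Ratio rule (assumed to be evaluated per minute on summed timings).
        if total_sum > 0 and feature_sum / total_sum > RATIO_LIMIT:
            return True
        # P95 rule: must hold for CONSECUTIVE_MINUTES minutes in a row.
        if p95([f for f, _ in samples]) >= P95_LIMIT_MS:
            streak += 1
            if streak >= CONSECUTIVE_MINUTES:
                return True
        else:
            streak = 0
    return False
```

Requiring five consecutive minutes keeps a single slow minute from paging anyone, while the ratio rule catches cases where feature fetching dominates latency even though the absolute P95 is still under the limit.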

Common mistakes and self-check

Logging raw inputs (PII risk). Self-check: scan logs for emails and phone numbers; any that appear should be redacted or hashed.
Only logging errors. Self-check: confirm INFO logs include timings and versions for healthy requests.
No correlation_id. Self-check: every line for a request shares the same id.
Human-readable-only logs. Self-check: ensure structured JSON with stable keys.
Ignoring post-processing. Self-check: decision logic, thresholds, and types are logged.
Over-logging payloads. Self-check: use summaries (null counts, binned stats), not full dumps.
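Several of these self-checks can be combined in one small logging helper. The sketch below uses only the Python standard library; the helper names (redact, summarize, log_line), the email pattern, and the 12-character digest length are illustrative assumptions. An unsalted hash is only weak pseudonymization, so treat it as a placeholder for whatever your privacy policy requires.

```python
import hashlib
import json
import re

# Simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text):
    """Replace each email with a short SHA-256 digest so raw addresses never reach the log."""
    def _hash(match):
        digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:12]
        return f"email:{digest}"
    return EMAIL_RE.sub(_hash, text)


def summarize(features):
    """Log a summary (counts) instead of the full feature payload."""
    nulls = sum(1 for value in features.values() if value is None)
    return {"feature_count": len(features), "feature_nulls": nulls}


def log_line(correlation_id, level, event, **fields):
    """Build one structured JSON line; every line for a request carries the same id."""
    record = {"correlation_id": correlation_id, "level": level, "event": event}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    features = {"price": None, "inventory": 3}
    print(log_line("c-1", "INFO", "inference_done",
                   note=redact("contact a@b.com"), **summarize(features)))
```

Sorting keys keeps the output stable across runs, which makes the lines easier to diff and to parse downstream.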

Practical projects

Add structured JSON logging to an inference service: include request_id, timings, model_version, prediction, decision, feature_nulls.
Create a log-based canary check: on deploy, assert 0 model_version mismatches and feature_nulls <= baseline.
Build a small anomaly detector that watches P95 feature_ms and feature_null rates, emitting WARN/ERROR events.
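As a starting point for the canary project, the check can be a function that reads structured log lines and returns a list of failures. This is a sketch under assumptions: the function name canary_check, its parameters, and the idea of passing the baseline as a single integer are hypothetical choices, not a prescribed design.

```python
import json


def canary_check(log_lines, expected_version, baseline_nulls):
    """Return human-readable failures; an empty list means the canary passes."""
    failures = []
    for line in log_lines:
        record = json.loads(line)
        request_id = record.get("request_id", "unknown")
        # Assert zero model_version mismatches.
        version = record.get("model_version")
        if version != expected_version:
            failures.append(
                f"{request_id}: model_version {version!r}, expected {expected_version!r}"
            )
        # Assert feature_nulls <= baseline.
        nulls = record.get("feature_nulls", 0)
        if nulls > baseline_nulls:
            failures.append(
                f"{request_id}: feature_nulls {nulls} > baseline {baseline_nulls}"
            )
    return failures
```

A deploy script could call this on the first batch of post-deploy log lines and abort the rollout if the returned list is non-empty.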

Mini challenge

Find the hidden bug

{"request_id":"r-909","prediction":0.43,"threshold":"0.4","decision":"reject","types":{"prediction":"float","threshold":"string"}}

What is wrong and how do you prove the fix via logs?

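Expected: the threshold is logged as the string "0.4" rather than a number, so the comparison is not numeric. Since 0.43 exceeds 0.4, the decision should be approve once the threshold is correctly typed. Add a startup log asserting numeric types, and log the comparator used; after the fix, the same request should log a float threshold and decision approve.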
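One way to make this class of bug visible is to validate types before comparing and to include the comparator in the decision log. The sketch below is illustrative: the function names are hypothetical, and the ">=" comparator is an assumption, since the lesson does not state whether the real service uses ">" or ">=".

```python
import json


def check_numeric(name, value):
    """Fail fast if a config or model value is not a real number."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(
            f"{name} must be numeric, got {type(value).__name__}: {value!r}"
        )
    return float(value)


def decide(request_id, prediction, threshold):
    """Return a JSON log line recording the decision, the types, and the comparator."""
    prediction = check_numeric("prediction", prediction)
    threshold = check_numeric("threshold", threshold)
    decision = "approve" if prediction >= threshold else "reject"
    return json.dumps({
        "request_id": request_id,
        "prediction": prediction,
        "threshold": threshold,
        "comparator": ">=",  # assumed comparator, logged so it can be audited
        "decision": decision,
        "types": {
            "prediction": type(prediction).__name__,
            "threshold": type(threshold).__name__,
        },
    })
```

Calling check_numeric on the configured threshold at service startup turns a silent wrong decision into an immediate, logged failure.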

Learning path

Before: Service basics (APIs, request lifecycle), monitoring vs logging.
Now: Debugging with logs (this lesson).
Next: Alerting and on-call runbooks, tracing, and automated canary checks.

Next steps

Instrument missing fields and per-stage timings if absent.
Add version assertions and decision-logic logs.
Set alerts for null-rate spikes and feature_ms P95.

Practice Exercises

Instructions

Use the log below to diagnose the issue and propose an immediate mitigation and a preventive measure.

{"ts":"2026-01-07T11:03:10Z","request_id":"r-3009","route":"blue","model_version":"3.4.2","schema_version":"v21","pre_ms":9,"feature_ms":24,"inference_ms":5,"post_ms":4,"total_ms":44,"feature_nulls":3,"feature_stats":{"price_mean":null,"inventory_null_rate":0.62},"prediction":0.12,"threshold":0.4,"decision":"reject","errors":[]}

What likely changed?
What is your immediate mitigation?
What prevention would you implement?

Expected Output

An upstream or schema change caused key features to be null (price_mean is null and inventory_null_rate is 0.62). Immediate mitigation: roll back or impute. Prevention: add schema contract tests and alerts on null-rate spikes.
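The same diagnosis can be automated by scanning each log line for null signals. This is a rough sketch: the function name null_findings, the 0.05 null-rate limit, and the baseline of 0 nulls are assumptions chosen for illustration, and the sample line keeps only the relevant fields from the exercise log.

```python
import json


def null_findings(line, max_null_rate=0.05, baseline_nulls=0):
    """Return the null-related signals found in one structured log line."""
    record = json.loads(line)
    findings = []
    nulls = record.get("feature_nulls", 0)
    if nulls > baseline_nulls:
        findings.append(f"feature_nulls={nulls}")
    for key, value in record.get("feature_stats", {}).items():
        if value is None:
            findings.append(f"{key} is null")
        elif key.endswith("_null_rate") and value > max_null_rate:
            findings.append(f"{key}={value}")
    return findings


if __name__ == "__main__":
    # Relevant fields copied from the exercise log line.
    sample = ('{"request_id":"r-3009","schema_version":"v21","feature_nulls":3,'
              '"feature_stats":{"price_mean":null,"inventory_null_rate":0.62}}')
    for finding in null_findings(sample):
        print(finding)
```

Note that the request itself reports no errors and normal latency, which is why null counts and summary statistics need to be logged and checked on healthy-looking requests too.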