Why this matters
As a Machine Learning Engineer, your models run in long training jobs, pipelines, and production services. When things go wrong or performance drifts, you must answer: What happened? With which parameters? On which data? Logging gives you an auditable trail; configuration management ensures runs are reproducible and controllable.
- Training: capture hyperparameters, seeds, dataset versions, metrics, and time-to-epoch.
- Batch inference: trace which model and config processed which batch, with correlation IDs.
- Online inference: structured logs for each request, latency breakdown, and error details without leaking sensitive data.
- On-call: quickly filter logs by level, component, or request ID to locate issues.
Concept explained simply
Logging is your system's diary. Config management is how you decide what your system should do before it runs.
- Logging: record significant events with levels (DEBUG, INFO, WARNING, ERROR, CRITICAL) and context (model version, run_id).
- Config management: load parameters from defaults, files, environment variables, and CLI flags, in a clear precedence order.
Mental model: Flight recorder + switchboard
Think of logging as a flight recorder: it continuously writes structured facts you can replay later. Config is the switchboard where you set the dials before takeoff (dataset path, learning rate, log level) so runs are deliberate and repeatable.
Essential pieces you will use
- Logger hierarchy: loggers (getLogger(__name__)), handlers (Console, File), formatters (text/JSON), filters, levels.
- Configuration: logging.config.dictConfig for centralized setup; avoid ad hoc basicConfig in production.
- Structured logs: key=value or JSON fields for easy parsing (e.g., run_id, request_id, model_version).
- Config sources: defaults in code, file (JSON/YAML), environment variables, CLI flags. Prefer precedence: CLI > ENV > file > defaults.
- Secrets: never log API keys, tokens, passwords. Scrub before logging.
- Reproducibility: always log seeds, data snapshot IDs, code version (git SHA), and config used for a run.
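The secrets point above can be enforced in code rather than by convention. A minimal sketch of a scrubbing filter, assuming secrets appear as key=value pairs inside messages (RedactFilter and the pattern are illustrative, not a standard-library API):

```python
import logging
import re

# Illustrative pattern: mask anything that looks like api_key=..., token=..., password=...
REDACT_PATTERN = re.compile(r"(api_key|token|password)=\S+", re.IGNORECASE)

class RedactFilter(logging.Filter):
    """Rewrite the record message so secrets never reach any handler."""
    def filter(self, record):
        record.msg = REDACT_PATTERN.sub(r"\1=***", record.getMessage())
        record.args = None  # message is already fully rendered
        return True

logger = logging.getLogger("secure")
logger.addFilter(RedactFilter())
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)
logger.warning("calling API with api_key=sk-12345")
# logged as: calling API with api_key=***
```

Attaching the filter to the logger (rather than one handler) guarantees the redaction applies no matter how many handlers are added later.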
Worked examples
Example 1: Minimal robust logger (console + file, level from ENV)
Goal: human-readable console logs and a rotating file for deeper debugging.
import logging
import os
from logging.config import dictConfig

LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
LOG_FILE = os.getenv("LOG_FILE", "logs/train.log")

class RunContextFilter(logging.Filter):
    """Attach a run_id (from the RUN_ID env var) to every record."""
    def filter(self, record):
        record.run_id = os.getenv("RUN_ID", "local")
        return True

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "console_fmt": {"format": "%(levelname)s %(name)s: %(message)s"},
        "file_fmt": {"format": "%(asctime)s %(levelname)s %(name)s run=%(run_id)s: %(message)s"},
    },
    "filters": {
        "run_ctx": {"()": RunContextFilter},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": LOG_LEVEL,
            "formatter": "console_fmt",
        },
        "file": {
            "class": "logging.handlers.TimedRotatingFileHandler",
            "level": "DEBUG",
            "formatter": "file_fmt",
            "filters": ["run_ctx"],  # must be attached here, or %(run_id)s raises at format time
            "filename": LOG_FILE,
            "when": "midnight",
            "backupCount": 7,
        },
    },
    "root": {
        "level": "DEBUG",
        "handlers": ["console", "file"],
    },
}

os.makedirs(os.path.dirname(LOG_FILE), exist_ok=True)  # the log directory must exist
dictConfig(LOGGING)
logger = logging.getLogger(__name__)
logger.info("Starting training")
logger.debug("Batch loaded: size=128")
logger.warning("Validation accuracy plateaued")
What you get
Console (filtered by LOG_LEVEL):
INFO __main__: Starting training
WARNING __main__: Validation accuracy plateaued
File (DEBUG+ with run_id and timestamp):
2026-01-01 00:00:00,000 INFO __main__ run=local: Starting training
2026-01-01 00:00:00,001 DEBUG __main__ run=local: Batch loaded: size=128
2026-01-01 00:00:00,002 WARNING __main__ run=local: Validation accuracy plateaued
Example 2: Structured JSON logs for services
Goal: machine-parseable logs. Implement a tiny JSON formatter.
import logging
import json
import time

class JsonFormatter(logging.Formatter):
    """Serialize each record as one JSON line."""
    def format(self, record):
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Add custom fields if present
        for key in ("run_id", "request_id", "model_version"):
            if key in record.__dict__:
                payload[key] = record.__dict__[key]
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("inference")
logger.setLevel(logging.INFO)
logger.handlers = [handler]  # replace rather than append, to avoid duplicates
logger.info("predict start", extra={"request_id": "abc123", "model_version": "v5"})
What you get
{"ts": "2026-01-01T00:00:00", "level": "INFO", "logger": "inference", "msg": "predict start", "request_id": "abc123", "model_version": "v5"}
Example 3: Simple config system (defaults + file + ENV + CLI)
Goal: predictable overrides without external dependencies.
import os
import json
import argparse
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    data_path: str = "./data/train.csv"
    lr: float = 1e-3
    epochs: int = 10
    seed: int = 42
    log_level: str = "INFO"

    @staticmethod
    def from_sources(file_path: str | None, args: list[str] | None = None) -> "TrainingConfig":
        cfg = TrainingConfig()
        # 1) File (JSON) if provided
        if file_path and os.path.exists(file_path):
            with open(file_path) as f:
                d = json.load(f)
            for k, v in d.items():
                if hasattr(cfg, k):
                    setattr(cfg, k, v)
        # 2) ENV (prefix: APP_)
        env_map = {
            "data_path": os.getenv("APP_DATA_PATH"),
            "lr": os.getenv("APP_LR"),
            "epochs": os.getenv("APP_EPOCHS"),
            "seed": os.getenv("APP_SEED"),
            "log_level": os.getenv("APP_LOG_LEVEL"),
        }
        for k, v in env_map.items():
            if v is not None:
                cast = type(getattr(cfg, k))  # cast the env string to the field's type
                setattr(cfg, k, cast(v))
        # 3) CLI overrides
        parser = argparse.ArgumentParser(add_help=False)
        parser.add_argument("--data_path")
        parser.add_argument("--lr", type=float)
        parser.add_argument("--epochs", type=int)
        parser.add_argument("--seed", type=int)
        parser.add_argument("--log_level")
        ns, _ = parser.parse_known_args(args)
        for k, v in vars(ns).items():
            if v is not None:
                setattr(cfg, k, v)
        return cfg

# Example usage:
# cfg = TrainingConfig.from_sources("config.json", ["--lr", "0.0005", "--epochs", "20"])
Behavior
- Defaults fill missing values.
- config.json overrides defaults.
- Environment (APP_*) overrides file.
- CLI overrides everything else.
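The precedence rules above boil down to a single merge in which later layers win. A minimal sketch, assuming each source is a dict where None means "not set" (merge_config is an illustrative helper, not part of the example class):

```python
def merge_config(defaults, file_cfg, env_cfg, cli_cfg):
    """Merge config layers; later layers override earlier ones, so CLI wins."""
    merged = dict(defaults)
    for layer in (file_cfg, env_cfg, cli_cfg):  # defaults < file < ENV < CLI
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

cfg = merge_config(
    {"lr": 1e-3, "epochs": 10},  # defaults in code
    {"epochs": 20},              # config.json
    {"lr": 5e-4},                # APP_* environment
    {"epochs": None},            # CLI flag not passed
)
# cfg == {"lr": 0.0005, "epochs": 20}
```

Filtering out None before updating is what keeps an unset higher-priority source from erasing a lower-priority value.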
Example 4: Correlation ID across a service
Goal: attach request_id to every log line without repeating extra={...} each time.
import logging

class RequestAdapter(logging.LoggerAdapter):
    """Inject request_id into the extra dict of every logging call."""
    def process(self, msg, kwargs):
        kwargs.setdefault("extra", {})
        kwargs["extra"]["request_id"] = self.extra.get("request_id", "-")
        return msg, kwargs

base = logging.getLogger("svc")
base.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))  # without a formatter, only the raw message prints
base.addHandler(handler)
logger = RequestAdapter(base, {"request_id": "abc123"})
logger.info("decode input")
logger.warning("timeout talking to feature store")
What you get
INFO svc: decode input
WARNING svc: timeout talking to feature store
Internally, each record carries request_id, which formatters can print or serialize.
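A LoggerAdapter works well when one object handles the whole request. When the request ID must survive across helper functions that never see the adapter, a contextvars-based Filter is a common alternative; this is a sketch of one approach (request_id_var and RequestIdFilter are illustrative names):

```python
import logging
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Inject the current context's request_id into every record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("svc2")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s req=%(request_id)s: %(message)s"))
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle(request_id, payload):
    request_id_var.set(request_id)  # set once at the edge of the service
    logger.info("processing %s", payload)  # request_id attached automatically

handle("req-1", "input-a")  # INFO svc2 req=req-1: processing input-a
handle("req-2", "input-b")  # INFO svc2 req=req-2: processing input-b
```

Because ContextVar is async-aware, this pattern also keeps IDs separate across concurrent asyncio request handlers.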
Exercises
Do these in order. The Quick Test at the end checks your understanding.
- Exercise 1 (ex1): Configure logging with dictConfig to log INFO to console and DEBUG+ to a file logs/train.log. Add run_id to file logs. Prove that DEBUG appears only in file.
- Exercise 2 (ex2): Build a TrainingConfig (dataclass) that loads from config.json, then APP_* env vars, then CLI flags, in that precedence. Print the final config and log it at startup.
- Exercise 3 (ex3): Use LoggerAdapter (or a Filter) to add request_id to all logs in a function handling inference. Show two different request IDs across two calls.
Checklist before taking the test
- I can explain the difference between logger, handler, and formatter.
- I can switch log levels via an environment variable.
- I can persist DEBUG logs to a file while keeping console cleaner.
- I can load config from file and override it with ENV and CLI.
- I can attach a run_id or request_id to every log line.
Common mistakes and self-checks
- Using print instead of logging: prints lack levels, handlers, and context. Replace prints with logger calls.
- No single place to configure logging: spread basicConfig calls lead to duplicates. Centralize with dictConfig.
- Logging secrets: API keys or tokens must be redacted. Add scrubbing or avoid logging raw payloads.
- Unstructured logs: hard to filter or analyze. Prefer key=value or JSON fields for identifiers and metrics.
- Wrong precedence of config sources: document and enforce CLI > ENV > file > defaults.
- Too verbose in production: set appropriate level (INFO/WARNING) and route DEBUG to files only.
- Handler duplication: adding handlers each import/run. Clear or guard against duplicates before adding.
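The handler-duplication guard from the last bullet can be as simple as checking logger.handlers before attaching. A minimal sketch (get_logger is an illustrative helper):

```python
import logging

def get_logger(name):
    """Return a named logger, attaching a console handler only once."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # guard: re-imports and repeated calls add nothing
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    return logger

log_a = get_logger("pipeline")
log_b = get_logger("pipeline")  # same logger object, still one handler
```

Without the guard, each call would stack another handler and every message would print once per call site.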
Self-check prompts
- If latency spikes, can you filter logs by request_id and see where time was spent?
- Given a run from last week, can you reconstruct lr, data snapshot, and seed from logs/config alone?
- Can you turn on DEBUG without code changes (via env/CLI)?
Practical projects
- Training pipeline logger: add JSON logs with run_id, epoch, loss, lr, and time per step; write to rotating files.
- Inference microservice: implement LoggerAdapter for request_id and model_version; console JSON logs only.
- Config pack: a small library that loads defaults + file + ENV + CLI with type casting and prints a redacted config summary at startup.
Who this is for
- Machine Learning Engineers deploying training and inference systems.
- Data Scientists transitioning to production-grade ML workflows.
- Backend engineers integrating ML services.
Prerequisites
- Comfortable with Python basics (functions, modules, imports).
- Familiar with the standard library (os, argparse, dataclasses).
- Basic command-line usage and environment variables.
Next steps
- Add metrics (timers/counters) alongside logs to observe performance.
- Integrate model/version info into every log line for auditability.
- Automate log rotation and retention policies in your deployment environment.
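As a starting point for the metrics suggestion above, a minimal sketch of a timing decorator that emits duration as a structured log field (step_timer is an illustrative name, not a library API):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("metrics")

def step_timer(func):
    """Log the wall-clock duration of each call as a key=value field."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("step=%s duration_ms=%.1f", func.__name__, elapsed_ms)
    return wrapper

@step_timer
def train_step():
    time.sleep(0.01)  # stand-in for real work

train_step()
```

The try/finally ensures the duration is logged even when the step raises, which is exactly when you want the timing most.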
Mini challenge
Create a small CLI training script that:
- Loads config with the described precedence.
- Logs INFO to console, DEBUG to a rotating file.
- Includes run_id and seed in every line.
- Can toggle JSON logs on/off via a CLI flag.
Quick Test