
Centralized Logging, Metrics, and Tracing

Learn centralized logging, metrics, and tracing for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

Data Platform Engineers own reliability. Centralized logging, metrics, and tracing let you see what each pipeline did, how long it took, and where it failed—without clicking through scattered UIs. You will use this to:

  • Triage incidents fast (e.g., a late daily table, a stuck consumer, a failing job).
  • Prove data quality with SLIs/SLOs (freshness, volume, errors, schema changes).
  • Find performance bottlenecks (e.g., skewed joins, slow sinks, external API slowness).
  • Coordinate changes safely (observe the impact of new partitions, schema evolution).
  • Reduce toil via searchable, structured, centralized telemetry.

Concept explained simply

Think of your platform as a city. Logs are the city diaries (what happened). Metrics are the dashboard gauges (how much, how often, how fast). Traces are the route maps (how a request/job moved through components).

Mental model: The golden thread

Every pipeline run gets a correlation_id (or trace_id). It flows through Airflow/Orchestrator, ingestion (e.g., CDC/stream), processing (e.g., Spark/DBT), and serving (warehouse, lake, APIs). You stitch telemetry with that ID to follow the golden thread end-to-end.

Core building blocks

Centralized, structured logging

  • Format: JSON per line. Avoid free-text only.
  • Minimum fields: timestamp (UTC ISO-8601), level, service, component, env, pipeline, job_name, run_id, correlation_id, span_id (optional), dataset, event, message, error_type, duration_ms, row_count.
  • Practices: consistent keys, redact sensitive data, avoid excessive log volume, set retention tiers (hot vs. cold), index wisely for search.
Sample log events (start/success/error)
{"timestamp":"2026-01-11T01:23:45Z","level":"INFO","service":"orchestrator","component":"scheduler","env":"prod","pipeline":"orders_daily","job_name":"extract_orders","run_id":"2026-01-11","correlation_id":"corr-2026-01-11-orders","event":"job_start","message":"starting extract"}
{"timestamp":"2026-01-11T01:30:12Z","level":"INFO","service":"spark","component":"transform","env":"prod","pipeline":"orders_daily","job_name":"clean_orders","run_id":"2026-01-11","correlation_id":"corr-2026-01-11-orders","event":"job_success","row_count":152342,"duration_ms":357000}
{"timestamp":"2026-01-11T01:31:00Z","level":"ERROR","service":"warehouse_loader","component":"load","env":"prod","pipeline":"orders_daily","job_name":"load_orders","run_id":"2026-01-11","correlation_id":"corr-2026-01-11-orders","event":"job_error","error_type":"duplicate_key","message":"constraint violation on orders"}

Metrics you can alert on

  • Types: counters (ever-increasing), gauges (current value), histograms (distribution/percentiles).
  • Naming: metric_namespace.metric_name (e.g., pipeline.freshness_seconds, job.duration_ms, kafka.consumer_lag).
  • Labels (tags): prefer low-cardinality: env, pipeline, job, dataset, status. Avoid user_id, order_id, file_name patterns.
  • Key SLIs: freshness_seconds, completeness_ratio (or volume_delta), error_rate, processing_latency_ms (p50/p95/p99), schema_change_events.
  • Example SLOs: freshness_seconds <= 3600 for prod daily tables 99% of days; error_rate < 0.1% weekly.
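To make the conventions above concrete, here is a minimal sketch using the prometheus_client Python library (one common choice; any metrics backend works). Names are written Prometheus-style with underscores and durations in seconds rather than the dotted, millisecond names above; the metric names, ports, and values are illustrative.

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: only ever goes up; rate() it to derive throughput or error rate.
PROCESSED_ROWS = Counter(
    "pipeline_processed_rows_total", "Rows processed per job",
    ["env", "pipeline", "job", "status"])

# Gauge: current value, e.g. seconds since the last successful load.
FRESHNESS = Gauge(
    "pipeline_freshness_seconds", "Age of the newest loaded partition",
    ["env", "pipeline", "dataset"])

# Histogram: distribution, so you can read p50/p95/p99 instead of averages.
JOB_DURATION = Histogram(
    "job_duration_seconds", "Wall-clock job duration",
    ["env", "pipeline", "job"],
    buckets=(30, 60, 120, 300, 600, 1800, 3600))

start_http_server(9108)  # expose /metrics for scraping; port is arbitrary

# Low-cardinality labels only, as recommended above.
PROCESSED_ROWS.labels(env="prod", pipeline="orders_daily",
                      job="clean_orders", status="success").inc(152342)
FRESHNESS.labels(env="prod", pipeline="orders_daily",
                 dataset="orders").set(1800)
JOB_DURATION.labels(env="prod", pipeline="orders_daily",
                    job="clean_orders").observe(357.0)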

Tracing spans for data flows

  • Spans: a parent/child hierarchy, from orchestrator run → tasks → ingestion → processing steps → load.
  • Propagation: pass trace_id (or correlation_id) via headers, job params, message metadata, or temp tables.
  • Sampling: keep all error traces; sample successes (e.g., 1–10%) to control cost.
Minimal propagation plan
  • Orchestrator generates correlation_id per run_id.
  • Attach to task params, log MDC/context, and message headers.
  • Processors read and forward the same ID; emit spans with parent_span_id.
  • Loader writes the ID into load audit tables for joinable lineage.
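A file-based sketch of that plan in plain Python: the orchestrator mints one correlation_id per run, and every stage emits a span record carrying it plus its parent_span_id. The emit_span helper and the trace.jsonl file name are illustrative; in practice you would likely use your tracing backend's SDK (e.g., OpenTelemetry).

import json
import time
import uuid

def new_span_id():
    return uuid.uuid4().hex[:16]

def emit_span(path, correlation_id, run_id, name, parent_span_id,
              started_at, ended_at, **attrs):
    """Append one span as a JSON line; every span carries the shared IDs."""
    span_id = new_span_id()
    record = {
        "correlation_id": correlation_id, "run_id": run_id,
        "span_id": span_id, "parent_span_id": parent_span_id,
        "name": name, "start_ts": started_at, "end_ts": ended_at,
        "duration_ms": int((ended_at - started_at) * 1000), **attrs,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return span_id

# Orchestrator: one correlation_id per run, forwarded to every stage.
run_id = "2026-01-11"
correlation_id = f"corr-{run_id}-orders"
run_start = time.time()

ingest_start = time.time()
# ... run ingestion here, passing correlation_id in job params or message headers ...
ingest_end = time.time()

# Root span for the run, then one child span per stage.
root = emit_span("trace.jsonl", correlation_id, run_id,
                 "orchestrator_run", None, run_start, time.time())
emit_span("trace.jsonl", correlation_id, run_id,
          "ingest_orders", root, ingest_start, ingest_end, row_count=152342)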

Worked examples

1) Late-arriving daily table

  • Symptom: dashboard shows freshness_seconds spike to 12,000.
  • Metrics: consumer_lag gauge high on ingestion; processing_duration normal.
  • Logs: errors show API rate limiting upstream.
  • Trace: long span on ingestion step only.
  • Fix: increase backoff, add retry jitter, raise concurrency off-peak; add SLO burn-rate alert for freshness.
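The backoff part of that fix could look like the sketch below: exponential delay with full jitter around the upstream call. RateLimitError and fetch_page are placeholders for whatever your HTTP client and extraction code actually use.

import random
import time

class RateLimitError(Exception):
    """Placeholder for the exception your client raises on HTTP 429."""

def call_with_backoff(fetch_page, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry a rate-limited call with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page()
        except RateLimitError:
            if attempt == max_attempts:
                raise
            # Cap the exponential delay, then sleep a random fraction of it
            # so concurrent workers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))

# Usage (api_client is hypothetical): call_with_backoff(lambda: api_client.get_orders(page=1))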

2) Streaming deserialization errors

  • Symptom: error_rate jumps to 2% with error_type=serde_error.
  • Logs: schema_change event preceded the spike.
  • Trace: short spans failing fast at consumer; no downstream load spans.
  • Fix: add schema evolution step, backward-compatible defaults, canary consumer; alert on schema_change_events with error_rate correlation.

3) Slow Spark transform

  • Symptom: p99 processing_latency_ms doubled; p50 unchanged.
  • Logs: stage with skewed join; row_count ok.
  • Trace: child span for join stage dominates duration; others normal.
  • Fix: add salting, broadcast small dimension, verify partitioning; monitor p95/p99 specifically.
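A hedged PySpark sketch of those two remedies: broadcasting a small dimension and salting a skewed key before a large join. Table paths, column names, and the salt factor are illustrative, and on Spark 3+ adaptive query execution's skew-join handling is often worth enabling first.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew_fix_sketch").getOrCreate()
orders = spark.read.parquet("/data/orders")           # large table, skewed on customer_id
customers = spark.read.parquet("/data/dim_customer")  # small dimension

# Remedy 1: broadcast the small side so the big side is never shuffled.
enriched = orders.join(F.broadcast(customers), "customer_id")

# Remedy 2: salt the skewed key for a large-to-large join.
SALT_BUCKETS = 16
salted_orders = orders.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int"))
# Replicate the other side across all salt values so every bucket finds a match.
salted_customers = customers.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))
joined = salted_orders.join(salted_customers, ["customer_id", "salt"]).drop("salt")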

Implementation checklist

  • [ ] Use JSON structured logs with UTC timestamps.
  • [ ] Include correlation_id and run_id in every log line, metric, and trace span.
  • [ ] Define metric names, units, and label conventions.
  • [ ] Emit histograms for durations; counters for processed_rows; gauges for lags.
  • [ ] Capture error_type and message safely (no sensitive data).
  • [ ] Control cardinality: review labels; cap unique values.
  • [ ] Set sampling rules for traces (all errors, sample successes).
  • [ ] Define SLIs and SLOs; create burn-rate alerts.
  • [ ] Store correlation_id in load/audit tables for joins with business data.
  • [ ] Document queries and runbooks for common incidents.

Exercises

Complete these mini tasks. The test is available to everyone; only logged-in users get saved progress.

Exercise 1 — Log schema and samples

Design a structured log schema for a daily pipeline and write three sample lines: job_start, job_success, job_error. Mirror the fields across stages.

Hints
  • Keep keys consistent: pipeline, job_name, run_id, correlation_id.
  • Include duration_ms where it makes sense.
  • Use UTC ISO timestamps.
Expected output: three JSON lines that share the same correlation_id and run_id, containing at least timestamp, level, service, component, env, pipeline, job_name, run_id, correlation_id, and event, plus row_count, duration_ms, error_type, and message where applicable.

Exercise 2 — SLIs/SLOs and alerts

For a daily dataset, define SLIs, targets, and alert rules that avoid noise.

Hints
  • Pick freshness and error_rate at minimum.
  • Use two-window burn-rate alerts (e.g., 5m and 1h).
  • Add ownership label for routing.

Exercise 3 — Trace propagation map

Map how correlation_id flows across orchestrator → ingestion → processing → load. Specify where it is stored and how it is forwarded.

Hints
  • Orchestrator should generate and inject the ID.
  • Streaming uses message headers/metadata; batch uses params or temp tables.
  • Write the ID to audit tables.

Common mistakes and self-check

  • Unstructured logs: Hard to search. Fix: JSON with stable keys.
  • No correlation_id: You cannot stitch stages. Fix: generate per run and propagate.
  • High-cardinality labels: Explodes storage/cost. Fix: keep labels coarse (env, pipeline, job, status).
  • No histograms: Averages hide tail issues. Fix: record p50/p95/p99.
  • Alert fatigue: Too many noisy alerts get ignored. Fix: one alert per symptom per owner with clear actions; use burn-rate alerts.
  • Mixed timezones: Timestamps disagree across systems. Fix: use UTC everywhere.
  • No retention tiers: Costs spike. Fix: hot short, cold longer; sample traces.
  • Secrets in logs: Leaked credentials and PII create incidents of their own. Fix: redact tokens/PII early.
Self-check mini audit
  • Pick any job run. Can you query logs by correlation_id and see all stages?
  • Can you produce p95 duration for last 7 days by job?
  • Can you list top 3 error_types this week?
  • Do you have an SLO for freshness with a working alert?

Practical projects

  • Project A: Simulate a pipeline with three steps (extract, transform, load). Write JSON logs to files, include correlation_id. Aggregate and query them to answer: total rows processed, p95 step time, top error_type.
  • Project B: Create a simple metrics emitter that writes counters/gauges/histograms to a local file in line protocol-style. Compute freshness_seconds and processing_latency_ms percentiles daily.
  • Project C: Build a trace file per run with parent/child spans. Visualize as an indented tree; verify that the critical path matches observed latency.

Learning path

  1. Start with structured JSON logging and a correlation_id.
  2. Add core metrics: processed_rows, durations (histogram), error_rate, freshness_seconds.
  3. Introduce tracing spans for critical paths; enable error sampling 100%.
  4. Define SLIs/SLOs; implement burn-rate alerts.
  5. Build dashboards and runbooks tied to correlation_id.
  6. Harden: cardinality guards, retention tiers, PII redaction, load audits.

Who this is for

  • Data Platform Engineers who own pipeline reliability.
  • Data Engineers adding observability to batch/streaming jobs.
  • Analytics Engineers responsible for trustworthy tables.

Prerequisites

  • Comfort with JSON and basic logging concepts.
  • Understanding of batch and/or streaming pipelines.
  • Basic knowledge of metrics (counters, gauges, histograms).

Next steps

  • Do the exercises above.
  • Take the quick test below to check your understanding. The test is available to everyone; only logged-in users get saved progress.
  • Apply the checklist to one real pipeline this week.

Mini challenge

Write a 5-line runbook snippet: a one-liner log query to fetch all events for the latest failed run by pipeline name, what metric to check next, and one action you would take if p99 latency is high but p50 is normal.


Centralized Logging, Metrics, and Tracing — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

