Why this matters
As a Data Engineer, you run data pipelines that must be fresh, accurate, and fast. Observability gives you the ability to detect and explain problems quickly. When a table is late, a streaming consumer lags, or a schema changes unexpectedly, metrics, logs, and traces help you pinpoint cause and impact—before stakeholders notice.
- Real tasks: ensure daily tables land by 02:00 UTC; keep Kafka consumer lag low; detect schema drift; prove SLAs/SLOs; debug slow transformations.
- Outcomes: fewer incidents, faster MTTR (mean time to recovery), predictable delivery.
Concept explained simply
Observability is the ability to answer new questions about system behavior using outputs like metrics, logs, and traces—without changing code each time a question appears.
- Metrics: numeric time series you graph and alert on (e.g., freshness minutes, row counts, error rates, consumer lag).
- Logs: structured events with context (who/what/when), great for root cause analysis.
- Traces: end-to-end timelines across components; each step is a span with duration and status.
Mental model
Think of an airport:
- Metrics: flights per hour, average delay, baggage throughput.
- Logs: incident reports when a conveyor stops, gate changes, or a scanner fails.
- Traces: a passenger’s journey from check-in to gate to baggage claim, showing where time was spent.
Key signals for data pipelines
- Freshness: now minus last successful load time.
- Schedule delay: actual start/finish versus expected window.
- Completeness: input vs output row counts; dropped/duplicate rows.
- Quality check rates: null ratio, constraint violations, invalid values.
- Schema drift: schema fingerprint/version changes.
- Latency: batch end-to-end duration; streaming event-time to sink-write-time.
- Throughput: rows processed per minute/second.
- Errors: failed tasks, DLQ (dead-letter queue) count.
- Resource saturation: CPU, memory, I/O, shuffle spill (for distributed jobs).
- Streaming lag: queue/consumer lag, watermark delay.
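Two of the signals above, freshness and completeness, reduce to simple arithmetic. A minimal sketch (function names are illustrative, not from any specific library):

```python
from datetime import datetime, timezone

def freshness_minutes(last_successful_load_time: datetime) -> float:
    """Freshness: now minus last successful load time, in minutes."""
    now = datetime.now(timezone.utc)
    return (now - last_successful_load_time).total_seconds() / 60

def completeness_ratio(input_rows: int, output_rows: int) -> float:
    """Completeness: output vs input row counts (1.0 means nothing was dropped)."""
    return output_rows / input_rows if input_rows else 0.0
```

Computing these at the end of every run, and emitting them as gauges, is usually enough to power a freshness dashboard.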
Instrumentation basics
- Define SLIs/SLOs: e.g., “Daily sales table freshness ≤ 15 minutes 95% of days.”
- Emit metrics:
- Counters: increasing totals (processed_rows_total, errors_total, schema_changes_total).
- Gauges: values that can go up or down (freshness_minutes, consumer_lag_messages).
- Histograms: distributions (batch_duration_seconds, row_size_bytes).
- Write structured logs: JSON-like fields such as timestamp, pipeline, job_id, run_id, dataset, partition, severity, message, error_type, correlation_id.
- Propagate context: a correlation_id or trace_id across orchestrator → extractor → transformer → loader, plus dataset and partition tags.
- Add traces: spans for extract, transform, and load with attributes (dataset, partition, input_rows, output_rows, duration, status).
- Build dashboards and alerts: visualize SLIs; alert on SLO violations and hard failures.
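The three metric types behave differently, which is easy to see with a tiny in-memory registry. This is a hypothetical stand-in for a real metrics client (such as a Prometheus library), not an implementation of one:

```python
from collections import defaultdict

class Metrics:
    """Toy registry illustrating counter, gauge, and histogram semantics."""
    def __init__(self):
        self.counters = defaultdict(float)    # only ever increase
        self.gauges = {}                      # set to the current value
        self.histograms = defaultdict(list)   # keep observations for percentiles

    def inc(self, name, value=1):
        self.counters[name] += value

    def set(self, name, value):
        self.gauges[name] = value

    def observe(self, name, value):
        self.histograms[name].append(value)

# One batch run (values are illustrative)
m = Metrics()
m.inc("processed_rows_total", 120_000)   # counter: running total
m.set("freshness_minutes", 7.5)          # gauge: latest value wins
m.observe("batch_duration_seconds", 432.0)  # histogram: distribution over runs
```

In production you would use your platform's metrics client, but the usage pattern (inc/set/observe) carries over directly.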
Minimal field set to include everywhere
- correlation_id (or trace_id)
- pipeline, task_name, run_id
- dataset, partition (date/hour), environment
- status (ok/fail), error_type (if any)
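A small helper can enforce this field set on every log line. The sketch below uses only the Python standard library; the field values are illustrative:

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

def log_event(status, error_type=None, **fields):
    """Emit one structured JSON log line with the minimal field set."""
    record = {
        "correlation_id": fields.pop("correlation_id", str(uuid.uuid4())),
        "status": status,
        "error_type": error_type,
        **fields,
    }
    log.info(json.dumps(record))
    return record

evt = log_event(
    "ok",
    pipeline="daily_sales",
    task_name="load",
    run_id="run-0001",
    dataset="t_orders",
    partition="2024-06-01",
    environment="prod",
)
```

Because every line is JSON with the same keys, you can later filter by correlation_id or dataset without regex gymnastics.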
Worked examples
Example 1: Daily batch freshness SLO
Goal: “Table t_orders updated by 02:00 UTC with freshness ≤ 15 minutes 95% of days.”
- Emit gauge freshness_minutes = now - last_successful_load_time.
- Emit histogram batch_duration_seconds for each run.
- Alert if freshness_minutes > 30 for 2 consecutive checks or if run fails.
What you should see
- Normal: freshness 3–10 minutes; duration 6–9 minutes; errors_total = 0.
- Incident: freshness jumps to 45 minutes; logs show upstream API timeout; trace shows extract span long.
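The "2 consecutive checks" alert rule from this example can be sketched as a small function (the threshold and window are the ones stated above; the function itself is illustrative):

```python
def freshness_alert(checks, threshold=30.0, consecutive=2):
    """Fire only when freshness exceeds the threshold for N consecutive checks."""
    streak = 0
    for value in checks:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

Requiring consecutive breaches is what keeps a single slow check from paging anyone at 3 a.m.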
Example 2: Streaming consumer lag and end-to-end latency
Pipeline: Kafka → Stream Processor → Warehouse upserts.
- Gauge consumer_lag_messages; histogram end_to_end_latency_seconds = sink_write_time - event_time.
- Counter dlq_messages_total for records routed to DLQ.
- Trace spans: poll → parse → enrich → upsert with correlation_id.
What you should see
- Normal: lag < 2,000; p95 latency < 10s.
- Incident: lag climbs to 50,000; traces show upsert span slow; logs show warehouse throttling.
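Both streaming signals in this example are simple differences, which is worth making explicit because they are easy to compute wrong. A minimal sketch (names are illustrative, not tied to any client library):

```python
from datetime import datetime, timezone, timedelta

def end_to_end_latency_seconds(event_time: datetime,
                               sink_write_time: datetime) -> float:
    """Event-time to sink-write-time: the delay a consumer of the data feels."""
    return (sink_write_time - event_time).total_seconds()

def consumer_lag(log_end_offset: int, committed_offset: int) -> int:
    """Messages between the newest offset in the topic and the consumer's position."""
    return log_end_offset - committed_offset
```

Note that latency is measured from event_time, not from when the processor received the record; using processing time instead hides upstream delay.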
Example 3: Schema drift detection
Source adds a new nullable column unexpectedly.
- Counter schema_changes_total increments.
- Log event with schema_fingerprint_old/new and changed_columns.
- Trace transform span includes attribute schema_change=true.
What you should see
- Alert on schema change in production.
- Correlate the change with downstream model failures via the same correlation_id.
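One way to detect this kind of drift is to hash the schema into a stable fingerprint and compare it run over run. A sketch under the assumption that the schema is available as (name, type) pairs:

```python
import hashlib

def schema_fingerprint(columns):
    """Stable hash over sorted (name, type) pairs; any change signals drift."""
    canonical = "|".join(f"{name}:{dtype}" for name, dtype in sorted(columns))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def changed_columns(old, new):
    """Columns added to or removed from the schema, for the log event."""
    return sorted(set(new) - set(old)) + sorted(set(old) - set(new))

old = [("order_id", "bigint"), ("amount", "decimal")]
new = old + [("coupon_code", "varchar")]  # the unexpected nullable column
drifted = schema_fingerprint(old) != schema_fingerprint(new)
```

Sorting before hashing makes the fingerprint insensitive to column order, so only real schema changes increment schema_changes_total.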
Dashboards and alerting
- Golden signals: latency, traffic/throughput, errors, saturation.
- Data-specific signals: freshness, completeness, schema drift, DLQ counts, consumer lag, watermark delay.
Sample SLOs and alerts
- SLO: “95% of hourly ingestion jobs finish within 10 minutes.” Alert: if batch_duration_seconds p95 > 900 seconds (15 minutes) over the last 3 hours.
- SLO: “Streaming p95 end-to-end latency < 15s.” Alert: if > 15s for 10 minutes.
- SLO: “Freshness <= 15 min for daily tables.” Alert: freshness_minutes > 30 for 2 checks.
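Alerting on a windowed p95 rather than single observations is what the rules above rely on. A minimal nearest-rank sketch (real monitoring systems compute this for you; the helper names are illustrative):

```python
import math

def p95(values):
    """Nearest-rank 95th percentile over a window of observations."""
    ordered = sorted(values)
    rank = max(math.ceil(0.95 * len(ordered)) - 1, 0)
    return ordered[rank]

def slo_alert(durations, threshold_seconds):
    """Fire only when the window's p95 breaches the threshold, not on one point."""
    return p95(durations) > threshold_seconds
```

A single outlier in an otherwise healthy window does not fire, which is exactly the anti-flapping behavior the alert rules aim for.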
Exercises
Complete the exercises below. The quick test is available to everyone; only logged-in users get saved progress.
Exercise 1: Instrument a nightly batch
Design metrics, logs, and traces for a nightly ETL from object storage to a warehouse. See the Exercises section below for full instructions.
Exercise 2: Tame streaming lag
Define SLOs, alerts, and a troubleshooting playbook for a streaming pipeline with rising lag. See the Exercises section below for full instructions.
Self-check checklist
- Included at least one freshness and one completeness metric.
- Logs include correlation_id, dataset, partition, and error_type.
- Traces cover extract, transform, and load spans with durations.
- SLOs have clear thresholds and time windows.
- Alert rules avoid flapping by requiring the condition to hold across a stability window (e.g., several consecutive checks).
Common mistakes and how to self-check
- Unstructured logs: hard to query. Fix: enforce key=value or JSON-like fields.
- No correlation_id: can’t join metrics, logs, and traces. Fix: generate once, pass through all stages.
- Alerting on single data points: leads to noise. Fix: use rolling windows and percentiles.
- Too many high-cardinality labels (e.g., user_id): storage and cost blowups. Fix: limit labels to stable dimensions (dataset, partition, env).
- Ignoring event-time vs processing-time: hides true latency. Fix: track both; compute end-to-end latency from event_time.
- No separation by environment: DEV data triggers PROD alerts. Fix: include environment dimension everywhere.
- No retention/aggregation: metrics store grows unbounded. Fix: downsample old data and set retention per signal.
Self-audit in 5 minutes
- Open any failed run: can you find all steps in one trace?
- Filter logs by correlation_id: do you see extractor, transformer, and loader entries?
- Check freshness dashboard: are SLOs visible and explainable to non-engineers?
Practical projects
- Build a freshness dashboard: for 3 key tables, plot freshness_minutes, batch_duration_seconds, and errors_total; add alerts.
- Streaming observability MVP: emit consumer_lag_messages, end_to_end_latency_seconds, and dlq_messages_total; create a p95 latency alert.
- Schema drift watchdog: compute schema fingerprints for two sources; increment schema_changes_total and log changed_columns; alert on PROD.
Learning path
- Understand SLIs/SLOs and the three pillars (metrics, logs, traces).
- Instrument one batch job end-to-end (emit metrics, structured logs, simple traces).
- Add a streaming job with consumer lag and event-time latency.
- Introduce schema drift detection and DLQ metrics.
- Harden alert rules; add runbooks for common incidents.
Who this is for
- Data Engineers responsible for batch or streaming pipelines.
- Analytics Engineers and Platform Engineers who need reliable data delivery.
- New hires joining data teams that own SLAs/SLOs.
Prerequisites
- Basic understanding of ETL/ELT and orchestration (e.g., task scheduling, retries).
- Comfort with logs and time-series thinking.
- Familiarity with batch and/or streaming concepts (event time vs processing time).
Next steps
- Finish the exercises and compare with the provided solutions.
- Take the quick test below to check your readiness.
- Apply instrumentation to one real pipeline this week and review the results with your team.
Mini challenge
Your team is launching a new hourly enrichment job that joins clickstream with product data.
- Define 3 SLIs and 1 SLO related to latency and freshness.
- List the minimal fields you will include in logs.
- Sketch the trace spans you want to see during a run.