Why this matters
As a Data Platform Engineer, you run hundreds or thousands of scheduled workflows (DAGs, jobs, tasks). Without solid observability, failures go unnoticed, SLAs are missed, and data consumers lose trust. Strong observability helps you:
- Detect failures and slowdowns before stakeholders do.
- Prove data freshness and SLA compliance with measurable SLOs.
- Pinpoint the root cause faster (is it code, data, infra, or a dependency?).
- Safely backfill and rerun with confidence.
Real tasks you will handle
- Set alerts for DAG failures, schedule delays, and data quality breaches.
- Build dashboards: run success rate, runtime distributions, backlog, freshness.
- Correlate logs, metrics, and traces across orchestrator, compute, and storage.
- Define SLOs (e.g., p95 runtime, success rate) and report on them monthly.
Concept explained simply
Observability for workflows is knowing what ran, how it ran, why it failed or slowed, and how it affects downstream data. You achieve this with three signals:
- Logs: Detailed text records of what tasks did and why they failed.
- Metrics: Numbers you can aggregate over time (success rate, runtimes, queue delays).
- Traces: Request flow across components with timing to see where time is spent.
Mental model
Think of each workflow run as a flight. Metrics tell you fleet health (on-time rate, average delay). Logs are the cockpit voice recorder (what happened). Traces are the black box timeline (where time was spent across systems). Add metadata (run_id, dag_id, task_id, correlation_id) to connect them.
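A minimal sketch of that connecting metadata in Python (the helper and field names are illustrative, not tied to any specific orchestrator):

```python
import uuid

def run_context(dag_id: str, task_id: str, run_id: str, attempt: int,
                correlation_id: str | None = None) -> dict:
    """Shared context attached to every log line, metric, and trace span.

    correlation_id is created once per DAG run and passed to each task so
    the three signals can be joined across systems.
    """
    return {
        "dag_id": dag_id,
        "task_id": task_id,
        "run_id": run_id,
        "attempt": attempt,
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }

# One context per task attempt, reused for all three signals.
ctx = run_context("nightly_sales_etl", "load_orders", "2024-06-01T02:00:00Z", attempt=1)
```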
Core components to implement
- Run and task metadata: dag_id, task_id, run_id, attempt, owner/team, tags (env, dataset).
- Key metrics (emit per task and per DAG):
  - Success rate, retry count, failure count.
  - Runtime (mean, p95), queue wait time, schedule delay.
  - Backlog size for scheduled/backfill runs.
  - Data freshness and data quality pass rate.
- Structured logs with context: Always include dag_id, task_id, run_id, correlation_id; log start/finish with status and duration (see the sketch after this list).
- Traces spanning orchestrator → compute → storage → external APIs, with span attributes for workflow context.
- SLOs and alerts:
  - SLO examples: monthly success rate ≥ 99.5%, p95 runtime ≤ target, freshness ≤ target, schedule delay p95 ≤ threshold.
  - Alert tiers: P1 (broad impact), P2 (SLO risk), P3 (single run failure with auto-retry).
- Dashboards: Daily/weekly trend panels, p95 runtime, failures by task, top flakiest tasks, freshness heatmap.
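A minimal sketch of the metrics and structured-logging bullets above, using only the Python standard library and the ctx dict from the earlier sketch; emit_metric is a hypothetical stand-in for whatever metrics client you actually use (StatsD, Prometheus, a vendor SDK):

```python
import json
import logging
import time

logger = logging.getLogger("workflow")

def emit_metric(name: str, value: float, tags: dict) -> None:
    # Hypothetical stand-in: forward to your real metrics backend.
    print(f"METRIC {name}={value} tags={tags}")

def run_task(ctx: dict, fn) -> None:
    """Wrap a task callable with structured logs and task-level metrics."""
    start = time.monotonic()
    logger.info(json.dumps({**ctx, "event": "task_start"}))
    status = "failed"
    try:
        fn()
        status = "success"
    except Exception as exc:
        logger.error(json.dumps({**ctx, "event": "task_error",
                                 "error_type": type(exc).__name__}))
        raise
    finally:
        duration_ms = round((time.monotonic() - start) * 1000)
        logger.info(json.dumps({**ctx, "event": "task_finish",
                                "status": status, "duration_ms": duration_ms}))
        emit_metric("task.duration_ms", duration_ms, tags=ctx)
        emit_metric("task.status", 1, tags={**ctx, "status": status})
```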
Worked examples
Example 1: Nightly Sales ETL runs slow
- Symptom: p95 runtime increased from 20 min to 55 min.
- Check metrics: p95 runtime spike and increased queue wait time.
- Open trace: See 25 min waiting on warehouse cluster and 10 min on API calls.
- Logs show: Warehouse concurrency reduced due to maintenance window.
- Action: Temporarily increase priority for this DAG and adjust schedule to avoid maintenance window. Add alert on queue wait p95 > 10 min.
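A hedged sketch of that last alert (queue wait p95 > 10 min), assuming you can pull recent queue-wait samples from your metrics store; how you fetch them is left out:

```python
import statistics

QUEUE_WAIT_P95_THRESHOLD_MIN = 10  # threshold from the action above

def queue_wait_p95_breached(queue_wait_minutes: list[float]) -> bool:
    """Return True if the p95 of recent queue wait times exceeds the threshold."""
    if len(queue_wait_minutes) < 2:
        return False  # not enough samples for a percentile
    p95 = statistics.quantiles(queue_wait_minutes, n=100)[94]  # 95th percentile
    return p95 > QUEUE_WAIT_P95_THRESHOLD_MIN

# Example with recent per-run queue waits (minutes):
print(queue_wait_p95_breached([3, 4, 5, 6, 12, 14, 15, 18, 22, 25]))
```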
Example 2: Data quality failure on daily Customers DAG
- Symptom: DQ pass rate dips to 92% (threshold 98%).
- Structured log: task_id=validate_customers; shows 8% null emails.
- Trace: Upstream ingest task shows schema drift (new optional column).
- Action: Add null-handling transform, update expectation suite, backfill last 2 days, re-run DQ. Add alert on null_rate > 2%.
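A minimal sketch of the null_rate check behind that alert, assuming customer rows arrive as plain dicts; wiring it into your expectation suite or DQ tool is up to you:

```python
NULL_RATE_THRESHOLD = 0.02  # 2%, from the action above

def null_rate(records: list[dict], field: str) -> float:
    """Fraction of records where `field` is missing, None, or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)

rows = [{"email": "a@x.com"}, {"email": None}, {"email": "b@x.com"}, {"email": ""}]
rate = null_rate(rows, "email")
if rate > NULL_RATE_THRESHOLD:
    print(f"ALERT: null_rate for 'email' is {rate:.1%} (> {NULL_RATE_THRESHOLD:.0%})")
```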
Example 3: Missed SLA due to schedule delays
- Symptom: p95 schedule delay increases from 2 min to 20 min.
- Metrics: Backlog grows and executor saturation rises.
- Trace: Orchestrator scheduler loop takes longer after adding 300 new tasks.
- Action: Scale scheduler/executor, shard large DAG, add alert on backlog size and schedule delay. Track success rate post-change.
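For reference, schedule delay in these examples is just actual start minus scheduled start; a small sketch, assuming both timestamps are available per run:

```python
from datetime import datetime, timezone

def schedule_delay_ms(scheduled_start: datetime, actual_start: datetime) -> float:
    """Delay between when a run should have started and when it actually started."""
    return max((actual_start - scheduled_start).total_seconds() * 1000, 0.0)

scheduled = datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc)
actual = datetime(2024, 6, 1, 2, 18, tzinfo=timezone.utc)
print(schedule_delay_ms(scheduled, actual))  # 1,080,000 ms = 18 min
```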
How to set it up (vendor-neutral steps)
- Define context fields: dag_id, task_id, run_id, attempt, correlation_id, env, owner.
- Emit metrics at task start/finish and DAG finish: status, duration_ms, retries, queue_wait_ms, schedule_delay_ms, data_freshness_min.
- Log in structured JSON lines with the same context fields; include error types and key parameters.
- Instrument traces so each task is a span; propagate correlation_id to downstream services (see the sketch after these steps).
- Create SLOs and alert rules: success rate, p95 runtime, p95 schedule delay, freshness, DQ pass rate.
- Build dashboards: overview (SLOs), reliability (failures, retries), performance (runtimes, queue), quality (DQ), freshness.
- Run a game day: intentionally fail a non-critical task and verify detection, alert, and recovery steps.
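For the tracing step, a minimal sketch with OpenTelemetry's Python API, assuming the opentelemetry-api package is installed and a tracer provider/exporter is configured elsewhere; attribute names mirror the context fields defined earlier:

```python
from opentelemetry import trace

tracer = trace.get_tracer("workflow.observability")

def run_task_traced(ctx: dict, fn) -> None:
    """Run one task as a span, carrying workflow context as span attributes."""
    with tracer.start_as_current_span(f"{ctx['dag_id']}.{ctx['task_id']}") as span:
        for key in ("dag_id", "task_id", "run_id", "attempt", "correlation_id"):
            span.set_attribute(f"workflow.{key}", str(ctx[key]))
        # Calls made inside this block become child spans if they are also
        # instrumented; pass correlation_id in request headers to services
        # that are not.
        fn()
```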
Checklist: Good observability for a DAG
- All logs include dag_id, task_id, run_id, correlation_id.
- Task start/finish metrics emitted with status and duration.
- Trace spans cover orchestrator → compute → storage.
- Alerts exist for failures, schedule delay, freshness, and DQ.
- Dashboard shows p95 runtime and success rate trends.
- Runbook linked in alert text (short, actionable steps).
Exercises — practice
Exercise 1: Define SLOs and alerts for a nightly marketing DAG
See the Exercises section below for full instructions and a solution. ID: ex1.
Exercise 2: Troubleshoot a slow task with logs, metrics, and traces
See the Exercises section below for full instructions and a solution. ID: ex2.
Common mistakes and self-check
- Only alerting on failures, not on delays or freshness. Self-check: Do you alert on p95 schedule delay and data freshness?
- Logs without context. Self-check: Do logs include dag_id, task_id, run_id, correlation_id?
- Dashboards without distributions. Self-check: Do you track p95/p99 runtimes, not just averages?
- No ownership in alerts. Self-check: Does each alert include owner/team and a runbook hint?
- Over-alerting. Self-check: Are low-priority issues routed as P3 or muted during maintenance?
Practical projects
- Instrument one production DAG with structured logs, task metrics, and a trace span per task. Add two alerts (failure and schedule delay).
- Create a reliability dashboard: run success rate, failures by task, p95 runtime, backlog size, and freshness.
- Backfill game day: simulate a failure in a non-critical task, validate that alerts fire, then fix and measure time-to-recovery.
Mini challenge
Your nightly Customer 360 DAG has p95 runtime creeping from 30 to 50 minutes over two weeks, with no increase in data volume. In one paragraph, describe the top three metrics and trace spans you would inspect first, the likely culprits, and the smallest change you would test to confirm.
Who this is for
- Data Platform Engineers who own orchestration and scheduling.
- SREs supporting data infrastructure.
Prerequisites
- Basic DAG/workflow concepts (tasks, retries, schedules).
- Familiarity with logging and metrics at a high level.
- Ability to read dashboards and interpret percentiles.
Learning path
- Instrument logs with context fields.
- Add task and DAG-level metrics (status, duration, delay).
- Introduce traces across critical tasks.
- Define SLOs and create alert policies.
- Build a dashboard and run a game day.
Next steps
- Apply these patterns to your top 3 business-critical DAGs.
- Standardize a logging/metrics/tracing library for all workflows.
- Review alert noise monthly and tune thresholds.