Why this matters
As a Data Platform Engineer, you run hundreds or thousands of scheduled workflows (DAGs, jobs, tasks). Without solid observability, failures go unnoticed, SLAs are missed, and data consumers lose trust. Strong observability helps you:
- Detect failures and slowdowns before stakeholders do.
- Prove data freshness and SLA compliance with measurable SLOs.
- Pinpoint the root cause faster (is it code, data, infra, or a dependency?).
- Safely backfill and rerun with confidence.
Real tasks you will handle
- Set alerts for DAG failures, schedule delays, and data quality breaches.
- Build dashboards: run success rate, runtime distributions, backlog, freshness.
- Correlate logs, metrics, and traces across orchestrator, compute, and storage.
- Define SLOs (e.g., p95 runtime, success rate) and report on them monthly.
Concept explained simply
Observability for workflows is knowing what ran, how it ran, why it failed or slowed, and how it affects downstream data. You achieve this with three signals:
- Logs: Detailed text records of what tasks did and why they failed.
- Metrics: Numbers you can aggregate over time (success rate, runtimes, queue delays).
- Traces: Request flow across components with timing to see where time is spent.
Mental model
Think of each workflow run as a flight. Metrics tell you fleet health (on-time rate, average delay). Logs are the cockpit voice recorder (what happened). Traces are the black box timeline (where time was spent across systems). Add metadata (run_id, dag_id, task_id, correlation_id) to connect them.
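A minimal sketch of that connecting metadata in Python (the helper and field names are illustrative, not tied to any specific orchestrator):

```python
import uuid

def run_context(dag_id: str, task_id: str, run_id: str, attempt: int,
                correlation_id: str | None = None) -> dict:
    """Shared context attached to every log line, metric, and trace span.

    correlation_id is created once per DAG run and passed to each task so
    the three signals can be joined across systems.
    """
    return {
        "dag_id": dag_id,
        "task_id": task_id,
        "run_id": run_id,
        "attempt": attempt,
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }

# One context per task attempt, reused for all three signals.
ctx = run_context("nightly_sales_etl", "load_orders", "2024-06-01T02:00:00Z", attempt=1)
```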
Core components to implement
- Run and task metadata: dag_id, task_id, run_id, attempt, owner/team, tags (env, dataset).
- Key metrics (emit per task and per DAG):
  - Success rate, retry count, failure count.
  - Runtime (mean, p95), queue wait time, schedule delay.
  - Backlog size for scheduled/backfill runs.
  - Data freshness and data quality pass rate.
- Structured logs with context: Always include dag_id, task_id, run_id, correlation_id; log start/finish with status and duration (see the sketch after this list).
- Traces spanning orchestrator → compute → storage → external APIs, with span attributes for workflow context.
- SLOs and alerts:
  - SLO examples: monthly success rate ≥ 99.5%, p95 runtime ≤ target, freshness ≤ target, schedule delay p95 ≤ threshold.
  - Alert tiers: P1 (broad impact), P2 (SLO risk), P3 (single run failure with auto-retry).
- Dashboards: Daily/weekly trend panels, p95 runtime, failures by task, top flakiest tasks, freshness heatmap.
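A minimal sketch of the metrics and structured-logging bullets above, using only the Python standard library and the ctx dict from the earlier sketch; emit_metric is a hypothetical stand-in for whatever metrics client you actually use (StatsD, Prometheus, a vendor SDK):

```python
import json
import logging
import time

logger = logging.getLogger("workflow")

def emit_metric(name: str, value: float, tags: dict) -> None:
    # Hypothetical stand-in: forward to your real metrics backend.
    print(f"METRIC {name}={value} tags={tags}")

def run_task(ctx: dict, fn) -> None:
    """Wrap a task callable with structured logs and task-level metrics."""
    start = time.monotonic()
    logger.info(json.dumps({**ctx, "event": "task_start"}))
    status = "failed"
    try:
        fn()
        status = "success"
    except Exception as exc:
        logger.error(json.dumps({**ctx, "event": "task_error",
                                 "error_type": type(exc).__name__}))
        raise
    finally:
        duration_ms = round((time.monotonic() - start) * 1000)
        logger.info(json.dumps({**ctx, "event": "task_finish",
                                "status": status, "duration_ms": duration_ms}))
        emit_metric("task.duration_ms", duration_ms, tags=ctx)
        emit_metric("task.status", 1, tags={**ctx, "status": status})
```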
Worked examples
Example 1: Nightly Sales ETL runs slow
- Symptom: p95 runtime increased from 20 min to 55 min.
- Check metrics: p95 runtime spike and increased queue wait time.
- Open trace: See 25 min waiting on warehouse cluster and 10 min on API calls.
- Logs show: Warehouse concurrency reduced due to maintenance window.
- Action: Temporarily increase priority for this DAG and adjust schedule to avoid maintenance window. Add alert on queue wait p95 > 10 min.
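A hedged sketch of that last alert (queue wait p95 > 10 min), assuming you can pull recent queue-wait samples from your metrics store; how you fetch them is left out:

```python
import statistics

QUEUE_WAIT_P95_THRESHOLD_MIN = 10  # threshold from the action above

def queue_wait_p95_breached(queue_wait_minutes: list[float]) -> bool:
    """Return True if the p95 of recent queue wait times exceeds the threshold."""
    if len(queue_wait_minutes) < 2:
        return False  # not enough samples for a percentile
    p95 = statistics.quantiles(queue_wait_minutes, n=100)[94]  # 95th percentile
    return p95 > QUEUE_WAIT_P95_THRESHOLD_MIN

# Example with recent per-run queue waits (minutes):
print(queue_wait_p95_breached([3, 4, 5, 6, 12, 14, 15, 18, 22, 25]))
```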
Example 2: Data quality failure on daily Customers DAG
- Symptom: DQ pass rate dips to 92% (threshold 98%).
- Structured log: task_id=validate_customers; shows 8% null emails.
- Trace: Upstream ingest task shows schema drift (new optional column).
- Action: Add null-handling transform, update expectation suite, backfill last 2 days, re-run DQ. Add alert on null_rate > 2%.
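A minimal sketch of the null_rate check behind that alert, assuming customer rows arrive as plain dicts; wiring it into your expectation suite or DQ tool is up to you:

```python
NULL_RATE_THRESHOLD = 0.02  # 2%, from the action above

def null_rate(records: list[dict], field: str) -> float:
    """Fraction of records where `field` is missing, None, or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)

rows = [{"email": "a@x.com"}, {"email": None}, {"email": "b@x.com"}, {"email": ""}]
rate = null_rate(rows, "email")
if rate > NULL_RATE_THRESHOLD:
    print(f"ALERT: null_rate for 'email' is {rate:.1%} (> {NULL_RATE_THRESHOLD:.0%})")
```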
Example 3: Missed SLA due to schedule delays
- Symptom: p95 schedule delay increases from 2 min to 20 min.
- Metrics: Backlog grows and executor saturation rises.
- Trace: Orchestrator scheduler loop takes longer after adding 300 new tasks.
- Action: Scale scheduler/executor, shard large DAG, add alert on backlog size and schedule delay. Track success rate post-change.
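For reference, schedule delay in these examples is just actual start minus scheduled start; a small sketch, assuming both timestamps are available per run:

```python
from datetime import datetime, timezone

def schedule_delay_ms(scheduled_start: datetime, actual_start: datetime) -> float:
    """Delay between when a run should have started and when it actually started."""
    return max((actual_start - scheduled_start).total_seconds() * 1000, 0.0)

scheduled = datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc)
actual = datetime(2024, 6, 1, 2, 18, tzinfo=timezone.utc)
print(schedule_delay_ms(scheduled, actual))  # 1,080,000 ms = 18 min
```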
How to set it up (vendor-neutral steps)
- Define context fields: dag_id, task_id, run_id, attempt, correlation_id, env, owner.
- Emit metrics at task start/finish and DAG finish: status, duration_ms, retries, queue_wait_ms, schedule_delay_ms, data_freshness_min.
- Log in structured JSON lines with the same context fields; include error types and key parameters.
- Instrument traces so each task is a span; propagate correlation_id to downstream services (see the sketch after these steps).
- Create SLOs and alert rules: success rate, p95 runtime, p95 schedule delay, freshness, DQ pass rate.
- Build dashboards: overview (SLOs), reliability (failures, retries), performance (runtimes, queue), quality (DQ), freshness.
- Run a game day: intentionally fail a non-critical task and verify detection, alert, and recovery steps.
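For the tracing step, a minimal sketch with OpenTelemetry's Python API, assuming the opentelemetry-api package is installed and a tracer provider/exporter is configured elsewhere; attribute names mirror the context fields defined earlier:

```python
from opentelemetry import trace

tracer = trace.get_tracer("workflow.observability")

def run_task_traced(ctx: dict, fn) -> None:
    """Run one task as a span, carrying workflow context as span attributes."""
    with tracer.start_as_current_span(f"{ctx['dag_id']}.{ctx['task_id']}") as span:
        for key in ("dag_id", "task_id", "run_id", "attempt", "correlation_id"):
            span.set_attribute(f"workflow.{key}", str(ctx[key]))
        # Calls made inside this block become child spans if they are also
        # instrumented; pass correlation_id in request headers to services
        # that are not.
        fn()
```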
Checklist: Good observability for a DAG
- All logs include dag_id, task_id, run_id, correlation_id.
- Task start/finish metrics emitted with status and duration.
- Trace spans cover orchestrator → compute → storage.
- Alerts exist for failures, schedule delay, freshness, and DQ.
- Dashboard shows p95 runtime and success rate trends.
- Runbook linked in alert text (short, actionable steps).
Exercises — practice
Exercise 1: Define SLOs and alerts for a nightly marketing DAG
See the Exercises section below for full instructions and a solution. ID: ex1.
Exercise 2: Troubleshoot a slow task with logs, metrics, and traces
See the Exercises section below for full instructions and a solution. ID: ex2.
Common mistakes and self-check
- Only alerting on failures, not on delays or freshness. Self-check: Do you alert on p95 schedule delay and data freshness?
- Logs without context. Self-check: Do logs include dag_id, task_id, run_id, correlation_id?
- Dashboards without distributions. Self-check: Do you track p95/p99 runtimes, not just averages?
- No ownership in alerts. Self-check: Does each alert include owner/team and a runbook hint?
- Over-alerting. Self-check: Are low-priority issues routed as P3 or muted during maintenance?
Practical projects
- Instrument one production DAG with structured logs, task metrics, and a trace span per task. Add two alerts (failure and schedule delay).
- Create a reliability dashboard: run success rate, failures by task, p95 runtime, backlog size, and freshness.
- Backfill game day: simulate a failure in a non-critical task, validate that alerts fire, then fix and measure time-to-recovery.
Mini challenge
Your nightly Customer 360 DAG has p95 runtime creeping from 30 to 50 minutes over two weeks, with no increase in data volume. In one paragraph, describe the top three metrics and trace spans you would inspect first, the likely culprits, and the smallest change you would test to confirm.
Who this is for
- Data Platform Engineers who own orchestration and scheduling.
- SREs supporting data infrastructure.
Prerequisites
- Basic DAG/workflow concepts (tasks, retries, schedules).
- Familiarity with logging and metrics at a high level.
- Ability to read dashboards and interpret percentiles.
Learning path
- Instrument logs with context fields.
- Add task and DAG-level metrics (status, duration, delay).
- Introduce traces across critical tasks.
- Define SLOs and create alert policies.
- Build a dashboard and run a game day.
Next steps
- Apply these patterns to your top 3 business-critical DAGs.
- Standardize a logging/metrics/tracing library for all workflows.
- Review alert noise monthly and tune thresholds.