Who this is for
- MLOps Engineers building and running pipelines in Airflow, Prefect, Dagster, Kubeflow, or similar.
- Data/Platform Engineers responsible for reliability, alerting, and on-call.
- Data Scientists who need trustworthy, debuggable scheduled workflows.
Prerequisites
- Basic understanding of workflow/DAG concepts (tasks, retries, schedules, backfills).
- Familiarity with logs and metrics (counters, gauges, histograms).
- Know how your orchestrator defines a run (dag_run/flow_run/pipeline_run).
Why this matters
In real MLOps, success is not just running a pipeline—it’s knowing when it’s slow, flaky, or blocked, and fixing it fast. Observability turns unknowns into signals you can act on.
- Pinpoint the task that made a nightly training job miss its SLA.
- Prove a model drift alert is due to a late upstream dataset, not your code.
- Reduce MTTR by correlating logs, metrics, and traces across tasks.
- Confidently backfill without flooding downstream services or paging on-call.
Concept explained simply
Observability for workflows means your pipelines “explain themselves” with signals:
- Metrics: numbers you chart (e.g., run duration, success rate, queue time).
- Logs: structured records that tell you events and context.
- Traces: timelines that show how tasks connect and where time is spent.
- Events: state changes (queued, running, success, failed, retried).
Mental model
Think of a workflow as a tree of steps. Each step emits signals tagged with the same run identifiers. If every step speaks the same language (correlation IDs, task names, dataset versions), you can stitch the story together and act fast.
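To make the mental model concrete, here is a small illustrative sketch; the field names and values are invented, not tied to any particular orchestrator. It shows how one task's metric, log record, and state event all carry the same identifiers:

```python
# Illustrative only: field names and values are hypothetical.
run_context = {
    "workflow_name": "nightly_training",
    "run_id": "run_2024_06_01",
    "correlation_id": "c0ffee42",
}

metric = {**run_context, "task_name": "train_model",
          "name": "task_duration_seconds", "value": 412.7}
log_record = {**run_context, "task_name": "train_model", "level": "INFO",
              "message": "training finished", "dataset_version": "2024-06-01"}
event = {**run_context, "task_name": "train_model", "state": "success",
         "timestamp": "2024-06-01T01:52:03Z"}

# Any of these records can be joined on run_id/correlation_id to reconstruct the run's story.
```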
What “good” looks like
- Every run has a unique run_id and a correlation_id that appear in logs, metrics, and traces.
- Dashboards show: success rate, failure rate, runtime, wait time, retries, and critical resource usage.
- Alerts are tied to SLOs, not noise (e.g., page only for user-impacting breaches).
- Data quality gates block bad data and explain why they blocked.
What to observe (signals and tags)
- Reliability: run_success_total, run_failure_total, task_retry_total, consecutive_failures.
- Timeliness: run_duration_seconds, task_duration_seconds, time_to_start (queue/wait), schedule_delay_seconds.
- Throughput: runs_started_total, tasks_per_run, items_processed_total.
- Resources: cpu_seconds, memory_mb, gpu_time_seconds (if applicable).
- Data quality: rows_checked, checks_failed_total, data_freshness_seconds.
- Lineage and versioning: dataset_name, dataset_version, model_version, code_commit.
- Context tags (critical): orchestrator, dag_name/flow_name, task_name, run_id/flow_run_id, correlation_id, environment, owner/team, region.
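As one way to emit a few of these signals, here is a sketch using the prometheus_client library; the metric and label choices mirror the lists above but are illustrative, not a required schema:

```python
# A sketch with prometheus_client; metric/label names mirror the lists above.
from prometheus_client import Counter, Histogram

COMMON_LABELS = ["orchestrator", "workflow_name", "task_name", "env"]

run_failure_total = Counter(
    "run_failure_total", "Workflow runs that ended in failure",
    ["orchestrator", "workflow_name", "env"],
)
task_retry_total = Counter("task_retry_total", "Task retries", COMMON_LABELS)
task_duration_seconds = Histogram(
    "task_duration_seconds", "Wall-clock task duration in seconds", COMMON_LABELS,
)

# Tag every observation with the same context.
task_duration_seconds.labels(
    orchestrator="airflow", workflow_name="nightly_training",
    task_name="train_model", env="prod",
).observe(412.7)
```

Per-run identifiers such as run_id and correlation_id are kept out of the metric labels here; in Prometheus-style backends they usually travel in logs and traces instead to avoid label-cardinality blowups. Adjust to whatever your metrics backend recommends.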
Minimal tag set to standardize
- orchestrator
- workflow_name
- task_name
- run_id
- correlation_id
- env
- version (code/model/data)
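One way to enforce this tag set is a small helper that every task calls when it emits a signal. The sketch below assumes the values arrive via environment variables, which is only one possible wiring:

```python
# A sketch of a standard tag helper; the environment-variable names are assumptions.
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunTags:
    orchestrator: str
    workflow_name: str
    task_name: str
    run_id: str
    correlation_id: str
    env: str
    version: str  # code/model/data version, whichever applies


def current_tags(task_name: str) -> dict:
    """Build the standard tag set once; attach it to every metric, log, and span."""
    run_id = os.getenv("RUN_ID", "unknown")
    return asdict(RunTags(
        orchestrator=os.getenv("ORCHESTRATOR", "airflow"),
        workflow_name=os.getenv("WORKFLOW_NAME", "nightly_training"),
        task_name=task_name,
        run_id=run_id,
        correlation_id=os.getenv("CORRELATION_ID", run_id),
        env=os.getenv("ENV", "dev"),
        version=os.getenv("GIT_COMMIT", "unknown"),
    ))
```

Falling back to run_id when no correlation_id is provided keeps cross-task correlation working even before upstream callers propagate their own ID.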
Designing observability for workflows
Define SLOs first:
- Reliability SLO: e.g., "The nightly workflow succeeds in ≥ 99% of runs over a rolling 30 days."
- Timeliness SLO: e.g., "The run completes by 03:00 UTC on weekdays."
- Freshness SLO: e.g., "The source table is updated within 2 hours of its schedule."
Instrument every task consistently:
- Emit metrics on every task transition (start, success, fail, retry).
- Use structured logs with JSON fields for run_id and correlation_id.
- Wrap tasks with tracing spans; propagate correlation_id downstream (a sketch follows this list).
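A minimal sketch of such a wrapper is below. It covers the metric-on-transition and structured-logging bullets; emit_metric is a stand-in for whatever metrics client you use, and a tracing span could be opened in the same place (for example with OpenTelemetry):

```python
# A sketch of a task wrapper; emit_metric is a placeholder for your metrics client.
import json
import logging
import sys
import time
import uuid

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("workflow")


def emit_metric(name: str, value: float, tags: dict) -> None:
    # Placeholder: forward to StatsD, Prometheus, CloudWatch, etc.
    log.info(json.dumps({"metric": name, "value": value, **tags}))


def run_task(task_name: str, fn, tags: dict):
    tags = {**tags, "task_name": task_name}
    log.info(json.dumps({**tags, "event": "task_start"}))
    started = time.monotonic()
    try:
        result = fn()
    except Exception as exc:
        emit_metric("task_failure_total", 1, tags)
        log.info(json.dumps({**tags, "event": "task_failed", "error": str(exc)}))
        raise
    emit_metric("task_duration_seconds", time.monotonic() - started, tags)
    log.info(json.dumps({**tags, "event": "task_success"}))
    return result


# Usage: the same run_id and correlation_id flow into every signal the task emits.
run_id = str(uuid.uuid4())
tags = {"workflow_name": "nightly_training", "run_id": run_id,
        "correlation_id": run_id, "env": "dev"}
run_task("extract_raw", lambda: time.sleep(0.1), tags)
```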
Build one dashboard per critical workflow:
- Top: SLO status (green/amber/red), success rate, failure rate.
- Middle: runtime p50/p95, queue time, retries, critical tasks list.
- Bottom: last errors, flaky tasks, capacity (worker/executor usage).
Alert on impact, not on every blip:
- Page on-call for user-impacting SLO breaches and sustained failures.
- Notify (non-paging) for single-run failures or transient slowdowns.
- Include run_id, task_name, correlation_id, and next steps in every alert (see the payload sketch below).
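For instance, an alert payload carrying those fields might look like this sketch; the values and runbook URL are placeholders:

```python
# A hypothetical alert payload; adapt the fields to your alerting tool.
alert = {
    "title": "nightly_training missed its 03:00 UTC deadline",
    "severity": "page",  # "notify" for non-paging conditions
    "workflow_name": "nightly_training",
    "task_name": "train_model",
    "run_id": "run_2024_06_01",
    "correlation_id": "c0ffee42",
    "next_steps": "Check the upstream freshness panel; if the source is late, re-run from transform_clean.",
    "runbook": "https://wiki.example.com/runbooks/nightly-training",  # placeholder URL
}
```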
Operate and improve:
- Runbooks: one-liners to re-run a task, skip it, or backfill safely.
- Error budgets: track SLO burn and decide when to slow new changes (a burn-rate sketch follows this list).
- Postmortems: record the signals that were missing and add them.
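To illustrate the error-budget idea, here is a sketch that computes how much of a 99% reliability SLO's budget has been consumed over a rolling window; the counts are made up:

```python
# Hypothetical error-budget check for a reliability SLO.
def error_budget_burned(failed_runs: int, total_runs: int, slo_target: float = 0.99) -> float:
    """Return the fraction of the error budget consumed (1.0 means fully spent)."""
    if total_runs == 0:
        return 0.0
    allowed_failures = (1.0 - slo_target) * total_runs
    return failed_runs / allowed_failures if allowed_failures else float("inf")


# Example: 2 failures in 120 runs against a 99% target burns ~167% of the budget.
print(f"error budget burned: {error_budget_burned(failed_runs=2, total_runs=120):.0%}")
```

A burn above 100% means the SLO is already breached; a fast burn rate early in the window is a common paging condition, while a slower burn is usually a reason to pause risky changes rather than page.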
Worked examples
Example 1: API rate limits stalling a task (Airflow-like)
Symptoms: Task retries climb; runtime balloons; nightly deadline missed.
Signals to add:
- task_duration_seconds histogram with labels task_name, endpoint
- task_retry_total counter
- rate_limit_hits_total counter
Fix:
- Back off retries with jitter; respect the Retry-After header on 429 responses (see the sketch after this list).
- Alert when rate_limit_hits_total exceeds threshold for 10 minutes.
- Dashboard panel: p95 task duration and retry count trend.
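A sketch of that retry policy using the requests library; the endpoint and limits are placeholders, and it assumes the Retry-After header is given in seconds:

```python
# Exponential backoff with jitter that honors 429 Retry-After (assumed to be in seconds).
import random
import time

import requests


def fetch_with_backoff(url: str, max_attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    for attempt in range(max_attempts):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Increment rate_limit_hits_total here via your metrics client.
        retry_after = float(response.headers.get("Retry-After", 0))
        delay = max(retry_after, base_delay * 2 ** attempt)
        time.sleep(delay + random.uniform(0, delay * 0.1))  # jitter avoids synchronized retries
    raise RuntimeError(f"Still rate limited after {max_attempts} attempts: {url}")
```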
Outcome
MTTR drops because you can prove cause (rate limits) and apply throttle or caching.
Example 2: Data quality gate prevents bad training (Prefect-like)
Symptoms: Model metrics degrade intermittently.
Add a data quality task before training:
- Emit checks_failed_total and rows_checked.
- Block downstream on failure; log which rule failed and dataset_version.
- Notify (not page) on single failure; page if 3+ consecutive days fail.
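A sketch of such a gate is below; the rules, fields, and metric emission are invented for illustration:

```python
# A sketch of a data quality gate; rules and fields are illustrative.
import json


def quality_gate(rows: list, dataset_version: str, tags: dict) -> None:
    rules = {
        "no_null_labels": lambda r: r.get("label") is not None,
        "feature_in_range": lambda r: 0.0 <= r.get("feature_x", -1.0) <= 1.0,
    }
    failures = {name: sum(1 for r in rows if not rule(r)) for name, rule in rules.items()}
    failures = {name: n for name, n in failures.items() if n}
    # Emit rows_checked and checks_failed_total here; print stands in for a structured logger.
    print(json.dumps({**tags, "rows_checked": len(rows),
                      "checks_failed_total": sum(failures.values()),
                      "failed_rules": failures, "dataset_version": dataset_version}))
    if failures:
        # Raising blocks downstream training in most orchestrators.
        raise ValueError(f"Data quality gate failed for {dataset_version}: {failures}")


# Example: the missing label blocks the run, and the log says exactly which rule failed.
rows = [{"label": 1, "feature_x": 0.4}, {"label": None, "feature_x": 0.9}]
try:
    quality_gate(rows, "2024-06-01", {"workflow_name": "nightly_training", "run_id": "run_123"})
except ValueError as blocked:
    print(blocked)  # downstream training is never triggered
```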
Outcome
Failures become explainable and contained; training is protected.
Example 3: Slow pipeline start due to image pulls (Kubeflow-like)
Symptoms: High time_to_start (queue/wait) without CPU pressure.
Signals:
- time_to_start_seconds (started_at minus queued_at)
- image_pull_duration_seconds span attribute
Fix:
- Pre-pull images on nodes; set node affinity; keep a warm pool during the schedule window.
- Alert if time_to_start p95 > 5m for 30 min.
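As an illustration of that alert condition, the sketch below derives time_to_start from queued/started timestamps and checks the p95 threshold; the run metadata is fabricated, and in practice you would pull it from your orchestrator's metadata API:

```python
# Hypothetical computation of the time_to_start p95 from (queued_at, started_at) pairs.
from datetime import datetime, timedelta
from statistics import quantiles


def time_to_start_seconds(queued_at: datetime, started_at: datetime) -> float:
    """Queue/wait time: how long the run sat between being queued and actually starting."""
    return (started_at - queued_at).total_seconds()


# Fabricated run metadata; replace with data from your orchestrator.
base = datetime(2024, 6, 1, 1, 0)
recent_runs = [(base, base + timedelta(seconds=30 + 40 * i)) for i in range(20)]

waits = [time_to_start_seconds(queued, started) for queued, started in recent_runs]
p95_wait = quantiles(waits, n=20)[18]  # 19 cut points; index 18 approximates the p95
if p95_wait > 5 * 60:
    print(f"time_to_start p95 is {p95_wait:.0f}s: check image pulls and node capacity")
```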
Outcome
Timeliness SLO recovers; dashboards show queue time drop.
Exercises
Exercise 1: Define SLOs and alerts for a nightly model training workflow
Scenario: Workflow runs at 01:00 UTC. It fetches data, validates quality, trains, and publishes a model.
- Deliver: SLOs (reliability, timeliness, freshness), alert rules, and the key metrics you will chart.
Hints
- Use p95 for runtime thresholds to avoid alerting on rare outliers.
- Separate paging vs. notification conditions.
Exercise 2: Correlation and logging schema for a 3-task pipeline
Scenario: Tasks: extract_raw, transform_clean, train_model. You need consistent correlation across metrics, logs, and traces.
- Deliver: The minimal set of fields that must appear in logs and tags, and how run_id/correlation_id propagate.
Hints
- Think: workflow_name, task_name, run_id, correlation_id, dataset_version, code_commit, env.
- Keep it minimal and consistent.
- Use this checklist before submitting:
- Do my SLOs map to observable metrics?
- Are alerts tied to user impact or sustained issues?
- Do all signals include run_id and correlation_id?
- Did I include a simple runbook step in alerts?
Common mistakes
- Only having logs. Without metrics and traces, finding bottlenecks is slow.
- No correlation_id. You cannot stitch cross-task stories.
- Alerting on every failure. Creates noise and alert fatigue; use SLO-based alerts.
- No queue time metrics. You miss capacity and scheduling issues.
- Unstructured logs. Hard to search and aggregate; prefer JSON with standardized fields.
- Missing data quality signals. Failures appear as training bugs instead of upstream data issues.
How to self-check
- Pick a failed run. Can you find its cause in 5 minutes using dashboards?
- Pick a slow run. Can you tell if time was spent waiting or executing?
- Pick a bad-data incident. Can you see which check failed and where?
Practical projects
- Instrument a sample workflow with standardized tags and build a dashboard: top KPIs, runtime distributions, retries, and last 10 errors.
- Add a data quality gate to a pipeline and wire notification vs. paging alerts.
- Create a runbook for a common failure (e.g., upstream timeout) and attach it to the alert message.
Mini challenge
Your daily inference refresh occasionally overruns by 20 minutes when upstream tables are late. Design an alert strategy that is low-noise but ensures users are not impacted, and propose one dashboard panel that would help you catch the issue earlier. Write your answer in 5 bullet points.
Learning path
- Before this: Metrics and Logging Basics; Tracing Fundamentals.
- Now: Observability for Workflows (this lesson).
- Next: Data Lineage and Auditability; On-Call and Incident Response for Data Pipelines.
Next steps
- Implement the minimal tag set in your orchestrator code.
- Create a single-pane dashboard for your most critical workflow.
- Convert one noisy alert into an SLO-based alert with clear runbook steps.