Who this is for
- MLOps Engineers building and running pipelines in Airflow, Prefect, Dagster, Kubeflow, or similar.
- Data/Platform Engineers responsible for reliability, alerting, and on-call.
- Data Scientists who need trustworthy, debuggable scheduled workflows.
Prerequisites
- Basic understanding of workflow/DAG concepts (tasks, retries, schedules, backfills).
- Familiarity with logs and metrics (counters, gauges, histograms).
- Know how your orchestrator defines a run (dag_run/flow_run/pipeline_run).
Why this matters
In real MLOps, success is not just running a pipeline—it’s knowing when it’s slow, flaky, or blocked, and fixing it fast. Observability turns unknowns into signals you can act on.
- Pinpoint the task that made a nightly training job miss its SLA.
- Prove a model drift alert is due to a late upstream dataset, not your code.
- Reduce MTTR by correlating logs, metrics, and traces across tasks.
- Confidently backfill without flooding downstream services or paging on-call.
Concept explained simply
Observability for workflows means your pipelines “explain themselves” with signals:
- Metrics: numbers you chart (e.g., run duration, success rate, queue time).
- Logs: structured records that tell you events and context.
- Traces: timelines that show how tasks connect and where time is spent.
- Events: state changes (queued, running, success, failed, retried).
Mental model
Think of a workflow as a tree of steps. Each step emits signals tagged with the same run identifiers. If every step speaks the same language (correlation IDs, task names, dataset versions), you can stitch the story together and act fast.
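To make the mental model concrete, here is a small illustrative sketch; the field names and values are invented, not tied to any particular orchestrator. It shows how one task's metric, log record, and state event all carry the same identifiers:

```python
# Illustrative only: field names and values are hypothetical.
run_context = {
    "workflow_name": "nightly_training",
    "run_id": "run_2024_06_01",
    "correlation_id": "c0ffee42",
}

metric = {**run_context, "task_name": "train_model",
          "name": "task_duration_seconds", "value": 412.7}
log_record = {**run_context, "task_name": "train_model", "level": "INFO",
              "message": "training finished", "dataset_version": "2024-06-01"}
event = {**run_context, "task_name": "train_model", "state": "success",
         "timestamp": "2024-06-01T01:52:03Z"}

# Any of these records can be joined on run_id/correlation_id to reconstruct the run's story.
```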
What “good” looks like
- Every run has a unique run_id and a correlation_id that appear in logs, metrics, and traces.
- Dashboards show: success rate, failure rate, runtime, wait time, retries, and critical resource usage.
- Alerts are tied to SLOs, not noise (e.g., page only for user-impacting breaches).
- Data quality gates block bad data and explain why they blocked.
What to observe (signals and tags)
- Reliability: run_success_total, run_failure_total, task_retry_total, consecutive_failures.
- Timeliness: run_duration_seconds, task_duration_seconds, time_to_start (queue/wait), schedule_delay_seconds.
- Throughput: runs_started_total, tasks_per_run, items_processed_total.
- Resources: cpu_seconds, memory_mb, gpu_time_seconds (if applicable).
- Data quality: rows_checked, checks_failed_total, data_freshness_seconds.
- Lineage and versioning: dataset_name, dataset_version, model_version, code_commit.
- Context tags (critical): orchestrator, dag_name/flow_name, task_name, run_id/flow_run_id, correlation_id, environment, owner/team, region.
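As one way to emit a few of these signals, here is a sketch using the prometheus_client library; the metric and label choices mirror the lists above but are illustrative, not a required schema:

```python
# A sketch with prometheus_client; metric/label names mirror the lists above.
from prometheus_client import Counter, Histogram

COMMON_LABELS = ["orchestrator", "workflow_name", "task_name", "env"]

run_failure_total = Counter(
    "run_failure_total", "Workflow runs that ended in failure",
    ["orchestrator", "workflow_name", "env"],
)
task_retry_total = Counter("task_retry_total", "Task retries", COMMON_LABELS)
task_duration_seconds = Histogram(
    "task_duration_seconds", "Wall-clock task duration in seconds", COMMON_LABELS,
)

# Tag every observation with the same context.
task_duration_seconds.labels(
    orchestrator="airflow", workflow_name="nightly_training",
    task_name="train_model", env="prod",
).observe(412.7)
```

Per-run identifiers such as run_id and correlation_id are kept out of the metric labels here; in Prometheus-style backends they usually travel in logs and traces instead to avoid label-cardinality blowups. Adjust to whatever your metrics backend recommends.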
Minimal tag set to standardize
- orchestrator
- workflow_name
- task_name
- run_id
- correlation_id
- env
- version (code/model/data)
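One way to enforce this tag set is a small helper that every task calls when it emits a signal. The sketch below assumes the values arrive via environment variables, which is only one possible wiring:

```python
# A sketch of a standard tag helper; the environment-variable names are assumptions.
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunTags:
    orchestrator: str
    workflow_name: str
    task_name: str
    run_id: str
    correlation_id: str
    env: str
    version: str  # code/model/data version, whichever applies


def current_tags(task_name: str) -> dict:
    """Build the standard tag set once; attach it to every metric, log, and span."""
    run_id = os.getenv("RUN_ID", "unknown")
    return asdict(RunTags(
        orchestrator=os.getenv("ORCHESTRATOR", "airflow"),
        workflow_name=os.getenv("WORKFLOW_NAME", "nightly_training"),
        task_name=task_name,
        run_id=run_id,
        correlation_id=os.getenv("CORRELATION_ID", run_id),
        env=os.getenv("ENV", "dev"),
        version=os.getenv("GIT_COMMIT", "unknown"),
    ))
```

Falling back to run_id when no correlation_id is provided keeps cross-task correlation working even before upstream callers propagate their own ID.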
Designing observability for workflows
Define SLOs first:
- Reliability SLO: e.g., "The nightly workflow succeeds in ≥ 99% of runs over a rolling 30 days."
- Timeliness SLO: e.g., "The run completes by 03:00 UTC on weekdays."
- Freshness SLO: e.g., "The source table is updated within 2 hours of its schedule."
Instrument every task consistently:
- Emit metrics on every task transition (start, success, fail, retry).
- Use structured logs with JSON fields for run_id and correlation_id.
- Wrap tasks with tracing spans; propagate correlation_id downstream (a sketch follows this list).
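A minimal sketch of such a wrapper is below. It covers the metric-on-transition and structured-logging bullets; emit_metric is a stand-in for whatever metrics client you use, and a tracing span could be opened in the same place (for example with OpenTelemetry):

```python
# A sketch of a task wrapper; emit_metric is a placeholder for your metrics client.
import json
import logging
import sys
import time
import uuid

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("workflow")


def emit_metric(name: str, value: float, tags: dict) -> None:
    # Placeholder: forward to StatsD, Prometheus, CloudWatch, etc.
    log.info(json.dumps({"metric": name, "value": value, **tags}))


def run_task(task_name: str, fn, tags: dict):
    tags = {**tags, "task_name": task_name}
    log.info(json.dumps({**tags, "event": "task_start"}))
    started = time.monotonic()
    try:
        result = fn()
    except Exception as exc:
        emit_metric("task_failure_total", 1, tags)
        log.info(json.dumps({**tags, "event": "task_failed", "error": str(exc)}))
        raise
    emit_metric("task_duration_seconds", time.monotonic() - started, tags)
    log.info(json.dumps({**tags, "event": "task_success"}))
    return result


# Usage: the same run_id and correlation_id flow into every signal the task emits.
run_id = str(uuid.uuid4())
tags = {"workflow_name": "nightly_training", "run_id": run_id,
        "correlation_id": run_id, "env": "dev"}
run_task("extract_raw", lambda: time.sleep(0.1), tags)
```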
Build one dashboard per critical workflow:
- Top: SLO status (green/amber/red), success rate, failure rate.
- Middle: runtime p50/p95, queue time, retries, critical tasks list.
- Bottom: last errors, flaky tasks, capacity (worker/executor usage).
Alert on impact, not on every blip:
- Page on-call for user-impacting SLO breaches and sustained failures.
- Notify (non-paging) for single-run failures or transient slowdowns.
- Include run_id, task_name, correlation_id, and next steps in every alert (see the payload sketch below).
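For instance, an alert payload carrying those fields might look like this sketch; the values and runbook URL are placeholders:

```python
# A hypothetical alert payload; adapt the fields to your alerting tool.
alert = {
    "title": "nightly_training missed its 03:00 UTC deadline",
    "severity": "page",  # "notify" for non-paging conditions
    "workflow_name": "nightly_training",
    "task_name": "train_model",
    "run_id": "run_2024_06_01",
    "correlation_id": "c0ffee42",
    "next_steps": "Check the upstream freshness panel; if the source is late, re-run from transform_clean.",
    "runbook": "https://wiki.example.com/runbooks/nightly-training",  # placeholder URL
}
```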
Operate and improve:
- Runbooks: one-liners to re-run a task, skip it, or backfill safely.
- Error budgets: track SLO burn and decide when to slow new changes (a burn-rate sketch follows this list).
- Postmortems: record the signals that were missing and add them.
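To illustrate the error-budget idea, here is a sketch that computes how much of a 99% reliability SLO's budget has been consumed over a rolling window; the counts are made up:

```python
# Hypothetical error-budget check for a reliability SLO.
def error_budget_burned(failed_runs: int, total_runs: int, slo_target: float = 0.99) -> float:
    """Return the fraction of the error budget consumed (1.0 means fully spent)."""
    if total_runs == 0:
        return 0.0
    allowed_failures = (1.0 - slo_target) * total_runs
    return failed_runs / allowed_failures if allowed_failures else float("inf")


# Example: 2 failures in 120 runs against a 99% target burns ~167% of the budget.
print(f"error budget burned: {error_budget_burned(failed_runs=2, total_runs=120):.0%}")
```

A burn above 100% means the SLO is already breached; a fast burn rate early in the window is a common paging condition, while a slower burn is usually a reason to pause risky changes rather than page.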
Worked examples
Example 1: API rate limits stalling a task (Airflow-like)
Symptoms: Task retries climb; runtime balloons; nightly deadline missed.
Signals to add:
- task_duration_seconds histogram with labels task_name, endpoint
- task_retry_total counter
- rate_limit_hits_total counter
Fix:
- Back off retries with jitter; respect the Retry-After header on 429 responses (see the sketch after this list).
- Alert when rate_limit_hits_total exceeds threshold for 10 minutes.
- Dashboard panel: p95 task duration and retry count trend.
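A sketch of that retry policy using the requests library; the endpoint and limits are placeholders, and it assumes the Retry-After header is given in seconds:

```python
# Exponential backoff with jitter that honors 429 Retry-After (assumed to be in seconds).
import random
import time

import requests


def fetch_with_backoff(url: str, max_attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    for attempt in range(max_attempts):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Increment rate_limit_hits_total here via your metrics client.
        retry_after = float(response.headers.get("Retry-After", 0))
        delay = max(retry_after, base_delay * 2 ** attempt)
        time.sleep(delay + random.uniform(0, delay * 0.1))  # jitter avoids synchronized retries
    raise RuntimeError(f"Still rate limited after {max_attempts} attempts: {url}")
```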
Outcome
MTTR drops because you can prove cause (rate limits) and apply throttle or caching.
Example 2: Data quality gate prevents bad training (Prefect-like)
Symptoms: Model metrics degrade intermittently.
Add a data quality task before training:
- Emit checks_failed_total and rows_checked.
- Block downstream on failure; log which rule failed and dataset_version.
- Notify (not page) on single failure; page if 3+ consecutive days fail.
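A sketch of such a gate is below; the rules, fields, and metric emission are invented for illustration:

```python
# A sketch of a data quality gate; rules and fields are illustrative.
import json


def quality_gate(rows: list, dataset_version: str, tags: dict) -> None:
    rules = {
        "no_null_labels": lambda r: r.get("label") is not None,
        "feature_in_range": lambda r: 0.0 <= r.get("feature_x", -1.0) <= 1.0,
    }
    failures = {name: sum(1 for r in rows if not rule(r)) for name, rule in rules.items()}
    failures = {name: n for name, n in failures.items() if n}
    # Emit rows_checked and checks_failed_total here; print stands in for a structured logger.
    print(json.dumps({**tags, "rows_checked": len(rows),
                      "checks_failed_total": sum(failures.values()),
                      "failed_rules": failures, "dataset_version": dataset_version}))
    if failures:
        # Raising blocks downstream training in most orchestrators.
        raise ValueError(f"Data quality gate failed for {dataset_version}: {failures}")


# Example: the missing label blocks the run, and the log says exactly which rule failed.
rows = [{"label": 1, "feature_x": 0.4}, {"label": None, "feature_x": 0.9}]
try:
    quality_gate(rows, "2024-06-01", {"workflow_name": "nightly_training", "run_id": "run_123"})
except ValueError as blocked:
    print(blocked)  # downstream training is never triggered
```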
Outcome
Failures become explainable and contained; training is protected.
Example 3: Slow pipeline start due to image pulls (Kubeflow-like)
Symptoms: High time_to_start (queue/wait) without CPU pressure.
Signals:
- time_to_start_seconds (started_at minus queued_at)
- image_pull_duration_seconds span attribute
Fix:
- Pre-pull images on nodes; set node affinity; keep a warm pool during the schedule window.
- Alert if time_to_start p95 > 5m for 30 min.
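As an illustration of that alert condition, the sketch below derives time_to_start from queued/started timestamps and checks the p95 threshold; the run metadata is fabricated, and in practice you would pull it from your orchestrator's metadata API:

```python
# Hypothetical computation of the time_to_start p95 from (queued_at, started_at) pairs.
from datetime import datetime, timedelta
from statistics import quantiles


def time_to_start_seconds(queued_at: datetime, started_at: datetime) -> float:
    """Queue/wait time: how long the run sat between being queued and actually starting."""
    return (started_at - queued_at).total_seconds()


# Fabricated run metadata; replace with data from your orchestrator.
base = datetime(2024, 6, 1, 1, 0)
recent_runs = [(base, base + timedelta(seconds=30 + 40 * i)) for i in range(20)]

waits = [time_to_start_seconds(queued, started) for queued, started in recent_runs]
p95_wait = quantiles(waits, n=20)[18]  # 19 cut points; index 18 approximates the p95
if p95_wait > 5 * 60:
    print(f"time_to_start p95 is {p95_wait:.0f}s: check image pulls and node capacity")
```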
Outcome
Timeliness SLO recovers; dashboards show queue time drop.
Exercises
Exercise 1: Define SLOs and alerts for a nightly model training workflow
Scenario: Workflow runs at 01:00 UTC. It fetches data, validates quality, trains, and publishes a model.
- Deliver: SLOs (reliability, timeliness, freshness), alert rules, and the key metrics you will chart.
Hints
- Use p95 for runtime thresholds to avoid alerting on rare outliers.
- Separate paging vs. notification conditions.
Exercise 2: Correlation and logging schema for a 3-task pipeline
Scenario: Tasks: extract_raw, transform_clean, train_model. You need consistent correlation across metrics, logs, and traces.
- Deliver: The minimal set of fields that must appear in logs and tags, and how run_id/correlation_id propagate.
Hints
- Think: workflow_name, task_name, run_id, correlation_id, dataset_version, code_commit, env.
- Keep it minimal and consistent.
- Use this checklist before submitting:
- Do my SLOs map to observable metrics?
- Are alerts tied to user impact or sustained issues?
- Do all signals include run_id and correlation_id?
- Did I include a simple runbook step in alerts?
Common mistakes
- Only having logs. Without metrics and traces, finding bottlenecks is slow.
- No correlation_id. You cannot stitch cross-task stories.
- Alerting on every failure. Creates noise and alert fatigue; use SLO-based alerts.
- No queue time metrics. You miss capacity and scheduling issues.
- Unstructured logs. Hard to search and aggregate; prefer JSON with standardized fields.
- Missing data quality signals. Failures appear as training bugs instead of upstream data issues.
How to self-check
- Pick a failed run. Can you find its cause in 5 minutes using dashboards?
- Pick a slow run. Can you tell if time was spent waiting or executing?
- Pick a bad-data incident. Can you see which check failed and where?
Practical projects
- Instrument a sample workflow with standardized tags and build a dashboard: top KPIs, runtime distributions, retries, and last 10 errors.
- Add a data quality gate to a pipeline and wire notification vs. paging alerts.
- Create a runbook for a common failure (e.g., upstream timeout) and attach it to the alert message.
Mini challenge
Your daily inference refresh occasionally overruns by 20 minutes when upstream tables are late. Design an alert strategy that is low-noise but ensures users are not impacted, and propose one dashboard panel that would help you catch the issue earlier. Write your answer in 5 bullet points.
Learning path
- Before this: Metrics and Logging Basics; Tracing Fundamentals.
- Now: Observability for Workflows (this lesson).
- Next: Data Lineage and Auditability; On-Call and Incident Response for Data Pipelines.
Next steps
- Implement the minimal tag set in your orchestrator code.
- Create a single-pane dashboard for your most critical workflow.
- Convert one noisy alert into an SLO-based alert with clear runbook steps.