Who this is for
This lesson is for MLOps Engineers and ML practitioners who design, run, and maintain data and ML pipelines using workflow engines (e.g., Airflow, Prefect, Dagster). If you touch production ML systems, DAG design and dependency management are part of your daily work.
Prerequisites
- Basic understanding of batch vs. streaming workloads
- Familiarity with ML pipeline stages (ingest, feature, train, evaluate, deploy)
- Comfort with scheduling concepts (cron, intervals, backfills)
Why this matters
In production ML, correctness and reliability depend on well-designed DAGs:
- Backfilling historical runs without double-counting or missing data
- Parallelizing workloads safely to reduce runtime and cost
- Expressing true data dependencies (not just time) for reproducibility
- Coordinating cross-pipeline events (e.g., train only after data lands)
- Handling failures with retries and idempotency to avoid side effects
Concept explained simply
A DAG (Directed Acyclic Graph) is a set of tasks (nodes) connected by dependencies (edges) that point forward and never loop back. Each task runs only when all its upstream tasks have succeeded (or as configured), ensuring order and correctness.
Mental model
Picture an assembly line. Each station starts only when the parts it needs arrive. There are no loops on the conveyor. If a station fails, it can retry without producing the same part twice.
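In code, a DAG is just a set of tasks plus a dependency map, and a topological sort yields a valid run order while rejecting cycles. A minimal, engine-agnostic sketch using Python's standard library (task names are illustrative):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the upstream tasks it waits for (edges point forward).
deps = {
    "validate_raw": {"extract_raw"},
    "compute_features": {"validate_raw"},
    "train_model": {"compute_features"},
    "evaluate_model": {"compute_features", "train_model"},
}

# static_order() yields an order in which every task follows its upstreams;
# it raises graphlib.CycleError if the graph is not acyclic.
print(list(TopologicalSorter(deps).static_order()))
# e.g. ['extract_raw', 'validate_raw', 'compute_features', 'train_model', 'evaluate_model']
```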
Deeper dive: Types of dependencies
- Data dependency: Task B needs outputs/artifacts from Task A
- Time dependency: Task runs at certain schedules (e.g., daily 02:00)
- External dependency: Wait for an upstream job in another DAG
- Conditional dependency: Branching (if metrics pass, continue; else stop)
- Resource dependency: Limited pool/semaphore (e.g., 1 GPU)
- Retry/idempotency dependency: Safe to re-run without duplicating side effects (the sketch after this list shows where each type is declared)
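One way to see these six types together is as fields on a task definition. A hypothetical task spec follows; it is not any real engine's API, and the field names are invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskSpec:
    """Hypothetical task spec: one field per dependency type above."""
    name: str
    upstream: list = field(default_factory=list)  # data dependency: required upstream tasks
    schedule: Optional[str] = None  # time dependency: cron, e.g. "0 2 * * *"
    wait_for: Optional[str] = None  # external dependency: event in another DAG
    run_if: Optional[str] = None    # conditional dependency: branch predicate
    pool: Optional[str] = None      # resource dependency: named semaphore, e.g. "gpu"
    retries: int = 0                # retry policy; the task body must be idempotent

# Training needs features (data), waits for upstream data to land (external),
# shares the single-GPU pool (resource), and retries transient failures.
train = TaskSpec(
    name="train_model",
    upstream=["compute_features"],
    wait_for="ingest_dag.data_landed",
    pool="gpu",
    retries=3,
)
```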
Principled DAG design
- Make tasks small and single-purpose (clear inputs/outputs)
- Declare explicit data dependencies; avoid relying on schedule order
- Ensure idempotency: reruns should not duplicate side effects (see the write sketch after this list)
- Parameterize by run date/partition; avoid global mutable state
- Name tasks clearly: verb-object (e.g., extract_raw, compute_features)
- Use retries with exponential backoff for transient failures
- Limit concurrency via pools/queues to respect resource quotas
- Capture lineage and artifacts (paths, URIs, versions) explicitly
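To make the idempotency and partitioning principles concrete, here is a sketch of a backfill-safe write, assuming a date-partitioned directory layout (paths and file format are illustrative):

```python
import json
import os
import tempfile

def publish_features(run_date: str, rows: list, root: str = "/data/features") -> str:
    """Write one date partition; safe to re-run for the same run_date."""
    out_dir = os.path.join(root, f"dt={run_date}")  # run-specific partition path
    os.makedirs(out_dir, exist_ok=True)
    final_path = os.path.join(out_dir, "features.json")

    # Write to a temp file in the same directory, then atomically replace,
    # so a retry overwrites the partition instead of appending duplicates.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    os.replace(tmp_path, final_path)  # atomic within one filesystem
    return final_path
```

Because the output path is derived from run_date, backfills for different days never collide with each other or with current outputs.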
Step-by-step to design a DAG
- List required outputs and success criteria
- Break outputs into minimal tasks with clear inputs/outputs
- Draw dependencies as arrows (A → B)
- Mark parallelizable tasks; define pools/limits
- Define retry logic and idempotent writes
- Plan backfill and late data handling (see the backfill sketch after this list)
- Document parameters (dates, partitions, model versions)
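For the backfill-planning step, a sketch that enumerates one parameter set per logical date; with date-partitioned, idempotent writes like the one above, re-running any single day only rewrites that day's partition (the dates are examples):

```python
from datetime import date, timedelta

def backfill_params(start: date, end: date) -> list:
    """One parameter dict per logical date, inclusive of both endpoints."""
    days = (end - start).days
    return [{"run_date": (start + timedelta(days=i)).isoformat()}
            for i in range(days + 1)]

# Plan a 30-day backfill; each dict parameterizes one independent DAG run.
plan = backfill_params(date(2024, 1, 1), date(2024, 1, 30))
print(len(plan), plan[0], plan[-1])
# 30 {'run_date': '2024-01-01'} {'run_date': '2024-01-30'}
```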
Worked examples
Example 1: Daily training DAG
Goal: Train a model daily on yesterday's data and register it if metrics pass.
- extract_raw → validate_raw → compute_features → train_model → evaluate_model → register_candidate (conditional on metrics)
- Dependencies: evaluate_model depends on train_model and compute_features; register_candidate depends on evaluate_model passing the metric threshold
- Parallelism: none in this chain; extract_raw and validate_raw run sequentially, and nothing downstream starts until validate_raw completes
- Idempotency: write artifacts to run-specific paths (e.g., date partitions)
See edges
A: extract_raw
B: validate_raw (depends on A)
C: compute_features (depends on B)
D: train_model (depends on C)
E: evaluate_model (depends on C, D)
F: register_candidate (depends on E; runs only if metrics pass)
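A runnable plain-Python sketch of this chain; the task bodies are stubs, and the 0.8 metric threshold is an assumed example value:

```python
def extract_raw(run_date: str) -> list:                 # A
    return [0.1, 0.2, 0.3]  # stub: read yesterday's raw data

def validate_raw(raw: list) -> None:                    # B (A)
    if not raw:
        raise ValueError("empty input")  # failing here blocks everything downstream

def compute_features(raw: list) -> list:                # C (B)
    return [x * 10 for x in raw]

def train_model(features: list) -> float:               # D (C)
    return sum(features) / len(features)  # the "model" is just a number here

def evaluate_model(features: list, model: float) -> float:  # E (C, D)
    return 0.9  # stub metric

def register_candidate(model: float, run_date: str) -> None:  # F (E passes)
    print(f"registered candidate for {run_date}")

def run_daily_training(run_date: str, threshold: float = 0.8) -> None:
    raw = extract_raw(run_date)
    validate_raw(raw)
    features = compute_features(raw)
    model = train_model(features)
    metric = evaluate_model(features, model)
    if metric >= threshold:  # conditional edge E -> F
        register_candidate(model, run_date)

run_daily_training("2024-01-01")
```

In a real engine each function becomes a task that reads and writes date-partitioned artifacts, and the if-gate becomes a branch.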
Example 2: Batch scoring with fan-out/fan-in
Goal: Score N partitions in parallel and publish a single merged report.
- list_partitions → map(score_partition[i]) → reduce(merge_scores)
- Dependencies: Each score_partition depends on list_partitions; merge_scores depends on all score_partition[i]
- Resource controls: GPU pool size 2; at most 2 partitions score concurrently
See edges
A: list_partitions
B[i]: score_partition[i] (depends on A)
C: merge_scores (depends on all B[i])
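A sketch of the same shape in plain Python, using a thread pool of size 2 to stand in for the GPU pool (scoring is stubbed):

```python
from concurrent.futures import ThreadPoolExecutor

def list_partitions() -> list:                 # A
    return [f"part-{i}" for i in range(8)]

def score_partition(part: str) -> dict:        # B[i], depends on A
    return {"partition": part, "score": 0.5}   # stub scoring

def merge_scores(results: list) -> dict:       # C, depends on all B[i]
    return {"count": len(results),
            "mean": sum(r["score"] for r in results) / len(results)}

parts = list_partitions()
# max_workers=2 models a GPU pool of size 2: at most two partitions score
# concurrently; the remaining six queue until a slot frees up.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(score_partition, parts))

print(merge_scores(results))  # fan-in: runs only after every B[i] completes
```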
Example 3: Monitoring and auto-retrain
Goal: Monitor live metrics daily; if drift detected, trigger retrain DAG.
- ingest_monitoring → compute_drift → branch: [trigger_retrain, end]
- External dependency: retrain DAG waits on trigger event
- Idempotency: triggers are deduplicated by run date
See edges
M1: ingest_monitoring
M2: compute_drift (depends on M1)
M3a: trigger_retrain (depends on M2; runs if drift = true)
M3b: end (depends on M2; runs if drift = false)
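A sketch of the branch and the deduplicated trigger, using a marker file keyed by run date to make the trigger idempotent (the paths and drift score are illustrative):

```python
from pathlib import Path

TRIGGER_DIR = Path("/tmp/retrain_triggers")  # illustrative trigger store

def compute_drift(run_date: str) -> float:   # M2
    return 0.15  # stub drift score

def trigger_retrain(run_date: str) -> bool:  # M3a
    """Emit at most one retrain trigger per run date."""
    TRIGGER_DIR.mkdir(parents=True, exist_ok=True)
    marker = TRIGGER_DIR / f"{run_date}.trigger"
    if marker.exists():
        return False  # already triggered for this date; a re-run is a no-op
    marker.touch()
    # a real pipeline would emit the cross-DAG event here
    return True

def monitoring_run(run_date: str, drift_threshold: float = 0.1) -> None:
    drift = compute_drift(run_date)      # M2 (ingest step omitted for brevity)
    if drift > drift_threshold:          # branch: M3a
        trigger_retrain(run_date)
    # else: branch M3b -- end quietly

monitoring_run("2024-01-01")
```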
Common dependency patterns
- Fan-out/fan-in: map over partitions then aggregate
- Map-reduce: transformation followed by summarization
- Conditional branch: continue or halt on metric thresholds
- External sensors: wait for upstream table or file arrival
- Backfill-safe design: parameterize by date/partition; avoid global overwrites
- Resource pools/semaphores: prevent overload of shared systems
- Cross-DAG dependencies: explicit, versioned, and deduplicated triggers
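For the external-sensor pattern, a minimal polling sketch; the marker filename, timeout, and poke interval are illustrative defaults:

```python
import time
from pathlib import Path

def wait_for_file(path: str, timeout_s: float = 3600, poke_s: float = 30) -> bool:
    """Poll until an upstream artifact appears; False on timeout."""
    deadline = time.monotonic() + timeout_s
    target = Path(path)
    while time.monotonic() < deadline:
        if target.exists():
            return True   # upstream landed; downstream tasks may start
        time.sleep(poke_s)
    return False          # timed out; fail or alert rather than proceed

# e.g. wait_for_file("/data/raw/dt=2024-01-01/_SUCCESS")
```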
Common mistakes and how to self-check
- Hidden dependencies: task assumes files exist without declaring upstream
- Monolithic tasks: hard to retry, slow, and brittle
- Non-idempotent side effects: duplicate rows, double notifications
- Overusing in-memory passing: relying on scheduler state instead of durable artifact storage
- Accidental cycles via time: a task reads and then mutates the outputs of previous runs, creating a hidden loop across runs
- Concurrency explosions: no pools or limits, leading to resource exhaustion and throttling
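Several of these mistakes are avoided by pairing bounded retries with idempotent tasks. A minimal retry wrapper with exponential backoff and jitter (the retry count and delays are example settings):

```python
import random
import time

def run_with_retries(fn, retries: int = 3, base_s: float = 2.0):
    """Call fn(); on failure, retry with exponential backoff plus jitter.

    Only safe if fn is idempotent: a retry must not duplicate side effects.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries; surface the failure to the scheduler
            # waits ~2s, 4s, 8s, ... plus up to 1s of jitter to avoid
            # synchronized retry storms across parallel tasks
            time.sleep(base_s * 2 ** attempt + random.random())
```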
Self-check checklist
- Every task has explicit inputs and outputs
- Re-running a task does not duplicate side effects
- Backfill plan will not overwrite global paths
- Parallel tasks are limited by resource pools
- Branching is deterministic and well-logged
- Cross-DAG dependencies are versioned and deduplicated
Exercises you will do
Do these in a notebook or text editor. Write your DAG edges as lines like A → B.
Exercise 1: Daily training DAG dependencies
Design edges and notes for idempotency and parallelism for a daily training pipeline.
What to produce
- Edges list (A → B...)
- Which tasks can run in parallel
- Where to apply retries and how to ensure idempotency
Exercise 2: Partitioned batch scoring with retries
Design a fan-out/fan-in scoring DAG over 8 partitions with GPU pool=2 and retry policy.
What to produce
- Edges list
- Concurrency limits
- Retry/backoff settings and idempotent output scheme
Checklist before you compare with solutions
- All edges reflect true data needs (not just time)
- Idempotency is addressed for each side-effecting task
- Concurrency limits are realistic
- Backfills will not corrupt current outputs
Practical projects
- Project 1: Build a daily feature pipeline with backfill. Include validate → compute_features → publish_features. Add a backfill plan for the last 30 days.
- Project 2: Create a fan-out scoring job over partitions with GPU pool limits and a merge step. Produce a run report per date.
- Project 3: Implement a monitoring DAG that branches on drift and triggers a retrain DAG. Ensure trigger deduplication by run key.
Learning path
- Start with expressing clear single-DAG dependencies
- Add partitioning and fan-out/fan-in
- Introduce external sensors and cross-DAG triggers
- Harden with retries, pools, and idempotent storage layouts
- Practice backfills and late-arriving data scenarios
Next steps
- Complete the exercises, then take the Quick Test below
- Integrate these patterns into a real pipeline at small scale
- Gradually add concurrency and backfill complexity
Mini challenge
Design a DAG that: (1) computes features for yesterday, (2) scores live traffic samples, (3) triggers retrain if drift > 0.1, (4) performs a canary deploy, and (5) promotes to production if canary error delta < 2%.
Acceptance criteria
- List edges and branch conditions
- Show resource limits for scoring
- Describe idempotent artifact locations
- Explain the backfill impact on (1) and how backfills stay isolated from (4)-(5)