Who this is for
This lesson is for MLOps Engineers and ML practitioners who design, run, and maintain data and ML pipelines using workflow engines (e.g., Airflow, Prefect, Dagster). If you touch production ML systems, DAG design and dependency management are part of your daily work.
Prerequisites
- Basic understanding of batch vs. streaming workloads
- Familiarity with ML pipeline stages (ingest, feature, train, evaluate, deploy)
- Comfort with scheduling concepts (cron, intervals, backfills)
Why this matters
In production ML, correctness and reliability depend on well-designed DAGs:
- Backfilling historical runs without double-counting or missing data
- Parallelizing workloads safely to reduce runtime and cost
- Expressing true data dependencies (not just time) for reproducibility
- Coordinating cross-pipeline events (e.g., train only after data lands)
- Handling failures with retries and idempotency to avoid side effects
Concept explained simply
A DAG (Directed Acyclic Graph) is a set of tasks (nodes) connected by dependencies (edges) that point forward and never loop back. Each task runs only when all its upstream tasks have succeeded (or as configured), ensuring order and correctness.
Mental model
Picture an assembly line. Each station starts only when the parts it needs arrive. There are no loops on the conveyor. If a station fails, it can retry without producing the same part twice.
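In code, a DAG is just a set of tasks plus a dependency map, and a topological sort yields a valid run order while rejecting cycles. A minimal, engine-agnostic sketch using Python's standard library (task names are illustrative):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the upstream tasks it waits for (edges point forward).
deps = {
    "validate_raw": {"extract_raw"},
    "compute_features": {"validate_raw"},
    "train_model": {"compute_features"},
    "evaluate_model": {"compute_features", "train_model"},
}

# static_order() yields an order in which every task follows its upstreams;
# it raises graphlib.CycleError if the graph is not acyclic.
print(list(TopologicalSorter(deps).static_order()))
# e.g. ['extract_raw', 'validate_raw', 'compute_features', 'train_model', 'evaluate_model']
```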
Deeper dive: Types of dependencies
- Data dependency: Task B needs outputs/artifacts from Task A
- Time dependency: Task runs at certain schedules (e.g., daily 02:00)
- External dependency: Wait for an upstream job in another DAG
- Conditional dependency: Branching (if metrics pass, continue; else stop)
- Resource dependency: Limited pool/semaphore (e.g., 1 GPU)
- Retry/idempotency dependency: Safe to re-run without duplicating side effects (the sketch after this list shows where each type is declared)
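One way to see these six types together is as fields on a task definition. A hypothetical task spec follows; it is not any real engine's API, and the field names are invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskSpec:
    """Hypothetical task spec: one field per dependency type above."""
    name: str
    upstream: list = field(default_factory=list)  # data dependency: required upstream tasks
    schedule: Optional[str] = None  # time dependency: cron, e.g. "0 2 * * *"
    wait_for: Optional[str] = None  # external dependency: event in another DAG
    run_if: Optional[str] = None    # conditional dependency: branch predicate
    pool: Optional[str] = None      # resource dependency: named semaphore, e.g. "gpu"
    retries: int = 0                # retry policy; the task body must be idempotent

# Training needs features (data), waits for upstream data to land (external),
# shares the single-GPU pool (resource), and retries transient failures.
train = TaskSpec(
    name="train_model",
    upstream=["compute_features"],
    wait_for="ingest_dag.data_landed",
    pool="gpu",
    retries=3,
)
```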
Principled DAG design
- Make tasks small and single-purpose (clear inputs/outputs)
- Declare explicit data dependencies; avoid relying on schedule order
- Ensure idempotency: reruns should not duplicate side effects (see the write sketch after this list)
- Parameterize by run date/partition; avoid global mutable state
- Name tasks clearly: verb-object (e.g., extract_raw, compute_features)
- Use retries with exponential backoff for transient failures
- Limit concurrency via pools/queues to respect resource quotas
- Capture lineage and artifacts (paths, URIs, versions) explicitly
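To make the idempotency and partitioning principles concrete, here is a sketch of a backfill-safe write, assuming a date-partitioned directory layout (paths and file format are illustrative):

```python
import json
import os
import tempfile

def publish_features(run_date: str, rows: list, root: str = "/data/features") -> str:
    """Write one date partition; safe to re-run for the same run_date."""
    out_dir = os.path.join(root, f"dt={run_date}")  # run-specific partition path
    os.makedirs(out_dir, exist_ok=True)
    final_path = os.path.join(out_dir, "features.json")

    # Write to a temp file in the same directory, then atomically replace,
    # so a retry overwrites the partition instead of appending duplicates.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    os.replace(tmp_path, final_path)  # atomic within one filesystem
    return final_path
```

Because the output path is derived from run_date, backfills for different days never collide with each other or with current outputs.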
Step-by-step to design a DAG
- List required outputs and success criteria
- Break outputs into minimal tasks with clear inputs/outputs
- Draw dependencies as arrows (A → B)
- Mark parallelizable tasks; define pools/limits
- Define retry logic and idempotent writes
- Plan backfill and late data handling (see the backfill sketch after this list)
- Document parameters (dates, partitions, model versions)
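For the backfill-planning step, a sketch that enumerates one parameter set per logical date; with date-partitioned, idempotent writes like the one above, re-running any single day only rewrites that day's partition (the dates are examples):

```python
from datetime import date, timedelta

def backfill_params(start: date, end: date) -> list:
    """One parameter dict per logical date, inclusive of both endpoints."""
    days = (end - start).days
    return [{"run_date": (start + timedelta(days=i)).isoformat()}
            for i in range(days + 1)]

# Plan a 30-day backfill; each dict parameterizes one independent DAG run.
plan = backfill_params(date(2024, 1, 1), date(2024, 1, 30))
print(len(plan), plan[0], plan[-1])
# 30 {'run_date': '2024-01-01'} {'run_date': '2024-01-30'}
```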
Worked examples
Example 1: Daily training DAG
Goal: Train a model daily on yesterday's data and register it if metrics pass.
- extract_raw → validate_raw → compute_features → train_model → evaluate_model → register_candidate (conditional on metrics)
- Dependencies: evaluate_model depends on train_model and compute_features; register_candidate depends on evaluate_model passing the metric threshold
- Parallelism: none in this chain; extract_raw and validate_raw run sequentially, and nothing downstream starts until validate_raw completes
- Idempotency: write artifacts to run-specific paths (e.g., date partitions)
See edges
A: extract_raw
B: validate_raw (depends on A)
C: compute_features (depends on B)
D: train_model (depends on C)
E: evaluate_model (depends on C, D)
F: register_candidate (depends on E; runs only if metrics pass)
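A runnable plain-Python sketch of this chain; the task bodies are stubs, and the 0.8 metric threshold is an assumed example value:

```python
def extract_raw(run_date: str) -> list:                 # A
    return [0.1, 0.2, 0.3]  # stub: read yesterday's raw data

def validate_raw(raw: list) -> None:                    # B (A)
    if not raw:
        raise ValueError("empty input")  # failing here blocks everything downstream

def compute_features(raw: list) -> list:                # C (B)
    return [x * 10 for x in raw]

def train_model(features: list) -> float:               # D (C)
    return sum(features) / len(features)  # the "model" is just a number here

def evaluate_model(features: list, model: float) -> float:  # E (C, D)
    return 0.9  # stub metric

def register_candidate(model: float, run_date: str) -> None:  # F (E passes)
    print(f"registered candidate for {run_date}")

def run_daily_training(run_date: str, threshold: float = 0.8) -> None:
    raw = extract_raw(run_date)
    validate_raw(raw)
    features = compute_features(raw)
    model = train_model(features)
    metric = evaluate_model(features, model)
    if metric >= threshold:  # conditional edge E -> F
        register_candidate(model, run_date)

run_daily_training("2024-01-01")
```

In a real engine each function becomes a task that reads and writes date-partitioned artifacts, and the if-gate becomes a branch.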
Example 2: Batch scoring with fan-out/fan-in
Goal: Score N partitions in parallel and publish a single merged report.
- list_partitions → map(score_partition[i]) → reduce(merge_scores)
- Dependencies: Each score_partition depends on list_partitions; merge_scores depends on all score_partition[i]
- Resource controls: GPU pool size 2; at most 2 partitions score concurrently
See edges
A: list_partitions
B[i]: score_partition[i] (depends on A)
C: merge_scores (depends on all B[i])
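A sketch of the same shape in plain Python, using a thread pool of size 2 to stand in for the GPU pool (scoring is stubbed):

```python
from concurrent.futures import ThreadPoolExecutor

def list_partitions() -> list:                 # A
    return [f"part-{i}" for i in range(8)]

def score_partition(part: str) -> dict:        # B[i], depends on A
    return {"partition": part, "score": 0.5}   # stub scoring

def merge_scores(results: list) -> dict:       # C, depends on all B[i]
    return {"count": len(results),
            "mean": sum(r["score"] for r in results) / len(results)}

parts = list_partitions()
# max_workers=2 models a GPU pool of size 2: at most two partitions score
# concurrently; the remaining six queue until a slot frees up.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(score_partition, parts))

print(merge_scores(results))  # fan-in: runs only after every B[i] completes
```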
Example 3: Monitoring and auto-retrain
Goal: Monitor live metrics daily; if drift detected, trigger retrain DAG.
- ingest_monitoring → compute_drift → branch: [trigger_retrain, end]
- External dependency: retrain DAG waits on trigger event
- Idempotency: triggers are deduplicated by run date
See edges
M1: ingest_monitoring
M2: compute_drift (depends on M1)
M3a: trigger_retrain (depends on M2; runs if drift = true)
M3b: end (depends on M2; runs if drift = false)
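A sketch of the branch and the deduplicated trigger, using a marker file keyed by run date to make the trigger idempotent (the paths and drift score are illustrative):

```python
from pathlib import Path

TRIGGER_DIR = Path("/tmp/retrain_triggers")  # illustrative trigger store

def compute_drift(run_date: str) -> float:   # M2
    return 0.15  # stub drift score

def trigger_retrain(run_date: str) -> bool:  # M3a
    """Emit at most one retrain trigger per run date."""
    TRIGGER_DIR.mkdir(parents=True, exist_ok=True)
    marker = TRIGGER_DIR / f"{run_date}.trigger"
    if marker.exists():
        return False  # already triggered for this date; a re-run is a no-op
    marker.touch()
    # a real pipeline would emit the cross-DAG event here
    return True

def monitoring_run(run_date: str, drift_threshold: float = 0.1) -> None:
    drift = compute_drift(run_date)      # M2 (ingest step omitted for brevity)
    if drift > drift_threshold:          # branch: M3a
        trigger_retrain(run_date)
    # else: branch M3b -- end quietly

monitoring_run("2024-01-01")
```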
Common dependency patterns
- Fan-out/fan-in: map over partitions then aggregate
- Map-reduce: transformation followed by summarization
- Conditional branch: continue or halt on metric thresholds
- External sensors: wait for upstream table or file arrival
- Backfill-safe design: parameterize by date/partition; avoid global overwrites
- Resource pools/semaphores: prevent overload of shared systems
- Cross-DAG dependencies: explicit, versioned, and deduplicated triggers
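For the external-sensor pattern, a minimal polling sketch; the marker filename, timeout, and poke interval are illustrative defaults:

```python
import time
from pathlib import Path

def wait_for_file(path: str, timeout_s: float = 3600, poke_s: float = 30) -> bool:
    """Poll until an upstream artifact appears; False on timeout."""
    deadline = time.monotonic() + timeout_s
    target = Path(path)
    while time.monotonic() < deadline:
        if target.exists():
            return True   # upstream landed; downstream tasks may start
        time.sleep(poke_s)
    return False          # timed out; fail or alert rather than proceed

# e.g. wait_for_file("/data/raw/dt=2024-01-01/_SUCCESS")
```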
Common mistakes and how to self-check
- Hidden dependencies: task assumes files exist without declaring upstream
- Monolithic tasks: hard to retry, slow, and brittle
- Non-idempotent side effects: duplicate rows, double notifications
- Overusing in-memory passing: relying on scheduler state instead of durable artifact storage
- Accidental cycles via time: a task reads and then mutates the outputs of previous runs, creating a hidden loop across runs
- Concurrency explosions: no pools or limits, leading to resource exhaustion and throttling
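Several of these mistakes are avoided by pairing bounded retries with idempotent tasks. A minimal retry wrapper with exponential backoff and jitter (the retry count and delays are example settings):

```python
import random
import time

def run_with_retries(fn, retries: int = 3, base_s: float = 2.0):
    """Call fn(); on failure, retry with exponential backoff plus jitter.

    Only safe if fn is idempotent: a retry must not duplicate side effects.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries; surface the failure to the scheduler
            # waits ~2s, 4s, 8s, ... plus up to 1s of jitter to avoid
            # synchronized retry storms across parallel tasks
            time.sleep(base_s * 2 ** attempt + random.random())
```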
Self-check checklist
- Every task has explicit inputs and outputs
- Re-running a task does not duplicate side effects
- Backfill plan will not overwrite global paths
- Parallel tasks are limited by resource pools
- Branching is deterministic and well-logged
- Cross-DAG dependencies are versioned and deduplicated
Exercises you will do
Do these in a notebook or text editor. Write your DAG edges as lines like A → B.
Exercise 1: Daily training DAG dependencies
Design edges and notes for idempotency and parallelism for a daily training pipeline.
What to produce
- Edges list (A → B...)
- Which tasks can run in parallel
- Where to apply retries and how to ensure idempotency
Exercise 2: Partitioned batch scoring with retries
Design a fan-out/fan-in scoring DAG over 8 partitions with GPU pool=2 and retry policy.
What to produce
- Edges list
- Concurrency limits
- Retry/backoff settings and idempotent output scheme
Checklist before you compare with solutions
- All edges reflect true data needs (not just time)
- Idempotency is addressed for each side-effecting task
- Concurrency limits are realistic
- Backfills will not corrupt current outputs
Practical projects
- Project 1: Build a daily feature pipeline with backfill. Include validate → compute_features → publish_features. Add a backfill plan for the last 30 days.
- Project 2: Create a fan-out scoring job over partitions with GPU pool limits and a merge step. Produce a run report per date.
- Project 3: Implement a monitoring DAG that branches on drift and triggers a retrain DAG. Ensure trigger deduplication by run key.
Learning path
- Start with expressing clear single-DAG dependencies
- Add partitioning and fan-out/fan-in
- Introduce external sensors and cross-DAG triggers
- Harden with retries, pools, and idempotent storage layouts
- Practice backfills and late-arriving data scenarios
Next steps
- Complete the exercises, then take the Quick Test below
- Integrate these patterns into a real pipeline at small scale
- Gradually add concurrency and backfill complexity
Mini challenge
Design a DAG that: (1) computes features for yesterday, (2) scores live traffic samples, (3) triggers retrain if drift > 0.1, (4) performs a canary deploy, and (5) promotes to production if canary error delta < 2%.
Acceptance criteria
- List edges and branch conditions
- Show resource limits for scoring
- Describe idempotent artifact locations
- Explain the backfill impact on (1) and how backfills stay isolated from (4)-(5)