DAG Design And Dependencies

Learn DAG Design And Dependencies for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Data engineers run dozens of pipelines daily. Good DAG (Directed Acyclic Graph) design prevents delays, broken data, and costly backfills. You will use it to: coordinate upstream and downstream jobs, run tasks in parallel safely, recover from failures, and ship reliable datasets on schedule.

  • Real tasks: orchestrate daily batch loads, trigger weekly aggregates after all dailies finish, wait for files/events, and manage retries/backfills without double-counting.
  • Outcomes: predictable delivery times, fewer on-call incidents, and simpler debugging.

Concept explained simply

A DAG is a set of tasks (nodes) with one-way arrows (edges) that say “run B after A.” No cycles allowed. The scheduler uses edges to decide order and parallelism.
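
If it helps to see the idea in code, here is a minimal, framework-free sketch in plain Python (the task names are invented for illustration). The dictionary maps each task to the tasks it depends on, and the standard library's graphlib computes a valid run order, raising CycleError if a loop sneaks in:

from graphlib import TopologicalSorter

# Hypothetical pipeline: extract -> clean -> (load_us, load_eu) -> publish.
# Each key lists the tasks that must finish before it may run.
dag = {
    "clean": {"extract"},
    "load_us": {"clean"},
    "load_eu": {"clean"},
    "publish": {"load_us", "load_eu"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['extract', 'clean', 'load_us', 'load_eu', 'publish']

Orchestrators such as Airflow, Prefect, and Dagster do this ordering for you. Note that load_us and load_eu share no edge, so a scheduler is free to run them in parallel.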

Mental model

Imagine an assembly line with checkpoints. Each station (task) works only when inputs are ready. Arrows are the conveyor belts. No belt loops back.

Core principles of solid DAG design

  • Deterministic. Same inputs for a run date produce the same outputs.
  • Idempotent tasks. Safe to re-run without corrupting data (e.g., overwrite/merge by partition, not append blindly).
  • Explicit dependencies. Only connect what truly depends. Avoid hidden coupling via shared state.
  • Small, composable tasks. Easier retries and clearer lineage.
  • Data-aware scheduling. Align run schedule with data availability (files, API windows, upstream SLAs).
  • Retries and backfills. Configure retries with jitter; design outputs so backfills don’t duplicate records.
  • Concurrency and resources. Use pools/limits to prevent resource contention; cap parallelism on heavy tasks.
  • Quality gates as dependencies. DQ checks should gate publishes/loads.
  • SLAs and alerts. Measure and alert on lateness to protect downstream consumers.
Example: Idempotency patterns
  • Write to a temp table, then atomic swap to target.
  • Use upserts by date partition (e.g., run_date) instead of full re-writes.
  • Derive output paths from execution date, not wall-clock time (see the sketch below).
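
A minimal sketch of the atomic-swap and run_date-derived-path patterns in plain Python (the paths and the publish_partition helper are illustrative; in a warehouse the same idea becomes a partition overwrite or a MERGE keyed on run_date):

import json
import shutil
import tempfile
from pathlib import Path

def publish_partition(records: list[dict], run_date: str, target_root: str = "data/mart/orders") -> None:
    """Write exactly one run_date partition: build it in a temp dir, then swap it in."""
    final_dir = Path(target_root) / f"run_date={run_date}"   # path derived from run_date, not wall-clock time
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    tmp_dir = Path(tempfile.mkdtemp(prefix=f".tmp_{run_date}_", dir=final_dir.parent))
    (tmp_dir / "part-000.json").write_text(json.dumps(records))
    if final_dir.exists():
        shutil.rmtree(final_dir)               # a re-run replaces the partition instead of appending to it
    shutil.move(str(tmp_dir), str(final_dir))  # a rename on the same filesystem, so the swap is quick and clean

# Running this twice for the same run_date leaves exactly one copy of the data,
# so retries and backfills never double-count.
publish_partition([{"order_id": 1, "amount": 10.0}], run_date="2026-01-08")
publish_partition([{"order_id": 1, "amount": 10.0}], run_date="2026-01-08")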

Worked examples

Example 1 — Fan-out/Fan-in daily batch

Goal: Ingest daily orders, transform by country in parallel, then merge and publish.

  • Tasks: extract_raw → stage_clean → split_by_country → transform_[US|UK|DE|FR] (parallel) → merge_all → dq_check → publish_mart
  • Edges: extract_raw → stage_clean → split_by_country → all transforms; all transforms → merge_all → dq_check → publish_mart
  • Notes: Limit max parallel transforms to protect warehouse. Make merge idempotent (partition overwrite).
Why this works

Parallelism speeds up processing while a single merge ensures consistent output. DQ gate prevents bad publishes.
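
One way this shape might be declared in Airflow 2.x, one of the orchestrators listed under Prerequisites (task bodies are stubbed out, the schedule and names are illustrative, and a pool called "warehouse" is assumed to exist to cap parallel transforms):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _noop(**_):
    pass  # placeholder for the real extract/transform/merge/publish logic

with DAG(
    dag_id="daily_orders",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_raw = PythonOperator(task_id="extract_raw", python_callable=_noop)
    stage_clean = PythonOperator(task_id="stage_clean", python_callable=_noop)
    split_by_country = PythonOperator(task_id="split_by_country", python_callable=_noop)

    transforms = [
        PythonOperator(task_id=f"transform_{c}", python_callable=_noop, pool="warehouse")
        for c in ("us", "uk", "de", "fr")
    ]

    merge_all = PythonOperator(task_id="merge_all", python_callable=_noop)  # partition overwrite lives here
    dq_check = PythonOperator(task_id="dq_check", python_callable=_noop)
    publish_mart = PythonOperator(task_id="publish_mart", python_callable=_noop)

    extract_raw >> stage_clean >> split_by_country >> transforms
    transforms >> merge_all >> dq_check >> publish_mart

The shared pool throttles the fan-out so the four transforms cannot overwhelm the warehouse.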

Example 2 — Branching with late-arriving data

Goal: Run transform only after file for the day arrives; otherwise alert and stop.

  • Tasks: wait_for_file (sensor) → branch_has_file → [transform_path | alert_path] → end
  • Dependencies: wait_for_file → branch_has_file → either transform_path or alert_path
  • Notes: Timeouts + soft-fail on sensor to avoid blocking pools indefinitely.
Why this works

Branching keeps the DAG acyclic while expressing alternative paths. Fail fast if inputs don’t arrive.
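
A hedged Airflow 2.x sketch of the same shape (the landing path, schedule, and callables are illustrative; the alert logic is stubbed out):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.sensors.filesystem import FileSensor
from airflow.utils.trigger_rule import TriggerRule

def _branch(ds, **_):
    # Route to the transform if today's file showed up, otherwise to the alert.
    path = f"/landing/orders/{ds}/orders.csv"
    return "transform_path" if os.path.exists(path) else "alert_path"

def _noop(**_):
    pass  # placeholder for the real transform / alerting logic

with DAG("orders_branching", start_date=datetime(2026, 1, 1),
         schedule="@daily", catchup=False) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/landing/orders/{{ ds }}/orders.csv",
        mode="reschedule",       # release the worker slot between pokes
        poke_interval=300,
        timeout=6 * 60 * 60,     # stop waiting after 6 hours
        soft_fail=True,          # time out as "skipped" rather than "failed"
    )
    branch_has_file = BranchPythonOperator(
        task_id="branch_has_file",
        python_callable=_branch,
        trigger_rule=TriggerRule.NONE_FAILED,  # still runs if the sensor timed out and was skipped
    )
    transform_path = PythonOperator(task_id="transform_path", python_callable=_noop)
    alert_path = PythonOperator(task_id="alert_path", python_callable=_noop)
    end = EmptyOperator(task_id="end",
                        trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS)

    wait_for_file >> branch_has_file >> [transform_path, alert_path]
    [transform_path, alert_path] >> end

The end task accepts skipped upstreams, so whichever path the branch skips does not block the run from finishing.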

Example 3 — Weekly aggregate after daily DAG finishes

Goal: Build weekly KPIs only when the last daily load for the week succeeds.

  • Design: Daily DAG writes dataset/flag for each day. Weekly DAG waits for Sunday’s success (or a week_complete flag) using an external dependency, then runs aggregate_weekly → publish_weekly.
  • Avoid: Direct circular references between daily and weekly DAGs.
Why this works

Separating DAGs keeps scopes clear. The weekly job depends on a single well-defined external signal.
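
The "week_complete flag" variant might look like this in Airflow 2.x (flag path, schedule, and task names are illustrative; the daily DAG is assumed to write the flag after its Sunday publish succeeds):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor

def _flag_exists(flag_path, **_):
    return os.path.exists(flag_path)

def _noop(**_):
    pass  # placeholder for the real aggregation / publish logic

with DAG("weekly_kpis", start_date=datetime(2026, 1, 5),
         schedule="0 3 * * MON", catchup=False) as dag:
    wait_for_week = PythonSensor(
        task_id="wait_for_week",
        python_callable=_flag_exists,
        op_kwargs={"flag_path": "/data/flags/week_complete_{{ ds }}.done"},  # templated; match how the daily DAG names its flag
        mode="reschedule",
        poke_interval=600,
        timeout=12 * 60 * 60,
    )
    aggregate_weekly = PythonOperator(task_id="aggregate_weekly", python_callable=_noop)
    publish_weekly = PythonOperator(task_id="publish_weekly", python_callable=_noop)

    wait_for_week >> aggregate_weekly >> publish_weekly

The only edge between the two DAGs is this single signal, so no cycle can form between the daily and weekly schedules. An ExternalTaskSensor or, in newer Airflow versions, a dataset-triggered schedule can express the same dependency.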

Example 4 — Incremental vs full refresh

Goal: Rebuild a dimension table daily incrementally; monthly full refresh.

  • Tasks: choose_mode → [incremental_upsert | full_refresh] → dq_uniqueness → publish_dim
  • Dependencies: choose_mode branches by calendar; both paths converge on DQ then publish.
  • Notes: Keep the publish step identical for both paths to simplify consumers (see the sketch below).
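
A hedged Airflow 2.x sketch (names and the "first of the month" rule are illustrative):

from datetime import date, datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.utils.trigger_rule import TriggerRule

def _choose_mode(ds, **_):
    # Full refresh on the first day of the month, incremental upsert otherwise.
    return "full_refresh" if date.fromisoformat(ds).day == 1 else "incremental_upsert"

def _noop(**_):
    pass  # placeholder for the real load / check / publish logic

with DAG("dim_customer", start_date=datetime(2026, 1, 1),
         schedule="@daily", catchup=False) as dag:
    choose_mode = BranchPythonOperator(task_id="choose_mode", python_callable=_choose_mode)
    incremental_upsert = PythonOperator(task_id="incremental_upsert", python_callable=_noop)
    full_refresh = PythonOperator(task_id="full_refresh", python_callable=_noop)
    dq_uniqueness = PythonOperator(
        task_id="dq_uniqueness",
        python_callable=_noop,
        trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS,  # runs after whichever branch was taken
    )
    publish_dim = PythonOperator(task_id="publish_dim", python_callable=_noop)

    choose_mode >> [incremental_upsert, full_refresh]
    [incremental_upsert, full_refresh] >> dq_uniqueness >> publish_dim

Because both paths converge on the same dq_uniqueness and publish_dim tasks, consumers never need to know which mode ran.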

Designing dependencies step-by-step

  1. Define outputs and boundaries: raw, staging, curated, and who consumes them.
  2. Choose schedule and triggers: based on upstream arrival times and SLAs.
  3. Encode dependencies: only connect tasks that truly need upstream results.
  4. Plan retries/backfills: ensure idempotent writes and partition-scoped computation.
  5. Set timeouts, SLAs, and concurrency: prevent stuck runs and noisy neighbors (a config sketch follows this list).
  6. Add DQ gates: block publishes on failed quality checks.
  7. Observability: log run_date, input counts, and output counts for each task.
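
For steps 4 and 5, many teams centralize retries, timeouts, SLAs, and concurrency limits in default_args and DAG-level settings. A hedged Airflow 2.x sketch (all values and the "warehouse" pool are illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,          # spread retries out instead of hammering upstreams
    "max_retry_delay": timedelta(minutes=30),
    "execution_timeout": timedelta(hours=1),    # kill stuck tasks instead of blocking the run
    "sla": timedelta(hours=3),                  # flag tasks that finish too long after the scheduled run
}

def _noop(**_):
    pass  # placeholder task body

with DAG(
    dag_id="orders_curated",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
    max_active_runs=1,    # no overlapping runs competing for the same partitions
    max_active_tasks=4,   # cap parallelism inside one run
) as dag:
    heavy_transform = PythonOperator(
        task_id="heavy_transform",
        python_callable=_noop,
        pool="warehouse",  # a shared pool limits concurrency across all DAGs hitting the warehouse
    )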
Mini task: Draw it

Take any pipeline you own. List tasks vertically, draw arrows for dependencies, and circle any node with more than 5 inbound edges (likely too coupled). Split or refactor those nodes.

Exercises

These mirror the graded exercises below. Everyone can do them for free. Only logged-in users will see saved progress.

Exercise 1 — Turn a narrative into a DAG

Narrative: Fetch marketing spend from API for a given run_date. Stage to raw. Split by channel (search, social, display). Transform each channel in parallel. Merge into a single table by run_date. Run DQ rowcount check. Publish the daily mart.

  • Deliverable: list tasks, valid topological order, and dependency pairs.
Hint

Start with fetch → stage; fan-out after the first safe checkpoint; add a DQ gate before publish.

Exercise 2 — Remove a hidden cycle

Scenario: You modeled aggregate_weekly → cleanup_intermediate and cleanup_intermediate → daily_load (to remove leftovers), while daily_load → aggregate_weekly. This is a cycle. Redesign.

Hint

Move cleanup into the same DAG path that creates intermediates, or run cleanup after publish in that DAG. Keep weekly DAG independent and depend only on a weekly-ready signal.

  • Checklist before submitting:
    • Are all writes idempotent by run_date?
    • Is there any circular dependency?
    • Is publish gated by DQ?
    • Do sensors have timeouts and clear failure behavior?

Common mistakes and self-check

  • Hidden coupling via shared temp tables without explicit edges. Fix: make producers write a dated artifact; consumers depend on that artifact’s success.
  • Non-idempotent appends causing duplicates on retries. Fix: upsert/overwrite by partition.
  • Oversized tasks. Fix: split into extract/transform/load steps.
  • Sensors with no timeout, blocking pools. Fix: set timeout and soft-fail or alert.
  • Excessive cross-DAG dependencies. Fix: create a single external success flag or dataset trigger.
  • No DQ gates before publish. Fix: add rowcount, null-rate, or schema checks as dependencies (a minimal gate sketch follows this list).
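
As a concrete instance of the last fix, a DQ gate can be as small as a task callable that raises when the freshly built partition looks wrong; because publish depends on it, bad data never ships. The count_rows_for_partition helper below is a hypothetical stand-in for a warehouse query:

def count_rows_for_partition(run_date: str) -> int:
    # Hypothetical stub; in practice this would run something like
    # "SELECT COUNT(*) FROM mart.orders WHERE run_date = %s" against the warehouse.
    return 0

def dq_rowcount_gate(run_date: str, min_rows: int = 1) -> int:
    """Fail loudly when the partition for run_date is empty or suspiciously small."""
    rows = count_rows_for_partition(run_date)
    if rows < min_rows:
        raise ValueError(f"DQ gate failed for {run_date}: {rows} rows (< {min_rows})")
    return rows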
Self-check
  • Can I safely re-run yesterday’s DAG 10 times without changing the final data?
  • Can I backfill last week without touching unrelated dates?
  • Do I know exactly which upstream success unlocks this DAG?

Practical projects

  • Build a daily sales pipeline with fan-out by region, a merge step, and DQ gates. Include backfill for a custom date range.
  • Create a weekly KPI DAG that waits on the daily DAG’s Sunday run. Publish a report table with SLA alerts.
  • Refactor a monolithic job into 6 tasks with explicit dependencies; add concurrency limits and idempotent writes.

Who this is for

  • Junior to mid-level data engineers who orchestrate batch/near-real-time pipelines.
  • Analytics engineers needing reliable delivery to BI layers.

Prerequisites

  • Basic SQL and data modeling.
  • Familiarity with at least one orchestrator (e.g., Airflow, Prefect, Dagster).
  • Comfort with command-line and version control.

Learning path

  1. Understand DAG basics and idempotency.
  2. Practice fan-out/fan-in and branching.
  3. Add sensors, SLAs, and DQ gates.
  4. Design cross-DAG dependencies safely.
  5. Implement backfill and catchup strategies.

Next steps

  • Implement one project above in your environment.
  • Add monitoring: track run durations and row counts per task.
  • Take the quick test to confirm you can spot design flaws.

Mini challenge

Design a DAG that continuously ingests streaming events into a bronze table and builds a daily silver snapshot at 01:00, only after at least one hour of quiescence. Include: event-ingest task group, idle-window sensor, snapshot transform, DQ checks, and publish. Describe dependencies and failure behavior.

Practice Exercises

2 exercises to complete

Instructions

Narrative: Fetch marketing spend from API for a given run_date. Stage to raw. Split by channel (search, social, display). Transform each channel in parallel. Merge into a single table by run_date. Run DQ rowcount check. Publish the daily mart.

Deliverables:

  • List of tasks.
  • A valid topological order.
  • Dependency pairs (edges as (upstream -> downstream)).
Expected Output
  • Tasks: fetch_api, stage_raw, split_channels, transform_search, transform_social, transform_display, merge_all, dq_rowcount, publish_mart.
  • One valid topological order: fetch_api, stage_raw, split_channels, transform_search, transform_social, transform_display, merge_all, dq_rowcount, publish_mart.
  • Edges: fetch_api -> stage_raw, stage_raw -> split_channels, split_channels -> transform_search, split_channels -> transform_social, split_channels -> transform_display, transform_search -> merge_all, transform_social -> merge_all, transform_display -> merge_all, merge_all -> dq_rowcount, dq_rowcount -> publish_mart.

DAG Design And Dependencies — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

