
Dependency Management

Learn Dependency Management for free with explanations, exercises, and a quick test, tailored for ETL Developers.

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

Dependency management keeps your ETL pipelines correct, on time, and cost-effective. In real work you will:

  • Guarantee that transformations run only after upstream data is complete.
  • Coordinate fan-out/fan-in steps (e.g., partition processing then aggregation).
  • Block reports until critical tables finish, but allow non-critical tasks to proceed.
  • Handle late-arriving files without double-processing or gaps.
  • Backfill safely across days without breaking today’s runs.

Concept explained simply

A dependency is a rule that says “Task B can run only after Task A is ready.” In ETL, readiness can mean different things: a task succeeded, a dataset partition exists, or a file is complete.

Mental model

Think of your pipeline as a Directed Acyclic Graph (DAG): nodes are tasks, edges are dependencies. Data flows forward; cycles are not allowed. Your goal is to describe readiness clearly, so each run is predictable and repeatable.
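
A tiny sketch of that mental model in Python (task names are illustrative): the whole graph is just a mapping from each task to the tasks that may run only after it.

# The mental-model DAG as a plain mapping: task -> downstream tasks.
# Task names are illustrative.
dag = {
    "extract": ["transform"],
    "transform": ["load"],
    "load": ["build_report"],
    "build_report": [],  # no downstream tasks
}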

Key dependency types

  • Task success dependencies: Run T2 after T1 succeeds (typical baseline).
  • Dataset/data partition dependencies: Run when a specific table partition (e.g., date=2026-01-10) is available.
  • Event/file dependencies: Run after a file appears and is complete (often via a marker file, e.g., _SUCCESS).
  • Time dependencies: Wait until a window closes or a scheduled time is reached.
  • Cross-job (external) dependencies: Job B waits for Job A’s matching partition.
  • Trigger rules: all_success, one_success, all_done (useful for aggregations, alerts, cleanup).
  • Resource/lock dependencies: Prevent overlap or contention (e.g., only one heavy job at a time).
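
As a rough sketch of what the trigger rules above mean (plain Python, not tied to any particular orchestrator, though Airflow uses the same names), each rule is just a predicate over the states of a task's upstream tasks:

# Trigger rules as predicates over upstream task states (illustrative sketch).
def all_success(upstream_states):
    return all(s == "success" for s in upstream_states)

def one_success(upstream_states):
    return any(s == "success" for s in upstream_states)

def all_done(upstream_states):
    # "Done" means the task reached any terminal state, success or not.
    return all(s in ("success", "failed", "skipped") for s in upstream_states)

# Example: one load failed.
states = ["success", "failed"]
print(all_success(states))  # False -> do not build the aggregate report
print(all_done(states))     # True  -> still safe to run cleanup/alerting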

Worked examples

Example 1: Daily sales pipeline with fan-in

Scenario:

  • extract_sales → transform_sales → load_sales
  • extract_customers → transform_customers → load_customers
  • build_report depends on both load_sales and load_customers

Dependencies:

  • extract_sales → transform_sales → load_sales
  • extract_customers → transform_customers → load_customers
  • load_sales → build_report
  • load_customers → build_report

Trigger rule for build_report: all_success (wait for both loads).
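
A minimal sketch of this DAG, assuming a recent Apache Airflow 2.x as the orchestrator (EmptyOperator stands in for the real extract/transform/load logic):

# Example 1 wired up in Apache Airflow (assumed orchestrator; tasks are placeholders).
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="daily_sales", start_date=datetime(2026, 1, 1),
         schedule="@daily", catchup=False) as dag:
    extract_sales = EmptyOperator(task_id="extract_sales")
    transform_sales = EmptyOperator(task_id="transform_sales")
    load_sales = EmptyOperator(task_id="load_sales")

    extract_customers = EmptyOperator(task_id="extract_customers")
    transform_customers = EmptyOperator(task_id="transform_customers")
    load_customers = EmptyOperator(task_id="load_customers")

    # all_success is the default trigger rule; stated explicitly for clarity.
    build_report = EmptyOperator(task_id="build_report", trigger_rule="all_success")

    extract_sales >> transform_sales >> load_sales
    extract_customers >> transform_customers >> load_customers
    [load_sales, load_customers] >> build_report  # fan-in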

Why this works

Each chain protects data readiness, and the fan-in at build_report prevents partial reports. No cycles exist; multiple valid execution orders are possible.

Example 2: File arrival with completeness marker

Scenario: Run transform only when a file for a given date arrives completely in storage.

  1. Wait for the marker file /ingest/date=YYYY-MM-DD/_SUCCESS (its presence indicates complete delivery).
  2. Transform partition date=YYYY-MM-DD (idempotent: write to a temp table, then atomic swap/merge).
  3. Load partition into the warehouse.

Key dependency: transform depends on “marker present” rather than mere folder existence. This avoids partial reads.

Why markers?

Markers signal that upstream systems finished writing all shards. Without one, you risk reading partial data and breaking aggregates.
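
A plain-Python sketch of that guard (the path layout, timeout, and helper names are assumptions; a real pipeline might use an orchestrator's file sensor instead):

# Wait for the completeness marker before transforming the partition.
import time
from pathlib import Path

def wait_for_marker(date_str, root="/ingest", timeout_s=3600, poll_s=30):
    marker = Path(root) / f"date={date_str}" / "_SUCCESS"
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if marker.exists():  # marker present: all shards were written
            return
        time.sleep(poll_s)
    raise TimeoutError(f"Marker never appeared: {marker}")

# Usage: only transform once the upstream delivery is complete.
# wait_for_marker("2026-01-10")
# transform_partition("2026-01-10")  # hypothetical idempotent transform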

Example 3: Cross-job dependency on matching partitions

Scenario: Job A builds dim_product per date. Job B builds fact_sales and needs the same date’s dim_product ready first.

  • Dependency: fact_sales(date=D) waits for dim_product(date=D) to complete.
  • Backfill: If you backfill 7 days, run dim_product(D) → fact_sales(D) as a pair for each date D, rather than pointing every fact at the “latest” dim.

Common pitfall

Depending on the latest successful upstream run can silently mix partitions (e.g., dim for D-1 with fact for D). Always match on the same partition.
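
A sketch of the partition-matched backfill in plain Python; is_partition_complete and build_fact_sales are hypothetical helpers standing in for a completion-log check and the real fact build:

from datetime import date, timedelta

def is_partition_complete(table: str, d: date) -> bool:
    # Hypothetical: query a completion log or check for a partition marker.
    return True  # placeholder so the sketch runs end to end

def build_fact_sales(d: date) -> None:
    # Hypothetical: the real fact build for one date partition.
    print(f"building fact_sales for {d}")

def backfill_fact_sales(start: date, days: int) -> None:
    # Each fact date waits for the SAME dim date; never depend on the "latest" dim.
    for i in range(days):
        d = start + timedelta(days=i)
        if not is_partition_complete("dim_product", d):
            raise RuntimeError(f"dim_product not ready for {d}; refusing to mix partitions")
        build_fact_sales(d)

backfill_fact_sales(date(2026, 1, 4), days=7)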

Designing dependencies: step-by-step

  1. List outputs per step: Identify datasets/partitions each task produces.
  2. Define readiness: For each consumer, write the rule (task success, partition available, marker exists).
  3. Choose trigger rules: all_success for correctness; all_done for cleanup/alerts; one_success for optional fallbacks.
  4. Make tasks idempotent: Use upserts/merges, partition overwrites, or temp-to-final swaps.
  5. Add guards: Sensors/checks for file completeness and data quality gates before publish.
  6. Prevent overlap: Use resource locks/concurrency limits to avoid two runs writing to the same partition.
  7. Plan backfills: Express dependencies per time window; re-run windows independently.
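
Step 4 is often the hardest to get right. One common pattern is overwrite-by-partition inside a single transaction; here is a minimal sketch using sqlite3 as a stand-in for the warehouse (table and columns are illustrative):

import sqlite3

def overwrite_partition(conn: sqlite3.Connection, load_date: str, amounts: list[float]) -> None:
    with conn:  # one transaction: the partition is fully replaced or left untouched
        conn.execute("DELETE FROM sales WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO sales (load_date, amount) VALUES (?, ?)",
            [(load_date, a) for a in amounts],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (load_date TEXT, amount REAL)")
overwrite_partition(conn, "2026-01-10", [10.0, 25.5])
overwrite_partition(conn, "2026-01-10", [10.0, 25.5])  # safe retry: no duplicates
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())  # (2,)
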
Mini task: write a readiness rule

Pick one task in your pipeline and finish the sentence: “Task X for date D can start when ____ is true.” Keep it precise and measurable.

Common mistakes and self-check

  • Mistake: Depending on “folder exists” instead of a completeness signal. Fix: Use a marker (e.g., _SUCCESS) or explicit file count.
  • Mistake: Cross-partition dependencies (mixing dates). Fix: Match the same partition across jobs.
  • Mistake: Cycles in the graph. Fix: Ensure a topological order exists; no task eventually depends on itself.
  • Mistake: Non-idempotent writes causing duplicates after retries. Fix: Use upsert/merge, overwrite-by-partition, or unique run identifiers.
  • Mistake: Overly strict triggers blocking useful work (e.g., dashboards waiting on non-critical steps). Fix: Split critical vs optional steps and give them different trigger rules.

Self-check

  • Can every task state its precise readiness condition?
  • Could your pipeline re-run yesterday safely, without manual cleanup?
  • Do aggregations wait for all contributing partitions?
  • Is there any possible loop? Try to draw a topological order.
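
To answer that last question mechanically, a small sketch of Kahn's algorithm both produces a valid order and detects cycles (the graph below is Example 1):

# Topological sort with cycle detection (Kahn's algorithm).
from collections import deque

def topological_order(dag):
    indegree = {task: 0 for task in dag}
    for downstream in dag.values():
        for t in downstream:
            indegree[t] += 1
    ready = deque(task for task, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for t in dag[task]:
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    if len(order) != len(dag):
        raise ValueError("Cycle detected: no valid execution order exists")
    return order

dag = {
    "extract_sales": ["transform_sales"], "transform_sales": ["load_sales"],
    "extract_customers": ["transform_customers"], "transform_customers": ["load_customers"],
    "load_sales": ["build_report"], "load_customers": ["build_report"],
    "build_report": [],
}
print(topological_order(dag))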

Exercises

  • Exercise 1: Find a valid execution order for a small DAG.
  • Exercise 2: Define explicit dependencies for a business scenario.

Checklist before you start

  • I can name the outputs of each task.
  • I know which dependencies are task-based vs data-based.
  • I understand trigger rules: all_success, all_done, one_success.
  • My tasks are idempotent or have a plan to be.

Practical projects

  • Project 1: Build a daily DAG that ingests two sources, transforms them, loads two tables, and builds a report with a fan-in dependency.
  • Project 2: Implement file-based dependencies using a completeness marker and perform a 3-day backfill safely.
  • Project 3: Create two separate jobs where one produces a dimension per date and the other consumes it; enforce partition-matched cross-job dependencies.

Mini challenge

Design dependencies for a weekly rollup that aggregates 7 daily partitions into a weekly table. Requirements:

  • Daily partitions must all be complete for the same week.
  • Late-arriving days must not cause partial weekly data to be published.
  • Backfills should re-compute only affected weeks.

Write: (1) the readiness rule for the weekly task, (2) how you would prevent partial outputs, (3) how you would re-run a specific week safely.

Learning path

  • Before: Understand basic job scheduling and retries.
  • Now: Master dependency management (this page), especially partition-matched rules and idempotency.
  • Next: Add data quality gates, SLAs, and alerting to protect downstream consumers.

Who this is for

  • ETL/ELT developers building scheduled data pipelines.
  • Data engineers orchestrating multi-step jobs and backfills.
  • Analytics engineers coordinating model builds with upstream ingestion.

Prerequisites

  • Basic understanding of ETL/ELT tasks and data warehouses.
  • Familiarity with scheduling concepts (intervals, retries).
  • Comfort with partitioned data and idempotent writes.

Next steps

  • Complete the exercises below and take the quick test.
  • Refactor one of your existing pipelines to use explicit partition-matched dependencies and a completeness marker.
  • Add a cleanup task with all_done to capture logs and metrics even on failure.
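
If you use Apache Airflow (an assumption; other orchestrators have equivalent features), that cleanup step might look roughly like this:

# Cleanup task that runs even when upstream tasks fail (Airflow 2.x assumed).
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator

def capture_run_metrics(**context):
    # Hypothetical helper: persist logs/metrics for this run somewhere durable.
    print("run finished for", context["ds"])

with DAG(dag_id="pipeline_with_cleanup", start_date=datetime(2026, 1, 1),
         schedule="@daily", catchup=False) as dag:
    load = EmptyOperator(task_id="load")  # stands in for the real pipeline
    cleanup = PythonOperator(
        task_id="capture_run_metrics",
        python_callable=capture_run_metrics,
        trigger_rule="all_done",  # run whether load succeeds or fails
    )
    load >> cleanup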

Practice Exercises

2 exercises to complete

Instructions

Given tasks and edges below, produce one valid topological order (an order that respects dependencies):

  • Edges: extract_sales → transform_sales → load_sales
  • Edges: extract_customers → transform_customers → load_customers
  • Edges: load_sales → build_report, load_customers → build_report

Write a sequence like: extract_sales, extract_customers, ... , build_report

Expected Output
Any valid topological order, e.g., extract_sales, extract_customers, transform_sales, transform_customers, load_sales, load_customers, build_report

Dependency Management — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
