Why this matters
Scheduling and dependencies ensure your data transforms run on time, in the right order, and only when inputs are ready. As an Analytics Engineer, you will:
- Run daily dbt models after raw data lands.
- Backfill historical partitions safely when logic changes.
- Pause or delay downstream jobs if upstream data is late.
- Prevent overlapping runs and protect shared compute with concurrency limits.
- Alert stakeholders when SLAs are at risk.
Concept explained simply
Scheduling decides when a workflow starts. Dependencies decide what must finish before the next step can start. Together they form a DAG (Directed Acyclic Graph) where nodes are tasks and arrows show order.
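The DAG idea can be sketched in a few lines of Python using the standard library's graphlib; the task names here are illustrative, not any orchestrator's API:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks that must finish first.
dag = {
    "snapshots": set(),
    "staging": {"snapshots"},
    "marts": {"staging"},
    "tests": {"marts"},
}

# static_order() yields a valid execution order and raises CycleError
# if the dependencies contain a cycle (i.e. the graph is not acyclic).
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['snapshots', 'staging', 'marts', 'tests']
```

Real orchestrators do exactly this under the hood: they topologically sort the task graph and refuse to run if a cycle exists.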
Mental model
Imagine a morning routine:
- Start at 6:30.
- Make coffee before breakfast. Breakfast before leaving home.
- If coffee beans are missing (upstream not ready), wait up to 10 minutes, then proceed with tea (fallback).
- Never do two full routines at once (max concurrency = 1).
That is scheduling (6:30), dependencies (coffee → breakfast → leave), readiness checks (beans exist), fallback (tea), and concurrency (only one routine at a time).
Core ideas and terminology
- Schedule: when to trigger (a cron expression such as 30 6 * * *, meaning 06:30 daily, or a fixed interval).
- DAG: tasks with one-way dependencies (no cycles).
- Upstream/Downstream: if task A is upstream of task B, A must succeed before B can start.
- Catchup: whether missed past runs should execute when a job is enabled.
- Timezone: define a single source of truth to avoid daylight saving surprises.
- Retries and Backoff: automatic re-attempts to handle transient failures.
- Timeouts and SLAs: stop long-running tasks; alert if the whole run exceeds a target duration.
- Sensors/Readiness checks: wait for files/partitions/tables to exist before proceeding.
- Concurrency, Pools, Priorities: limit parallelism to protect resources.
- Idempotency: re-running the same date partition produces the same result (safe backfills).
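Several of these ideas are mechanical enough to sketch. For example, "retries with exponential backoff" just means the wait before each re-attempt doubles from a base delay; a toy helper (not any orchestrator's API) makes this concrete:

```python
def backoff_delays(retries: int, base_minutes: float) -> list[float]:
    """Delay before each retry attempt: base, 2*base, 4*base, ..."""
    return [base_minutes * (2 ** i) for i in range(retries)]

# 2 retries starting at 10 minutes: wait 10m before retry 1, 20m before retry 2.
print(backoff_delays(2, 10))  # [10, 20]
```

Spacing retries out like this gives transient failures (a slow warehouse, a brief network blip) time to clear before the next attempt.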
Worked examples
Example 1 — Daily dbt job with clear dependencies
Goal: At 06:15 in Europe/Berlin, run: snapshots → staging models → marts → tests. Do not backfill past runs automatically.
Schedule: 15 6 * * * (Europe/Berlin)
Catchup: false
DAG order: snapshots -> stg_* -> dim_*/fact_* -> tests
Retries: 2, exponential backoff starting at 10m (delays of 10m, then 20m)
Timeout: each task 45m; whole DAG SLA 2h
Concurrency: 1 active run; queue extras
Why it works: snapshots ensure source-of-truth stability; staging prepares clean inputs; marts build business tables; tests validate outputs; catchup=false avoids accidental mass backfills.
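The timeout and SLA limits in this example can be checked mechanically. A toy validation, assuming durations in minutes and sequential execution (with max concurrency 1, total run time is the sum of task durations):

```python
TASK_TIMEOUT_M = 45   # per-task timeout from Example 1
DAG_SLA_M = 120       # whole-run SLA (2h)

def check_run(durations: dict[str, float]) -> list[str]:
    """Return a list of limit violations for a finished run."""
    issues = [f"{task} exceeded {TASK_TIMEOUT_M}m task timeout"
              for task, d in durations.items() if d > TASK_TIMEOUT_M]
    # Tasks run one after another here, so run duration is the sum.
    if sum(durations.values()) > DAG_SLA_M:
        issues.append(f"run exceeded {DAG_SLA_M}m SLA")
    return issues

run = {"snapshots": 10, "stg": 25, "marts": 40, "tests": 5}
print(check_run(run))  # [] -> within all limits
```

In a real orchestrator the SLA is usually measured as wall-clock time from the scheduled start, but the principle is the same: per-task limits catch one slow step, the run-level SLA catches death by a thousand cuts.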
Example 2 — Wait for partition before transform
Goal: For partition ds=2025-01-10, only run transforms after raw table partition exists.
Sensor: wait for raw.sales partition=ds
Max wait: 30m; poke every 2m; resource-light mode
On timeout: skip transform and alert; mark downstream as skipped
Why it works: prevents transforms from running on incomplete data; avoids misleading downstream dashboards.
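The sensor behaviour above can be sketched as a simple poll loop. This uses seconds instead of minutes so it runs instantly, and `partition_exists` is a stand-in for your real check (e.g. a metadata query against raw.sales):

```python
import time

def wait_for_partition(partition_exists, max_wait_s: float, poke_every_s: float) -> bool:
    """Poll until the upstream partition exists; False means we timed out."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if partition_exists():
            return True
        time.sleep(poke_every_s)
    return False  # caller should skip downstream and alert

# Simulate a partition that lands on the third check.
state = {"checks": 0}
def partition_exists():
    state["checks"] += 1
    return state["checks"] >= 3

ready = wait_for_partition(partition_exists, max_wait_s=8, poke_every_s=0.1)
print("run transform" if ready else "skip and alert")
```

Production sensors add one refinement: instead of sleeping in-process, a resource-light mode releases the worker slot between checks so waiting does not consume compute.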
Example 3 — Safe 7-day backfill with limits
Goal: Recompute last 7 days after fixing a logic bug, without overloading the warehouse.
Backfill window: ds in [D-7, D-1]
Concurrency: 2 date-partitions at a time
Idempotent write: overwrite or merge by ds
Dependencies: for each ds: snapshots -> stg -> marts -> tests
Alerts: notify if any ds fails; continue others
Why it works: bounded parallelism protects compute; idempotent writes prevent duplicates; partition-scoped dependencies maintain correctness.
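A toy version of this backfill loop, with bounded parallelism and an idempotent overwrite-by-partition write. The "warehouse" is a dict keyed by ds, so re-running a date replaces its rows instead of duplicating them:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

table = {}  # warehouse stand-in: ds -> rows for that partition

def rebuild_partition(ds: str) -> None:
    rows = [{"ds": ds, "value": 42}]  # pretend transform output
    table[ds] = rows  # overwrite by ds: idempotent, safe to re-run

# Backfill window [D-7, D-1] for D = 2025-01-10.
today = date(2025, 1, 10)
window = [(today - timedelta(days=d)).isoformat() for d in range(7, 0, -1)]

# At most 2 date-partitions in flight at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(rebuild_partition, window))

# Re-running one day changes nothing: the write is idempotent.
rebuild_partition(window[0])
print(len(table))  # 7 partitions, no duplicates
```

In SQL the same effect comes from `MERGE` or `INSERT OVERWRITE` scoped to the ds partition, rather than a plain `INSERT`.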
Helpful design patterns
- Fan-in/Fan-out: parallelize per-source tasks, then aggregate to a single step.
- Late data guardrail: sensor with max wait, then fallback logic or skip-and-alert.
- Resource protection: pools and max active runs per DAG.
- Event-driven + scheduled hybrid: trigger on file arrival within a daily time window.
- Reproducible backfills: pin code/config version used for the run; keep idempotent SQL (merge/replace).
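Fan-out/fan-in, the first pattern above, sketched with per-region tasks that run in parallel and a single aggregation step that depends on all of them (region names and numbers are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def build_region_model(region: str) -> dict:
    # Stand-in for a per-region transform.
    revenue = {"eu": 100, "us": 250, "apac": 80}[region]
    return {"region": region, "revenue": revenue}

regions = ["eu", "us", "apac"]

# Fan-out: independent per-region tasks run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(build_region_model, regions))

# Fan-in: the global mart starts only after every region has finished.
global_mart = {"total_revenue": sum(r["revenue"] for r in results)}
print(global_mart)  # {'total_revenue': 430}
```

The key property is that the fan-in step has every per-region task as an upstream dependency, so a single late region delays the aggregate rather than corrupting it.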
Exercises
Do these, then compare with the solutions below. A simple checklist helps you self-review.
Exercise 1 — Daily pipeline with dependencies
Design a daily pipeline that runs at 06:30 in your local timezone with the following tasks: 1) snapshot_sources, 2) models_stg, 3) models_mart, 4) tests. Requirements:
- Order: snapshot_sources → models_stg → models_mart → tests
- Catchup disabled
- Retries: 2, exponential backoff starting at 5 minutes
- Timeout: 40 minutes per task; DAG SLA: 2 hours
- Max concurrent active runs: 1
Show solution
Schedule: 30 6 * * * (local timezone)
Catchup: false
DAG: snapshot_sources -> models_stg -> models_mart -> tests
Retries: 2, exponential backoff starting at 5m (delays of 5m, then 10m)
Timeouts: 40m per task; SLA: 2h for the whole run
Concurrency: 1 (queue extra triggers)
Exercise 2 — Wait-then-backfill
Your raw data for ds arrives around 02:05 UTC. The transform should start no earlier than 02:10 UTC and only if ds is present. On missing ds after 25 minutes, skip and alert. You also need a 14-day backfill that runs at most 2 partitions in parallel.
Show solution
Schedule: 10 2 * * * (UTC)
Sensor: wait for raw partition ds; check every 2m up to 25m
On timeout: skip downstream; alert owner
Backfill: ds in [D-14, D-1], concurrency=2, idempotent writes (merge/replace by ds)
Self-check checklist
- Did you define a timezone and exact trigger time?
- Are dependencies strictly acyclic and complete?
- Do retries and timeouts exist for every critical task?
- Is catchup explicitly set and justified?
- Are concurrency and resource limits clear?
- Is there a plan for late/missing upstream data?
- Is the backfill strategy idempotent and bounded?
Common mistakes and how to self-check
- Missing timezone: schedules drift or misalign with data arrival. Self-check: explicitly state timezone next to cron.
- Implicit dependencies: tasks start too early. Self-check: draw the DAG; every task has defined upstreams.
- No readiness checks: transforms run on empty tables. Self-check: add sensors or checks for table/partition existence.
- Overlapping runs: data races or lock contention. Self-check: set max active runs and task-level concurrency.
- Non-idempotent SQL: duplicate rows on retries/backfills. Self-check: use merge/replace by partition keys.
- Unbounded backfills: starve production resources. Self-check: limit parallelism and time windows.
- Silent failures: no alerts on skips/timeouts. Self-check: define alerts on failure, timeout, or SLA miss.
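The non-idempotent-SQL pitfall is worth seeing concretely. Appending on every run duplicates rows when a retry or backfill re-executes a date; writing keyed by the partition does not (toy in-memory tables standing in for INSERT vs MERGE/REPLACE):

```python
append_table = []  # naive: plain INSERT on every run
merge_table = {}   # idempotent: rows keyed by partition

def run_append(ds: str) -> None:
    append_table.append({"ds": ds, "value": 1})  # duplicates on re-run

def run_merge(ds: str) -> None:
    merge_table[ds] = {"ds": ds, "value": 1}  # replace by ds: re-run safe

for _ in range(2):  # the scheduled run plus one retry of the same ds
    run_append("2025-01-10")
    run_merge("2025-01-10")

print(len(append_table), len(merge_table))  # 2 vs 1
```

Any task that can be retried or backfilled should write like `run_merge`: the partition key makes re-execution a no-op rather than a duplication.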
Practical projects
- Project 1: Build a daily sales pipeline with sensors that wait for raw partitions, plus SLA alerts. Include a 7-day backfill script.
- Project 2: Fan-out product models per region, fan-in to a global mart, with pool-based concurrency limits.
- Project 3: Add a late-data fallback that uses yesterday’s partition and marks downstream reports with a freshness flag.
Who this is for
Analytics Engineers, BI Developers, and Data Analysts who schedule dbt/SQL workflows and need reliable, predictable pipelines.
Prerequisites
- Comfort with SQL and incremental/model-based transforms.
- Basic understanding of DAG concepts and cron schedules.
- Familiarity with a workflow orchestration tool helps but is not required.
Learning path
- Understand DAGs and scheduling basics.
- Add readiness checks (sensors) and retries.
- Control concurrency and pools.
- Implement backfills with idempotent writes.
- Set SLAs, alerts, and monitoring.
Next steps
- Refactor one of your pipelines to make dependencies explicit.
- Add a partition sensor and a bounded backfill job.
- Introduce SLAs and alerts on the most critical dataset.
Mini challenge
Design a weekly pipeline that runs every Monday 05:00 UTC, building a summary table from seven daily partitions. Include: explicit dependencies, readiness checks, retry/backoff, timeout/SLA, catchup policy, and resource limits. Keep it idempotent.