Why this skill matters for Analytics Engineers
Orchestration turns your analytics code into reliable, repeatable data products. It schedules jobs, enforces task order, handles retries, alerts, and backfills, and keeps data fresh for stakeholders and downstream tools. Mastering orchestration lets you ship trustworthy models, keep SLAs, and integrate dbt and BI pipelines into one dependable system.
Who this is for
- Analytics Engineers moving from ad-hoc scripts to production pipelines.
- Data Analysts who want reliable daily tables and freshness guarantees.
- BI Developers who need dependable upstream datasets.
Prerequisites
- Comfortable with SQL and a data warehouse (e.g., writing SELECTs, creating tables).
- Basic Python familiarity (reading simple scripts, editing config).
- Understanding of dbt basics (models, tests, build/run) is helpful.
What you will be able to do
- Design DAGs that express clear task dependencies.
- Schedule reliable daily/hourly jobs with safe retries and alerts.
- Run parameterized and partition-aware pipelines.
- Handle backfills without duplicating data or breaking SLAs.
- Monitor runs, track freshness, and integrate dbt into orchestrators.
Learning path
- Scheduling & Dependencies: Create a daily job and define upstream/downstream tasks.
- DAG Concepts & Task Design: Keep tasks small, idempotent, and data-partition aware.
- Retries & Alerts: Add exponential backoff, set max retries, and notify on failures.
- Backfills: Reprocess only missing partitions safely and verify row counts.
- Parameters & Partitions: Pass run dates and use them in SQL/dbt.
- Monitoring: Review logs, durations, success rates; add sanity checks.
- SLAs & Freshness: Define expectations, measure delays, and report status.
- dbt Integration: Orchestrate dbt build/test with artifacts captured.
Mini tasks to practice (open-ended)
- Create a daily schedule at 02:00 with a two-task dependency.
- Add retries with exponential backoff and jitter (see the sketch after this list).
- Parameterize the run date and use it in a WHERE clause on a partition key.
- Run a backfill for the last 3 days without duplicating data.
- Trigger a dbt build as a downstream task; save run artifacts.
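For the backoff-and-jitter task, here is a minimal sketch assuming Prefect 2 or later, where exponential_backoff generates the growing delays and retry_jitter_factor randomizes them so retries do not synchronize; the flaky_extract task and its simulated failure are illustrative.
# Sketch: capped retries with exponential backoff and jitter (assumes Prefect 2+)
import random
from prefect import flow, task
from prefect.tasks import exponential_backoff
@task(
    retries=3,  # cap retries so permanent failures still surface
    retry_delay_seconds=exponential_backoff(backoff_factor=10),  # roughly 10s, 20s, 40s
    retry_jitter_factor=0.5,  # randomize delays to avoid synchronized retry storms
)
def flaky_extract(ds: str):
    # Illustrative transient failure; replace with a real extract
    if random.random() < 0.3:
        raise RuntimeError("transient upstream error")
    print(f"Extracted partition {ds}")
@flow
def retry_demo(ds: str = "2024-01-01"):
    flaky_extract(ds)
if __name__ == "__main__":
    retry_demo()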
Worked examples
1) Airflow: Daily DAG with retries, SLA, and alert
# airflow/dags/daily_sales.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
DEFAULT_ARGS = {
    "owner": "data",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "sla": timedelta(minutes=45),  # SLA is task-level in Airflow; default_args applies it to every task
}
def extract(**ctx):
    # Pull data for the run's logical date (the partition)
    run_date = ctx["data_interval_start"].date().isoformat()
    print(f"Extracting sales for {run_date}")
def load(**ctx):
    run_date = ctx["data_interval_start"].date().isoformat()
    # Implement an idempotent load (MERGE/UPSERT by partition_date)
    print(f"Loading partition {run_date} with MERGE")
with DAG(
    dag_id="daily_sales",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # 02:00 daily
    catchup=False,
    default_args=DEFAULT_ARGS,
    max_active_runs=1,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2
Key points: daily schedule, retries with exponential backoff, an idempotent load per partition, a single active run to avoid contention, and a task-level SLA (applied through default_args) to flag lateness.
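The key points mention alerts, but the DAG above only retries. One common pattern, sketched here with a hypothetical notify_slack helper, is to attach Airflow's on_failure_callback through default_args so every task reports failures once its retries are exhausted.
# Sketch: failure alerting via Airflow's on_failure_callback (notify_slack is a hypothetical helper)
from datetime import timedelta
def notify_slack(context):
    # Airflow passes the task context: task instance, exception, and run details
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed: {context.get('exception')}")  # replace with a real webhook/email call
DEFAULT_ARGS = {
    "owner": "data",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "on_failure_callback": notify_slack,  # fires once the task has exhausted its retries
}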
2) Prefect: Parameterized flow with a CRON schedule
# prefect_flow.py
from datetime import date
from typing import Optional
from prefect import flow, task
@task(retries=3, retry_delay_seconds=300)
def transform(ds: str):
    # ds is YYYY-MM-DD; use it in SQL filters
    print(f"Transform for partition {ds}")
@flow
def daily_pipeline(ds: Optional[str] = None):
    # Resolve the default at run time, not at import time
    ds = ds or date.today().isoformat()
    transform.submit(ds)
if __name__ == "__main__":
    # Local parameterized run
    daily_pipeline()
Deploy this flow with a CRON (e.g., 0 2 * * *) and pass ds as a parameter. Keep tasks small and idempotent.
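One way to attach that cron schedule, assuming a recent Prefect release where flow.serve() accepts cron and parameters, is to serve the flow as a deployment; the deployment name below is illustrative.
# Sketch: schedule daily_pipeline at 02:00 with a cron deployment
# (assumes a recent Prefect release where flow.serve() accepts cron and parameters)
from prefect_flow import daily_pipeline  # the flow defined above
if __name__ == "__main__":
    daily_pipeline.serve(
        name="daily-pipeline-0200",  # illustrative deployment name
        cron="0 2 * * *",  # 02:00 daily
        parameters={"ds": None},  # the flow resolves ds to today's date when None
    )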
3) Safe backfill with partitioned MERGE
-- Idempotent load for one day partition
MERGE INTO analytics.sales_daily tgt
USING (
    SELECT CAST('{{ ds }}' AS DATE) AS partition_date, SUM(amount) AS total
    FROM raw.sales
    WHERE sale_date = CAST('{{ ds }}' AS DATE)
    GROUP BY 1
) src
ON tgt.partition_date = src.partition_date
WHEN MATCHED THEN UPDATE SET total = src.total
WHEN NOT MATCHED THEN INSERT (partition_date, total) VALUES (src.partition_date, src.total);
Run once per partition date. On backfill, rerunning yields the same result—no duplicates.
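To backfill the last few days, the minimal sketch below loops over the missing dates and runs the idempotent MERGE once per partition; run_query is a hypothetical warehouse client call, and the %(ds)s parameter style assumes a DB-API driver.
# Sketch: backfill the last N day-partitions by running the idempotent MERGE once per date
# (run_query is a hypothetical warehouse call; MERGE_SQL mirrors the statement above with a placeholder)
from datetime import date, timedelta
MERGE_SQL = """
MERGE INTO analytics.sales_daily tgt
USING (
    SELECT CAST(%(ds)s AS DATE) AS partition_date, SUM(amount) AS total
    FROM raw.sales
    WHERE sale_date = CAST(%(ds)s AS DATE)
    GROUP BY 1
) src
ON tgt.partition_date = src.partition_date
WHEN MATCHED THEN UPDATE SET total = src.total
WHEN NOT MATCHED THEN INSERT (partition_date, total) VALUES (src.partition_date, src.total);
"""
def run_query(sql: str, params: dict) -> None:
    # Hypothetical placeholder; replace with your warehouse client (e.g., a DB-API cursor.execute)
    print(f"Would execute MERGE for {params['ds']}")
def backfill(days: int = 3) -> None:
    today = date.today()
    for offset in range(days, 0, -1):
        ds = (today - timedelta(days=offset)).isoformat()
        run_query(MERGE_SQL, {"ds": ds})  # rerunning a date yields the same result (no duplicates)
if __name__ == "__main__":
    backfill(days=3)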
4) Orchestrating dbt as a task
# Example: Airflow BashOperator to run dbt build and save artifacts
from airflow.operators.bash import BashOperator
run_dbt = BashOperator(
    task_id="dbt_build",
    bash_command=(
        "cd /opt/dbt_project && "
        "dbt build --profiles-dir . --target prod "
        "--vars 'run_date: {{ ds }}' && "  # dbt --vars expects a YAML/JSON mapping
        "echo 'Artifacts in target/'"
    ),
)
Upstream tasks prepare data; dbt builds models; artifacts (manifest, run results) can be parsed later for lineage and monitoring.
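As one way to use those artifacts, dbt writes target/run_results.json after a build; the small script below summarizes statuses for monitoring. The path and failure handling are illustrative, and the field names reflect the run_results.json layout in recent dbt versions.
# Sketch: summarize dbt run results from target/run_results.json
# (path is illustrative; "results", "status", and "unique_id" reflect recent dbt artifact schemas)
import json
from collections import Counter
from pathlib import Path
def summarize_run_results(path: str = "/opt/dbt_project/target/run_results.json") -> Counter:
    results = json.loads(Path(path).read_text())["results"]
    statuses = Counter(r["status"] for r in results)  # e.g. success, error, fail, skipped
    failed = [r["unique_id"] for r in results if r["status"] in ("error", "fail")]
    if failed:
        raise ValueError(f"dbt reported failures: {failed}")
    return statuses
if __name__ == "__main__":
    print(summarize_run_results())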
5) Monitoring, freshness, and lightweight checks
# Simple row-count check after load
from airflow.operators.python import PythonOperator
def assert_min_rows(min_rows: int, **ctx):
    # Replace with a real query to your warehouse
    row_count = 1200  # example value from a query result
    if row_count < min_rows:
        raise ValueError(f"Row count {row_count} < {min_rows}")
check_rows = PythonOperator(
    task_id="check_rows",
    python_callable=assert_min_rows,
    op_kwargs={"min_rows": 1000},
)
Combine SLA, task durations, success ratio, and sanity checks (e.g., row counts, max timestamp). Use alerts to notify on failures or SLA misses.
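A freshness check along those lines is sketched below; latest_event_ts is a hypothetical query helper and the two-hour threshold is illustrative. It compares the newest event timestamp to the current time and fails the run when the lag exceeds the allowed SLA.
# Sketch: freshness check comparing the max event timestamp to now
# (latest_event_ts is a hypothetical query helper; the 2-hour threshold is illustrative)
from datetime import datetime, timedelta, timezone
def latest_event_ts() -> datetime:
    # Replace with: SELECT MAX(event_ts) FROM your critical model
    return datetime.now(timezone.utc) - timedelta(minutes=30)
def assert_fresh(max_lag: timedelta = timedelta(hours=2)) -> None:
    lag = datetime.now(timezone.utc) - latest_event_ts()
    if lag > max_lag:
        raise ValueError(f"Data is stale: {lag} behind; SLA allows {max_lag}")
    print(f"Freshness OK: {lag} behind")
if __name__ == "__main__":
    assert_fresh()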
Drills and exercises
- Create a DAG/flow with three tasks: extract → transform → load; enforce dependencies.
- Add retries (3) with exponential backoff; simulate a transient failure to confirm retry.
- Parameterize the run date and use it to filter a partition in SQL.
- Turn on catchup and backfill the last 5 days safely using MERGE/UPSERT.
- Add a row-count check and fail the run if below a threshold.
- Measure run duration; note a baseline; alert if it increases by 2× over a week.
- Integrate a dbt build step and store artifacts for later analysis.
Common mistakes and debugging tips
- Mixing business logic and orchestration logic in one giant task. Tip: keep tasks small, single-purpose, and idempotent.
- Using wall-clock time inside SQL instead of the run's logical date. Tip: pass run parameters (ds) explicitly.
- Reprocessing the same partition without idempotency. Tip: use MERGE/UPSERT by partition key.
- Infinite retries on permanent failures. Tip: cap retries and surface alerts.
- Ignoring data freshness. Tip: track max event timestamp vs. now; alert when late beyond SLA.
- Backfilling with overlapping windows. Tip: process smallest safe partition (day/hour) and verify counts after.
- No concurrency limits. Tip: set max_active_runs and task-level concurrency to avoid warehouse overload.
Mini project: Daily partitioned pipeline with dbt
Goal: Build a daily pipeline that loads raw events into a partitioned table, runs dbt models, checks freshness, and alerts on issues. A skeleton sketch follows the steps below.
Step 1 — Define the schedule and parameters
- Schedule: 02:00 daily; disable catchup initially.
- Parameter: ds (YYYY-MM-DD) used in all SQL and dbt vars.
Step 2 — Extract and stage
- Create an extract task that reads only ds partition from raw.
- Write to a staging table keyed by ds.
Step 3 — Transform with idempotent MERGE
- Transform aggregated metrics per ds.
- Use MERGE/UPSERT targeting the ds key.
Step 4 — Run dbt build
- Run dbt build with a var run_date = ds.
- Store artifacts (manifest, run_results.json) for review.
Step 5 — Freshness and quality checks
- Check row counts and latest timestamps in critical models.
- Fail the run if thresholds are not met.
Step 6 — Backfill and observe
- Enable catchup and backfill the last 7 days.
- Verify no duplicates; compare counts between raw and modeled.
- Review durations, failures, and alerts; tune retries.
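Putting the steps together, the skeleton below is a minimal Airflow sketch of the mini project; the callables, the /opt/dbt_project path, and the dbt --vars value are placeholders to replace with your own.
# Sketch: skeleton DAG for the mini project (callables, paths, and thresholds are placeholders)
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
def extract_stage(**ctx):
    ds = ctx["data_interval_start"].date().isoformat()
    print(f"Extract raw events for {ds} into staging")  # Step 2
def merge_partition(**ctx):
    ds = ctx["data_interval_start"].date().isoformat()
    print(f"MERGE aggregated metrics for {ds}")  # Step 3: idempotent by ds
def freshness_and_quality(**ctx):
    print("Check row counts and latest timestamps; raise on breach")  # Step 5
with DAG(
    dag_id="mini_project_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # Step 1: 02:00 daily
    catchup=False,  # enable later for the 7-day backfill (Step 6)
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
    max_active_runs=1,
) as dag:
    extract = PythonOperator(task_id="extract_stage", python_callable=extract_stage)
    merge = PythonOperator(task_id="merge_partition", python_callable=merge_partition)
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt_project && dbt build --vars 'run_date: {{ ds }}'",  # Step 4
    )
    checks = PythonOperator(task_id="freshness_and_quality", python_callable=freshness_and_quality)
    extract >> merge >> dbt_build >> checks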
Subskills
- Scheduling And Dependencies — Build reliable schedules and define upstream/downstream tasks clearly.
- DAG Concepts And Task Design — Express your pipeline as small, idempotent tasks with explicit data intervals.
- Retry Policies And Alerts — Use capped retries, exponential backoff, and actionable notifications.
- Handling Backfills Safely — Recompute past partitions without duplication or gaps.
- Parameterized Runs And Partitions — Pass run parameters and target partitioned tables/models.
- Monitoring And Observability Basics — Track logs, durations, success rates, and simple data checks.
- Managing SLAs And Freshness — Define expectations and alert when late or stale.
- Integrating dbt Runs Into Orchestrators — Trigger dbt build/test and capture artifacts.
Next steps
- Practice the drills, then complete the mini project.
- Review the subskills that felt weakest.
- Take the skill exam below. The exam is available to everyone; only logged-in users will have their progress saved.