Why this skill matters for ETL Developers
Scheduling and orchestration turn your ETL code into reliable, repeatable pipelines. You define when jobs run, in what order, how to recover from failures, and how to backfill historical data. Done well, this reduces late data, paging at 3am, and business risk. You will use these skills daily to run batch loads, maintain SLAs, and coordinate tasks across storage, compute, and analytics tools.
What you'll learn
- Designing job schedules (cron and event-based), aligning time zones and SLAs.
- Expressing dependencies so tasks run in the correct order.
- Retries, alerts, and failure recovery patterns that keep pipelines resilient.
- Backfills and controlled reruns without corrupting data.
- Parameterizing runs by date partition for partitioned tables and incremental loads.
- Monitoring, SLAs, and actionable runbooks for smooth operations.
Who this is for
- ETL Developers and Data Engineers building batch/incremental pipelines.
- Analysts/Analytics Engineers who schedule dbt or SQL transformations.
- Ops-minded developers who want reliable data delivery.
Prerequisites
- Comfort writing SQL and basic Python or shell scripting.
- Basic understanding of data warehouses/lakes and partitioned tables.
- Familiarity with at least one orchestrator (e.g., Airflow/Prefect/dbt Cloud) helps, but examples are self-contained.
Learning path
Worked examples
Example 1: Cron basics and time zones
Goal: Run a job every weekday at 02:15 UTC.
# Cron format: minute hour day-of-month month day-of-week
15 2 * * 1-5 /usr/local/bin/run_etl.sh >> /var/log/etl.log 2>&1
- Interpretation: At 02:15 UTC, Monday–Friday.
- Time zones: If stakeholders expect 2:15 AM in New York, either adjust for DST or run the orchestrator in a fixed time zone (e.g., UTC) and communicate the local equivalents.
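If the requirement really is local time, a DST-aware option is to give the orchestrator a timezone-aware start date so the cron expression is evaluated in that zone. A minimal Airflow sketch, assuming pendulum is available and using an illustrative DAG id:
import pendulum
from airflow import DAG

local_tz = pendulum.timezone("America/New_York")

with DAG(
    dag_id="ny_local_sales_load",                      # illustrative name
    start_date=pendulum.datetime(2024, 1, 1, tz=local_tz),
    schedule="15 2 * * *",                             # evaluated in America/New_York; DST handled by Airflow
    catchup=False,
) as dag:
    ...                                                # tasks go here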
Example 2: Airflow DAG with dependencies, retries, and SLA
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

def ds_to_nodash(ds):
    # Helper for building path-friendly dates; Airflow also provides {{ ds_nodash }} directly.
    return ds.replace('-', '')

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="15 2 * * *",  # 02:15 UTC daily
    catchup=True,
    max_active_runs=1,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
        "email_on_failure": True,
        # Airflow sends SLA miss notifications; tasks won't fail solely due to SLA
    },
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="python etl_extract.py --date {{ ds }}",
        sla=timedelta(hours=4),
    )
    transform = BashOperator(
        task_id="transform",
        bash_command=(
            "python etl_transform.py --date {{ ds }} --out /data/curated/{{ ds }}"
        ),
    )
    load = BashOperator(
        task_id="load",
        bash_command=(
            "python etl_load.py --date {{ ds }} --path /data/curated/{{ ds }}"
        ),
    )
    extract >> transform >> load
- Dependencies: extract -> transform -> load.
- Retries: up to 3 retries with a 10-minute delay between attempts.
- SLA: 4 hours for extract to complete; misses generate notifications rather than task failures (see the callback sketch after this list).
- catchup=True makes the scheduler create runs for every missed interval back to start_date, which enables backfills.
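To route SLA-miss notifications somewhere more useful than email, Airflow lets you attach a DAG-level sla_miss_callback. A minimal sketch; the print call is a placeholder for your alerting integration:
def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Airflow 2.x invokes this with the DAG, the tasks that missed their SLA,
    # any blocking tasks, the SlaMiss records, and the blocking task instances.
    print(f"SLA miss in {dag.dag_id}: {task_list}")  # replace with Slack/PagerDuty/etc.

# Attach it when defining the DAG, e.g. DAG(..., sla_miss_callback=notify_sla_miss)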
Example 3: Parameterizing by date partition (Jinja templates)
# Using Airflow templating variables
db_path = "/warehouse/sales/dt={{ ds_nodash }}" # 20240131
query = (
    "MERGE INTO sales_curated t USING staging s ON t.id=s.id "
    "WHEN MATCHED THEN UPDATE SET t.amount=s.amount, t.dt='{{ ds }}' "
    "WHEN NOT MATCHED THEN INSERT (id, amount, dt) VALUES (s.id, s.amount, '{{ ds }}')"
)
- Use {{ ds }} (YYYY-MM-DD) for partition columns and {{ ds_nodash }} (YYYYMMDD) for file paths.
- Make your tasks idempotent so reruns do not duplicate rows.
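One simple idempotency pattern is to overwrite exactly one date partition per run, so a rerun replaces the same rows instead of appending. A sketch using the illustrative sales_curated/staging tables from above; adapt the SQL to your warehouse's syntax:
overwrite_sql = (
    "DELETE FROM sales_curated WHERE dt = '{{ ds }}'; "
    "INSERT INTO sales_curated (id, amount, dt) "
    "SELECT id, amount, '{{ ds }}' FROM staging"
)
# Rerunning the same logical date deletes and reloads only that partition,
# so duplicates cannot accumulate across retries or backfills.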
Example 4: Safe backfill and partial rerun
# Airflow CLI backfill example
# Backfill Jan 1 to Jan 7 (inclusive)
airflow dags backfill -s 2024-01-01 -e 2024-01-07 daily_sales_pipeline
# Rerun only the failed transform task for a specific execution date
airflow tasks clear daily_sales_pipeline -t "transform" \
  --start-date 2024-01-03 --end-date 2024-01-03 --only-failed
# Downstream tasks are left untouched unless you also pass -d/--downstream
- Validate after backfill: compare row counts and checksums to expected totals.
- Protect with idempotent loads (MERGE/UPSERT) and partition overwrite.
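A lightweight post-backfill check is to compare per-partition aggregates between the target and its source or staging data. A sketch with illustrative table and column names:
validation_sql = (
    "SELECT dt, COUNT(*) AS row_count, SUM(amount) AS amount_total "
    "FROM sales_curated "
    "WHERE dt BETWEEN '2024-01-01' AND '2024-01-07' "
    "GROUP BY dt ORDER BY dt"
)
# Run the same aggregation against staging/source for the backfilled range;
# mismatched counts or totals point to duplicates or gaps introduced by the backfill.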
Example 5: Failure handling and alerts
import random
from datetime import timedelta

from airflow.operators.python import PythonOperator

def flaky():
    # Fails roughly half the time to simulate a transient upstream error
    if random.random() < 0.5:
        raise RuntimeError("Transient API error")

flaky_task = PythonOperator(
    task_id="flaky_task",
    python_callable=flaky,
    retries=5,
    retry_exponential_backoff=True,
    max_retry_delay=timedelta(minutes=30),
)
- Use exponential backoff and a max retry delay to smooth transient issues.
- Send alerts on final failure; add context (run id, partition, logs) to help triage fast.
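One way to make the final-failure alert self-explanatory is an on_failure_callback that pulls identifiers from the standard Airflow task context; send_to_alert_channel is a placeholder for your Slack/PagerDuty/email integration:
def notify_failure(context):
    ti = context["task_instance"]
    message = (
        f"Task {ti.task_id} in DAG {ti.dag_id} failed for partition {context['ds']} "
        f"(run {context['run_id']}). Logs: {ti.log_url}"
    )
    send_to_alert_channel(message)  # placeholder: route to your alerting tool

# Attach it to the task (or via default_args) so it fires only after retries are exhausted:
# PythonOperator(..., on_failure_callback=notify_failure)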
Example 6: Concurrency controls and dependencies across datasets
# Limit parallel runs and prevent overlap
with DAG(
    dag_id="inventory_sync",
    start_date=datetime(2024, 1, 1),
    schedule="0 * * * *",
    max_active_runs=1,    # no overlapping hourly runs
    max_active_tasks=8,   # cap total running tasks for this DAG (formerly `concurrency`)
    catchup=False,
) as dag:
    ...
- max_active_runs=1 prevents overlapping windows.
- Use sensors or explicit dependency signals to ensure upstream datasets are ready.
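A common readiness signal is a sensor that waits for an upstream artifact before downstream work starts. A sketch with an illustrative file path and downstream task name:
from airflow.sensors.filesystem import FileSensor

wait_for_upstream = FileSensor(
    task_id="wait_for_upstream_file",
    filepath="/data/incoming/inventory/{{ ds_nodash }}/_SUCCESS",  # illustrative ready flag
    poke_interval=300,     # check every 5 minutes
    timeout=2 * 60 * 60,   # give up (and fail) after 2 hours
    mode="reschedule",     # release the worker slot between checks
)
wait_for_upstream >> sync_inventory  # sync_inventory is an illustrative downstream task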
Drills and exercises
- [ ] Translate these cron expressions to plain English: "0 2 * * *", "30 6 * * 1-5", "0 */3 * * *".
- [ ] Write a DAG with three tasks A -> B -> C. Add retries=2 for B only.
- [ ] Add parameterization by date: pass {{ ds }} into a SQL query that updates a dt partition.
- [ ] Simulate a failure (exit 1) and verify your alert fires with helpful context.
- [ ] Run a controlled backfill for the last 7 days. Prove idempotency with counts before/after.
- [ ] Add an SLA (2 hours) to a task and confirm how SLA misses are reported in your tool.
Common mistakes and debugging tips
- Time zone drift: Running in local time causes DST surprises. Standardize on UTC and translate for stakeholders.
- Non-idempotent writes: INSERT-only reruns duplicate rows. Use MERGE/UPSERT or partition overwrite.
- Hidden dependencies: Relying on timing instead of explicit signals leads to race conditions. Declare dependencies or use sensors/ready flags.
- Overlapping runs: Missing max_active_runs or concurrency limits causes contention. Cap concurrency and prevent overlap.
- Backfilling with changed logic: Code drift between historical and current runs skews numbers. Tag versions and document backfill logic.
- Alert noise: Alerts without context get ignored. Include run date, task id, environment, and a link to logs. Route to the right channel.
Debugging playbook (quick wins)
- Check if the task is retrying or truly failed. If transient, increase backoff.
- Reproduce locally with the same parameters (e.g., date). Compare logs.
- Query target partitions to confirm duplicates or gaps.
- Audit upstream completeness before rerunning downstream tasks.
- Use a dry-run mode or limit scope to one partition when testing fixes.
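For a quick, low-risk reproduction, Airflow's CLI can run a single task for one logical date without recording state in the metadata database (the DAG, task, and date below come from the earlier examples):
# Execute just the transform task for 2024-01-03, outside the scheduler
airflow tasks test daily_sales_pipeline transform 2024-01-03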
Practical projects
- Daily Sales Pipeline: Extract from API, transform, and MERGE into a warehouse partitioned by dt with SLA and alerts.
- Hourly Inventory Sync: Prevent overlaps with max_active_runs=1 and use a sensor to wait for upstream file arrival.
- 7-Day Historical Backfill: Backfill a metric table safely; validate and document the process in a runbook.
Mini project: Daily Sales Pipeline with Backfill and SLA
Subskills
- Job Scheduling Basics: Cron, time zones, calendars, avoiding overlaps.
- Dependency Management: Upstream/downstream, sensors, data readiness signals.
- Retries And Alerts: Backoff, jitter, failure notifications with context.
- Backfills And Reruns: Historical loads with idempotent writes and validation.
- Parameterization By Date Partition: Templating run dates into queries and paths.
- Handling Failures And Recovery: Resuming safely, partial clears, and state reconciliation.
- SLAs And Monitoring: Timeliness tracking, dashboards, and alert routing.
- Runbooks For Operations: Step-by-step procedures for common incidents.
Next steps
- Complete the drills, then build the mini project end-to-end.
- Take the Skill Exam below to check your readiness.
- Iterate on your runbook after your first real incident or backfill.