Why this skill matters for ETL Developers
Scheduling and orchestration turn your ETL code into reliable, repeatable pipelines. You define when jobs run, in what order, how to recover from failures, and how to backfill historical data. Done well, this reduces late data, paging at 3am, and business risk. You will use these skills daily to run batch loads, maintain SLAs, and coordinate tasks across storage, compute, and analytics tools.
What you'll learn
- Designing job schedules (cron and event-based), aligning time zones and SLAs.
- Expressing dependencies so tasks run in the correct order.
- Retries, alerts, and failure recovery patterns that keep pipelines resilient.
- Backfills and controlled reruns without corrupting data.
- Parameterizing runs by date partition for partitioned tables and incremental loads.
- Monitoring, SLAs, and actionable runbooks for smooth operations.
Who this is for
- ETL Developers and Data Engineers building batch/incremental pipelines.
- Analysts/Analytics Engineers who schedule dbt or SQL transformations.
- Ops-minded developers who want reliable data delivery.
Prerequisites
- Comfort writing SQL and basic Python or shell scripting.
- Basic understanding of data warehouses/lakes and partitioned tables.
- Familiarity with at least one orchestrator (e.g., Airflow/Prefect/dbt Cloud) helps, but examples are self-contained.
Learning path
Worked examples
Example 1: Cron basics and time zones
Goal: Run a job every weekday at 02:15 UTC.
# Cron format: minute hour day-of-month month day-of-week
15 2 * * 1-5 /usr/local/bin/run_etl.sh >> /var/log/etl.log 2>&1
- Interpretation: At 02:15 UTC, Monday–Friday.
- Time zones: If stakeholders expect 2:15 AM in New York, either adjust for DST or run the orchestrator in a fixed time zone (e.g., UTC) and communicate the local equivalents.
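If the requirement really is local time, a DST-aware option is to give the orchestrator a timezone-aware start date so the cron expression is evaluated in that zone. A minimal Airflow sketch, assuming pendulum is available and using an illustrative DAG id:
import pendulum
from airflow import DAG

local_tz = pendulum.timezone("America/New_York")

with DAG(
    dag_id="ny_local_sales_load",                      # illustrative name
    start_date=pendulum.datetime(2024, 1, 1, tz=local_tz),
    schedule="15 2 * * *",                             # evaluated in America/New_York; DST handled by Airflow
    catchup=False,
) as dag:
    ...                                                # tasks go here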
Example 2: Airflow DAG with dependencies, retries, and SLA
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

def ds_to_nodash(ds):
    # Helper for building path-friendly dates; Airflow also provides {{ ds_nodash }} directly.
    return ds.replace('-', '')

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="15 2 * * *",  # 02:15 UTC daily
    catchup=True,
    max_active_runs=1,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
        "email_on_failure": True,
        # Airflow sends SLA miss notifications; tasks won't fail solely due to SLA
    },
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="python etl_extract.py --date {{ ds }}",
        sla=timedelta(hours=4),
    )
    transform = BashOperator(
        task_id="transform",
        bash_command=(
            "python etl_transform.py --date {{ ds }} --out /data/curated/{{ ds }}"
        ),
    )
    load = BashOperator(
        task_id="load",
        bash_command=(
            "python etl_load.py --date {{ ds }} --path /data/curated/{{ ds }}"
        ),
    )
    extract >> transform >> load
- Dependencies: extract -> transform -> load.
- Retries: up to 3 retries with a 10-minute delay between attempts.
- SLA: 4 hours for extract to complete; misses generate notifications rather than task failures (see the callback sketch after this list).
- catchup=True makes the scheduler create runs for every missed interval back to start_date, which enables backfills.
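To route SLA-miss notifications somewhere more useful than email, Airflow lets you attach a DAG-level sla_miss_callback. A minimal sketch; the print call is a placeholder for your alerting integration:
def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Airflow 2.x invokes this with the DAG, the tasks that missed their SLA,
    # any blocking tasks, the SlaMiss records, and the blocking task instances.
    print(f"SLA miss in {dag.dag_id}: {task_list}")  # replace with Slack/PagerDuty/etc.

# Attach it when defining the DAG, e.g. DAG(..., sla_miss_callback=notify_sla_miss)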
Example 3: Parameterizing by date partition (Jinja templates)
# Using Airflow templating variables
db_path = "/warehouse/sales/dt={{ ds_nodash }}" # 20240131
query = (
    "MERGE INTO sales_curated t USING staging s ON t.id=s.id "
    "WHEN MATCHED THEN UPDATE SET t.amount=s.amount, t.dt='{{ ds }}' "
    "WHEN NOT MATCHED THEN INSERT (id, amount, dt) VALUES (s.id, s.amount, '{{ ds }}')"
)
- Use {{ ds }} (YYYY-MM-DD) for partition columns and {{ ds_nodash }} (YYYYMMDD) for file paths.
- Make your tasks idempotent so reruns do not duplicate rows.
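One simple idempotency pattern is to overwrite exactly one date partition per run, so a rerun replaces the same rows instead of appending. A sketch using the illustrative sales_curated/staging tables from above; adapt the SQL to your warehouse's syntax:
overwrite_sql = (
    "DELETE FROM sales_curated WHERE dt = '{{ ds }}'; "
    "INSERT INTO sales_curated (id, amount, dt) "
    "SELECT id, amount, '{{ ds }}' FROM staging"
)
# Rerunning the same logical date deletes and reloads only that partition,
# so duplicates cannot accumulate across retries or backfills.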
Example 4: Safe backfill and partial rerun
# Airflow CLI backfill example
# Backfill Jan 1 to Jan 7 (inclusive)
airflow dags backfill -s 2024-01-01 -e 2024-01-07 daily_sales_pipeline
# Rerun only the failed transform task for a specific execution date
airflow tasks clear daily_sales_pipeline -t "transform" \
  --start-date 2024-01-03 --end-date 2024-01-03 --only-failed
# Downstream tasks are left untouched unless you also pass -d/--downstream
- Validate after backfill: compare row counts and checksums to expected totals.
- Protect with idempotent loads (MERGE/UPSERT) and partition overwrite.
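A lightweight post-backfill check is to compare per-partition aggregates between the target and its source or staging data. A sketch with illustrative table and column names:
validation_sql = (
    "SELECT dt, COUNT(*) AS row_count, SUM(amount) AS amount_total "
    "FROM sales_curated "
    "WHERE dt BETWEEN '2024-01-01' AND '2024-01-07' "
    "GROUP BY dt ORDER BY dt"
)
# Run the same aggregation against staging/source for the backfilled range;
# mismatched counts or totals point to duplicates or gaps introduced by the backfill.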
Example 5: Failure handling and alerts
import random
from datetime import timedelta

from airflow.operators.python import PythonOperator

def flaky():
    # Fails roughly half the time to simulate a transient upstream error
    if random.random() < 0.5:
        raise RuntimeError("Transient API error")

flaky_task = PythonOperator(
    task_id="flaky_task",
    python_callable=flaky,
    retries=5,
    retry_exponential_backoff=True,
    max_retry_delay=timedelta(minutes=30),
)
- Use exponential backoff and a max retry delay to smooth transient issues.
- Send alerts on final failure; add context (run id, partition, logs) to help triage fast.
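One way to make the final-failure alert self-explanatory is an on_failure_callback that pulls identifiers from the standard Airflow task context; send_to_alert_channel is a placeholder for your Slack/PagerDuty/email integration:
def notify_failure(context):
    ti = context["task_instance"]
    message = (
        f"Task {ti.task_id} in DAG {ti.dag_id} failed for partition {context['ds']} "
        f"(run {context['run_id']}). Logs: {ti.log_url}"
    )
    send_to_alert_channel(message)  # placeholder: route to your alerting tool

# Attach it to the task (or via default_args) so it fires only after retries are exhausted:
# PythonOperator(..., on_failure_callback=notify_failure)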
Example 6: Concurrency controls and dependencies across datasets
# Limit parallel runs and prevent overlap
with DAG(
    dag_id="inventory_sync",
    start_date=datetime(2024, 1, 1),
    schedule="0 * * * *",
    max_active_runs=1,    # no overlapping hourly runs
    max_active_tasks=8,   # cap total running tasks for this DAG (formerly `concurrency`)
    catchup=False,
) as dag:
    ...
- max_active_runs=1 prevents overlapping windows.
- Use sensors or explicit dependency signals to ensure upstream datasets are ready.
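A common readiness signal is a sensor that waits for an upstream artifact before downstream work starts. A sketch with an illustrative file path and downstream task name:
from airflow.sensors.filesystem import FileSensor

wait_for_upstream = FileSensor(
    task_id="wait_for_upstream_file",
    filepath="/data/incoming/inventory/{{ ds_nodash }}/_SUCCESS",  # illustrative ready flag
    poke_interval=300,     # check every 5 minutes
    timeout=2 * 60 * 60,   # give up (and fail) after 2 hours
    mode="reschedule",     # release the worker slot between checks
)
wait_for_upstream >> sync_inventory  # sync_inventory is an illustrative downstream task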
Drills and exercises
- [ ] Translate these cron expressions to plain English: "0 2 * * *", "30 6 * * 1-5", "0 */3 * * *".
- [ ] Write a DAG with three tasks A -> B -> C. Add retries=2 for B only.
- [ ] Add parameterization by date: pass {{ ds }} into a SQL query that updates a dt partition.
- [ ] Simulate a failure (exit 1) and verify your alert fires with helpful context.
- [ ] Run a controlled backfill for the last 7 days. Prove idempotency with counts before/after.
- [ ] Add an SLA (2 hours) to a task and confirm how SLA misses are reported in your tool.
Common mistakes and debugging tips
- Time zone drift: Running in local time causes DST surprises. Standardize on UTC and translate for stakeholders.
- Non-idempotent writes: INSERT-only reruns duplicate rows. Use MERGE/UPSERT or partition overwrite.
- Hidden dependencies: Relying on timing instead of explicit signals leads to race conditions. Declare dependencies or use sensors/ready flags.
- Overlapping runs: Missing max_active_runs or concurrency limits causes contention. Cap concurrency and prevent overlap.
- Backfilling with changed logic: Code drift between historical and current runs skews numbers. Tag versions and document backfill logic.
- Alert noise: Alerts without context get ignored. Include run date, task id, environment, and a link to logs. Route to the right channel.
Debugging playbook (quick wins)
- Check if the task is retrying or truly failed. If transient, increase backoff.
- Reproduce locally with the same parameters (e.g., date). Compare logs.
- Query target partitions to confirm duplicates or gaps.
- Audit upstream completeness before rerunning downstream tasks.
- Use a dry-run mode or limit scope to one partition when testing fixes.
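For a quick, low-risk reproduction, Airflow's CLI can run a single task for one logical date without recording state in the metadata database (the DAG, task, and date below come from the earlier examples):
# Execute just the transform task for 2024-01-03, outside the scheduler
airflow tasks test daily_sales_pipeline transform 2024-01-03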
Practical projects
- Daily Sales Pipeline: Extract from API, transform, and MERGE into a warehouse partitioned by dt with SLA and alerts.
- Hourly Inventory Sync: Prevent overlaps with max_active_runs=1 and use a sensor to wait for upstream file arrival.
- 7-Day Historical Backfill: Backfill a metric table safely; validate and document the process in a runbook.
Mini project: Daily Sales Pipeline with Backfill and SLA
Subskills
- Job Scheduling Basics: Cron, time zones, calendars, avoiding overlaps.
- Dependency Management: Upstream/downstream, sensors, data readiness signals.
- Retries And Alerts: Backoff, jitter, failure notifications with context.
- Backfills And Reruns: Historical loads with idempotent writes and validation.
- Parameterization By Date Partition: Templating run dates into queries and paths.
- Handling Failures And Recovery: Resuming safely, partial clears, and state reconciliation.
- SLAs And Monitoring: Timeliness tracking, dashboards, and alert routing.
- Runbooks For Operations: Step-by-step procedures for common incidents.
Next steps
- Complete the drills, then build the mini project end-to-end.
- Take the Skill Exam below to check your readiness.
- Iterate on your runbook after your first real incident or backfill.