Who this is for
- ETL Developers and Data Engineers who need reliable, predictable pipelines.
- Analysts transitioning to building scheduled data jobs.
- Anyone working with cron, orchestrators (e.g., Airflow, Prefect), or cloud schedulers.
Prerequisites
- Basic command-line familiarity.
- Understanding of ETL/ELT steps and batch vs. streaming.
- Comfort with reading logs and simple error messages.
Why this matters
In real teams, data must arrive on time. Finance needs daily revenue by 7:00, marketing wants hourly user events, and machine learning retrains weekly during low-traffic hours. Scheduling turns your code into dependable operations: it defines when to run, what runs first, how failures retry, and how to avoid overlaps.
- You’ll coordinate dependencies (e.g., load sales only after raw files arrive).
- You’ll handle time zones and daylight saving changes without surprises.
- You’ll control retries, timeouts, concurrency, and alerts to keep SLAs green.
Concept explained simply
A scheduler decides when a job should start. An orchestrator coordinates many jobs with dependencies and rules. Think of the scheduler as the clock and the orchestrator as the conductor.
Mental model
Imagine a train schedule:
- Time tables = cron expressions or intervals (when).
- Switches = dependencies (only go if the previous train arrived).
- Signals = concurrency limits (prevent two trains on the same track).
- Delays and re-routing = retries and backoff.
- Control room = monitoring, SLAs, and alerts.
Core concepts and terminology
- Time-based scheduling: cron expressions (minute hour day-of-month month day-of-week). Example: 15 2 * * * means 02:15 every day.
- Intervals: run every N minutes/hours (e.g., every 15 minutes).
- Time zones & DST: prefer UTC for schedules; convert timestamps at the edges. Daylight saving can cause skips or double-runs.
- Dependencies: ensure upstream jobs or data partitions exist before running. DAGs represent these relationships.
- Retries & backoff: automatic re-attempts with delays (e.g., 3 retries, exponential backoff).
- Concurrency & locking: limit parallel runs (e.g., only 1 active run) to avoid overlaps.
- Timeouts & SLAs: fail or alert if a job exceeds a runtime limit; track if data is late.
- Idempotency: running the same task twice yields the same final state (safe retries and backfills).
- Calendars & holidays: pause/skip on defined calendars (e.g., business days only).
- Event-driven triggers: start when an event happens (file arrival, message), often combined with time-based safety windows.
- Monitoring & alerting: logs, metrics, and notifications on failure/latency.
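The five-field cron format described above can be made concrete with a tiny labeling sketch (this only names the fields; it does not validate ranges or expand `*`/lists):

```python
# Minimal sketch: split a cron expression into its five named fields.
CRON_FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def label_cron(expr):
    parts = expr.split()
    if len(parts) != 5:
        raise ValueError("expected 5 cron fields")
    return dict(zip(CRON_FIELDS, parts))

schedule = label_cron("15 2 * * *")  # 02:15 every day
```

Reading the fields by name makes it easy to spot the classic mistake of swapping minute and hour.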
Worked examples
Example 1: Nightly sales ETL with file arrival
- Goal: Load sales daily at 02:15 after vendor file arrives (~01:30).
- Schedule: 15 2 * * * (use UTC if possible, e.g., 02:15 UTC).
- Dependency: A file sensor checks for the vendor file every 5 minutes, with a max wait of 1 hour; if the file never arrives, fail the sensor and alert.
- Timeout: Total job timeout 90 minutes.
- Alert: Page if not finished by 04:00 (SLA breach).
Why it works
Fixed time + sensor ensures you don’t load partial data. SLA gives a clear “late” signal.
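The file sensor in this example can be sketched in plain Python (path and intervals are illustrative; real orchestrators ship their own sensor primitives):

```python
import os
import time

def wait_for_file(path, poke_interval=300, max_wait=3600):
    """Poll for a file every `poke_interval` seconds; give up after `max_wait`."""
    deadline = time.monotonic() + max_wait
    while True:
        if os.path.exists(path):
            return True          # file landed; downstream load may start
        if time.monotonic() >= deadline:
            return False         # max wait exceeded; caller should alert
        time.sleep(poke_interval)
```

Returning False (rather than hanging) is what lets the 04:00 SLA alert fire on time.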
Example 2: Hourly incremental pipeline with backfill safety
- Goal: Ingest user events hourly at minute 10, process last closed hour.
- Schedule: 10 * * * *.
- Windowing: Each run reads [H-1, H) based on the run time, not system time.
- Concurrency: Limit to 1 active run to avoid overlapping windows.
- Retries: 5 attempts, exponential backoff starting at 2 minutes.
- Idempotency: Upsert into target partition for hour H-1; reruns are safe.
Why it works
Aligning processing to closed time windows prevents double-counting and enables safe retries/backfills.
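Deriving the closed window from the run time (not the wall clock) can be sketched as:

```python
from datetime import datetime, timedelta, timezone

def closed_hour_window(run_time):
    """Derive [H-1, H) from the scheduled run time, not from 'now'."""
    h = run_time.replace(minute=0, second=0, microsecond=0)
    return h - timedelta(hours=1), h

# A run triggered at 05:10 UTC processes [04:00, 05:00) UTC.
start, end = closed_hour_window(datetime(2024, 3, 1, 5, 10, tzinfo=timezone.utc))
```

Because the window depends only on the logical run time, a retry hours later still reads exactly the same interval.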
Example 3: Weekly full refresh skipping maintenance
- Goal: Full refresh every Sunday at 04:00, skipping the planned maintenance Sundays in January and July (months 1 and 7).
- Schedule: 0 4 * * 0 (every Sunday 04:00), plus a calendar that excludes maintenance days; or move to Monday 04:00 on those dates.
- Timeout: 4 hours; alerts on overrun.
- DST: Keep in UTC to avoid shifting window.
Why it works
Explicit calendar logic prevents conflicts with maintenance. UTC avoids DST surprises.
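The "move to the next day" variant of the calendar logic can be sketched like this (the maintenance dates are hypothetical):

```python
from datetime import date, timedelta

def effective_run_date(scheduled, maintenance_days):
    """Shift a refresh off maintenance days to the next free day."""
    d = scheduled
    while d in maintenance_days:
        d += timedelta(days=1)
    return d

# Hypothetical calendar: Sunday 2025-01-05 is maintenance, so run Monday.
maintenance = {date(2025, 1, 5)}
```

The same helper also covers multi-day maintenance windows, since it keeps advancing until it finds a free day.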
Practical steps: design a reliable schedule
- Define freshness: when must data be ready? Set SLA (e.g., by 07:00 daily).
- Choose trigger: time-based (cron/interval) and/or event-based (file arrived). Add a max wait.
- Pick time zone: default to UTC. If business-hour specific, convert at the edges.
- Define windowing: process a closed interval (e.g., previous hour/day).
- Set safety controls: retries with backoff, timeouts, concurrency=1 for non-idempotent steps.
- Plan failures: clear alerts, rerun strategy, and idempotent writes.
- Document: cron, dependencies, retry policy, SLA, and runbook for on-call.
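One way to document those decisions in code is a small spec record (the field names here are our own, not any orchestrator's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduleSpec:
    """A record of the scheduling decisions for one job (illustrative)."""
    cron: str                      # when (assumed UTC)
    retries: int = 3               # max re-attempts
    backoff_seconds: int = 120     # initial backoff, doubled per attempt
    timeout_minutes: int = 90      # per-run hard limit
    max_active_runs: int = 1       # concurrency guard against overlaps
    sla: str = "07:00"             # "data ready by" deadline (UTC)

nightly_sales = ScheduleSpec(cron="15 2 * * *")
```

Keeping the spec next to the job code makes the retry policy and SLA reviewable in the same pull request as the pipeline itself.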
Exercises
These mirror the exercises in the Exercises panel below.
Exercise 1: Turn a requirement into a schedule
Requirement: “Load finance KPIs at 06:05 Monday–Friday. Skip public holidays. If a run fails, retry up to 3 times with 10-minute gaps. Alert if runtime exceeds 45 minutes. Avoid overlapping runs.”
- Deliverables:
- Cron expression (assume UTC).
- Concurrency and retry policy.
- SLA/timeout settings.
- Holiday handling approach.
Hints
- Cron fields order: minute hour day-of-month month day-of-week.
- Concurrency 1 prevents overlaps.
- Use a business-day calendar or skip logic.
Suggested solution
Cron: 5 6 * * 1-5 (06:05 Mon–Fri). Concurrency: 1. Retries: 3 with a 10-minute delay. Timeout: 45m; SLA: ready by 07:00. Holiday handling: maintain a holiday calendar, or add a pre-check step that exits cleanly on holidays.
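The holiday pre-check from this solution can be sketched as follows (the calendar here is a hypothetical stand-in for a shared holiday source):

```python
from datetime import date

# Hypothetical holiday calendar; real teams load this from a shared source.
HOLIDAYS = {date(2025, 12, 25)}

def is_business_day(d, holidays=HOLIDAYS):
    """Mon-Fri (weekday 0-4) and not a listed holiday."""
    return d.weekday() < 5 and d not in holidays
```

A pre-check task calls this and exits successfully (skipping downstream steps) when it returns False, so a holiday does not show up as a failure.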
Exercise 2: Fix a double-run issue
Symptom: Your job “daily_orders” ran twice around DST fall-back. The second run overwrote data.
- Task: Propose changes to prevent double-runs and protect data.
Hints
- Consider UTC schedule and idempotency.
- Set single active run and partitioned writes.
Suggested solution
Move the schedule to UTC. Enforce concurrency=1. Write to date-partitioned targets with upsert/replace keyed by the logical execution date, not the wall clock, so a repeated run updates the same partition deterministically. Add a uniqueness lock or run key to avoid duplicate triggers.
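The partitioned write keyed by logical date can be sketched with a plain dict standing in for the target table:

```python
def upsert_partition(store, logical_date, rows):
    """Replace the whole partition for a logical date: reruns converge
    to the same final state instead of appending duplicates."""
    store[logical_date] = list(rows)

store = {}
upsert_partition(store, "2024-11-03", [{"order_id": 1}])
upsert_partition(store, "2024-11-03", [{"order_id": 1}])  # rerun: same state
```

An append-based write would leave two copies of order 1 after the DST double-run; the replace-by-partition write leaves exactly one.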
Checklist: ready to schedule
- Schedule defined in UTC (or documented local with DST plan).
- Dependency checks (file/table sensors) with max wait and alerts.
- Windowing logic tied to run time (execution date).
- Retries with backoff and clear max attempts.
- Timeout per task and overall job SLA.
- Concurrency/locking to avoid overlaps.
- Idempotent writes and backfill plan.
- Monitoring: alerts on failure and lateness.
Common mistakes and how to self-check
- DST surprises: Local time schedules cause skips/doubles. Self-check: Does your job ever run twice or not at all on DST change? Fix: Use UTC or explicit DST handling.
- Overlapping runs: No locking. Self-check: Any two runs of the same job at once? Fix: Set concurrency=1 and use run keys.
- Non-idempotent writes: Appends duplicate rows. Self-check: Rerun the same execution date; do results change? Fix: Upsert/replace by partition/key.
- Missing dependencies: Processing before data lands. Self-check: Do runs ever start before upstream data arrives? Fix: Add sensors with a max wait and alerts.
- No timeouts: Jobs hang forever. Self-check: Do any runs exceed historical p95 runtime without failing? Fix: Add timeouts.
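The DST fall-back double-run is easy to demonstrate with Python's zoneinfo: the same local wall-clock time occurs twice, so a local-time schedule maps to two distinct UTC instants.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# At the 2024 US fall-back (Nov 3), 01:30 local happens twice.
ny = ZoneInfo("America/New_York")
first = datetime(2024, 11, 3, 1, 30, fold=0, tzinfo=ny).astimezone(timezone.utc)
second = datetime(2024, 11, 3, 1, 30, fold=1, tzinfo=ny).astimezone(timezone.utc)
```

Scheduling in UTC sidesteps this entirely, because UTC has no fold.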
Practical projects
- Project A: Build an hourly ingestion job that processes [H-1, H) with retries and concurrency=1; validate idempotency by rerunning the same hour.
- Project B: Create a daily DAG with a file sensor, a transform, and a load step, with an SLA alert if not done by 08:00.
- Project C: Implement a holiday-aware weekly job that skips on a given calendar but backfills the next business day.
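Project A's retry behavior (exponential backoff, as in Example 2) can be sketched as a generic wrapper; the attempt counts and delays below are illustrative defaults:

```python
import time

def run_with_retries(task, max_attempts=5, base_delay=120.0):
    """Re-run `task` with exponential backoff: 120s, 240s, 480s, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise                      # exhausted: surface the failure
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Pairing this with the idempotent writes from Project A means a retry after a partial failure simply rewrites the same hour's partition.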
Mini challenge
You inherit a pipeline that runs “0 0 * * *” local time and often double-loads on DST. In one paragraph, describe a migration plan to UTC with minimal downtime, including validation steps and a rollback plan.
Learning path
- Start: Job Scheduling Basics (this page).
- Next: Dependencies and Sensors in Orchestrators.
- Then: Retries, Backoff, and Idempotency Patterns.
- Advanced: SLAs, Observability, and On-call Runbooks.
Next steps
- Apply the checklist to one of your existing jobs.
- Configure alerts and timeouts for a critical pipeline.
- Run a controlled backfill to validate idempotency.
Quick Test and progress
Take the Quick Test below to check your understanding. It is available to everyone; progress is saved automatically for logged-in users.