Why this matters
As a Machine Learning Engineer, you build models that rely on timely data. Orchestration ensures that feature generation, model training, and evaluation happen in the right order, on the right schedule, with safe retries and visibility. This turns ad hoc scripts into reliable, production-grade pipelines.
- Automate nightly feature computation and model retraining.
- Run data quality checks before training to avoid bad models.
- Backfill historical runs when logic changes.
- Set alerts and retries so transient failures don’t break SLAs.
Who this is for
- ML/AI and data engineers moving from notebooks to production pipelines.
- Analysts automating recurring data tasks.
- Developers asked to support periodic jobs and data flows.
Prerequisites
- Comfort with Python or a similar language.
- Basic understanding of data pipelines (extract, transform, load, train, evaluate).
- Familiarity with the command line and cron concepts is helpful but not required.
Concept explained simply
Core terms
- Orchestration: Coordinating tasks so they run in the correct order, with visibility, retries, and logging.
- Scheduler: Decides when a job should run (e.g., every day at 02:15 UTC).
- DAG (Directed Acyclic Graph): A graph describing tasks and their dependencies (no cycles). Example: extract → features → train.
- Idempotency: Running a task multiple times yields the same end state. Critical for retries and backfills.
- Backfill: Running past intervals you missed or re-running history after logic changes.
- SLA (Service Level Agreement): The deadline by which a task or DAG should finish.
- Time zone: The clock context for schedules. Production often standardizes on UTC.
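To make the idempotency definition concrete, here is a minimal sketch that overwrites a date partition instead of appending to it. The function name write_partition, the file name features.json, and the sample row are illustrative assumptions, not part of any specific tool.

```python
import json
import tempfile
from pathlib import Path


def write_partition(base_dir: Path, run_date: str, rows: list) -> Path:
    """Overwrite the partition for run_date so reruns leave the same end state."""
    partition = base_dir / f"date={run_date}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "features.json"
    tmp = partition / "features.json.tmp"
    # Write to a temporary file first, then rename it over the target,
    # so a crash mid-write does not leave a half-written partition file.
    tmp.write_text(json.dumps(rows, sort_keys=True))
    tmp.replace(target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        rows = [{"user_id": 1, "clicks_7d": 12}]  # hypothetical feature row
        first = write_partition(Path(d), "2024-01-01", rows).read_text()
        # Simulated retry of the same logical date.
        second = write_partition(Path(d), "2024-01-01", rows).read_text()
        print(first == second)  # True
```

An append-based writer would double the rows on the second call; keying the output by logical date and overwriting is what makes the retry safe.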
Mental model
Imagine a train network:
- The scheduler is the timetable.
- Each task is a train that must depart only after previous trains arrive.
- Signals are dependencies; you can’t depart early.
- Idempotency is running the same train twice without double-charging passengers.
- Backfill is replaying missed days’ service to deliver delayed cargo.
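The "depart only after previous trains arrive" rule is a topological ordering of the task graph. The sketch below uses Python's standard graphlib module to derive a valid run order and to reject cycles; the task names are taken from the extract → features → train example, and run_order is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps a task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "features": {"extract"},
    "train": {"features"},
}


def run_order(deps: dict) -> list:
    """Return tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(run_order(dependencies))  # ['extract', 'features', 'train']
    try:
        run_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cycle detected: not a valid DAG")
```

Real orchestrators do far more than this (state tracking, parallelism, retries), but the ordering constraint is the same.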
Scheduling basics
- Time-based: Run on a fixed schedule (cron or interval).
- Event-based: Run when a file lands, a message appears, or an API signals readiness.
- Hybrid: Time-based window with event sensors (e.g., wait for yesterday’s partition).
Quick cron primer
Cron format: minute hour day_of_month month day_of_week. Times in this lesson are UTC unless stated; note that plain cron evaluates expressions in the host's local time zone. Examples:
- Every day at 02:15: 15 2 * * *
- Weekdays at 01:30: 30 1 * * 1-5
- Every 5 minutes: */5 * * * *
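To check your reading of an expression, a small matcher can help. This is a simplified sketch, not a full cron implementation: it supports only *, */n, ranges, and comma lists, uses 0 for Sunday (not 7), and requires all five fields to match, which differs from standard cron when both day fields are restricted. The function names are hypothetical.

```python
from datetime import datetime


def _field_matches(field: str, value: int, low: int, high: int) -> bool:
    """Return True if value satisfies one cron field (*, */n, a-b, a-b/n, lists)."""
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        if start <= value <= end and (value - start) % step_n == 0:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if the datetime falls on a minute selected by the expression."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = when.isoweekday() % 7  # cron convention: 0=Sunday ... 6=Saturday
    return (
        _field_matches(minute, when.minute, 0, 59)
        and _field_matches(hour, when.hour, 0, 23)
        and _field_matches(dom, when.day, 1, 31)
        and _field_matches(month, when.month, 1, 12)
        and _field_matches(dow, cron_dow, 0, 6)
    )


if __name__ == "__main__":
    print(cron_matches("30 1 * * 1-5", datetime(2024, 1, 1, 1, 30)))  # True (a Monday)
    print(cron_matches("30 1 * * 1-5", datetime(2024, 1, 6, 1, 30)))  # False (a Saturday)
    print(cron_matches("*/5 * * * *", datetime(2024, 1, 1, 0, 10)))   # True
```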
Worked examples
Example 1: Nightly feature job with cron
Goal: Compute features every day at 02:15 UTC, write to a partition path by date (to ensure idempotency).
# schedule: 15 2 * * * (02:15 UTC daily)
# Command idea (pseudo):
python features.py --run-date {{ ds }} --output s3://bucket/features/date={{ ds }}
# {{ ds }} represents the run's logical date (YYYY-MM-DD).
- Idempotency: Overwrite the specific partition each run.
- Retries: If the job fails, re-run safely without duplicating rows.
Example 2: Simple Airflow DAG for daily ML pipeline
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def noop(task):
    # Placeholder for real work; prints the task name.
    print(task)


# Applied to every task in the DAG unless a task overrides them.
default_args = {
    "owner": "ml",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="15 2 * * *",  # 02:15 UTC daily (Airflow 2.4+; older versions use schedule_interval)
    catchup=False,  # do not automatically run intervals missed since start_date
    default_args=default_args,
    tags=["example"],
) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=lambda: noop("extract"),
        execution_timeout=timedelta(minutes=20),
    )
    prepare = PythonOperator(
        task_id="prepare_features",
        python_callable=lambda: noop("prepare_features"),
        execution_timeout=timedelta(minutes=20),
    )
    train = PythonOperator(
        task_id="train",
        python_callable=lambda: noop("train"),
        execution_timeout=timedelta(minutes=20),
    )

    # Dependencies: extract runs first, then prepare_features, then train.
    extract >> prepare >> train
- Idempotency tip: Write outputs partitioned by run date, not wall-clock time.
- Retries: A failed task is retried up to twice, with a 5-minute delay between attempts.
Example 3: Event-triggered flow with a time guard
Goal: Start processing when a file for yesterday arrives, but never earlier than 01:00 UTC (upstream system may be late).
- Strategy: Time schedule at 01:00 UTC, first task waits/polls for the file up to 1 hour, then continues; fail if missing.
- Outcome: Safe mix of time and event with a clear timeout.
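The waiting step in this strategy could be sketched as a polling loop with a hard deadline. The name wait_for_file and its max_wait argument follow the pseudo-logic in this example; the poll_interval parameter and the pathlib-based check are assumptions for illustration. An orchestrator's built-in sensor would normally replace this.

```python
import time
from pathlib import Path


def wait_for_file(path: Path, max_wait: float, poll_interval: float = 30.0) -> bool:
    """Poll until path exists or max_wait seconds have elapsed.

    Returns True if the file appeared in time, False on timeout.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if path.exists():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))


if __name__ == "__main__":
    # Hypothetical landing path for one logical date.
    target = Path("/landing/2024-01-01/data.parquet")
    if wait_for_file(target, max_wait=2, poll_interval=0.5):
        print("file arrived, run ETL")
    else:
        print("timed out, fail and alert")
```

Using a monotonic clock for the deadline keeps the timeout correct even if the system clock is adjusted during the wait.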
# Pseudo-logic
schedule = "0 1 * * *"  # 01:00 UTC
# ds is the run's logical date (YYYY-MM-DD)
found = wait_for_file(path=f"/landing/{ds}/data.parquet", max_wait=3600)
if found:
    run_etl(ds)
else:
    fail_and_alert()
How to choose a schedule
- Business need: How fresh do features need to be? (hourly, daily, weekly)
- Upstream readiness: When is source data complete?
- Cost and concurrency: Fewer, larger runs vs. many small runs.
- Time zone: Prefer UTC for consistency; convert at presentation time unless the business requires local time.
Time zone pitfalls and a simple rule
- Rule: Use UTC schedules. If business requires local time (e.g., New York midnight), ensure the scheduler supports IANA time zones and confirm DST behavior.
- Never rely on server local time. Make it explicit.
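A quick way to confirm DST behavior is to convert the local schedule time to UTC on a winter date and a summer date. The sketch below uses Python's standard zoneinfo module with the New York example; the helper name is hypothetical, and zoneinfo needs the system time zone database (or the tzdata package) to be available.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def local_midnight_in_utc(year: int, month: int, day: int, tz: ZoneInfo = NEW_YORK) -> datetime:
    """Return the UTC instant corresponding to local midnight on the given date."""
    local = datetime(year, month, day, 0, 0, tzinfo=tz)
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    winter = local_midnight_in_utc(2024, 1, 15)
    summer = local_midnight_in_utc(2024, 7, 15)
    print(winter.isoformat())  # 2024-01-15T05:00:00+00:00
    print(summer.isoformat())  # 2024-07-15T04:00:00+00:00
```

A fixed UTC cron such as 0 5 * * * would therefore hit New York midnight in winter but 01:00 local time in summer, which is why a local-time requirement needs a scheduler that understands IANA zones.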
Exercises
Do these now. They mirror the graded exercises below.
Exercise 1: Cron for weekdays at 01:30 UTC
Create a cron expression that runs Monday–Friday at 01:30 UTC.
Hint
Use minute hour * * day_of_week with 1-5 for weekdays.
Expected answer
30 1 * * 1-5
Exercise 2: Minimal Airflow DAG
Write a minimal Airflow DAG with tasks extract → prepare_features → train, scheduled daily at 02:15 UTC. Include 2 retries with 5-minute delay and a 20-minute execution timeout per task. Make it idempotent by design.
Hint
- Use default_args for retries and execution_timeout per task.
- Schedule: 15 2 * * *
Example solution
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def run(step):
    print(step)


default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="ml_daily",
    start_date=datetime(2024, 1, 1),
    schedule="15 2 * * *",
    catchup=False,
    default_args=default_args,
):
    extract = PythonOperator(
        task_id="extract",
        python_callable=lambda: run("extract"),
        execution_timeout=timedelta(minutes=20),
    )
    prep = PythonOperator(
        task_id="prepare_features",
        python_callable=lambda: run("prepare_features"),
        execution_timeout=timedelta(minutes=20),
    )
    train = PythonOperator(
        task_id="train",
        python_callable=lambda: run("train"),
        execution_timeout=timedelta(minutes=20),
    )
    extract >> prep >> train
Idempotency note: read and write by {{ ds }} partition.
Checklist
- The schedule matches the requirement.
- Retries and retry delays are set.
- Per-task timeouts exist.
- Outputs are partitioned by the run’s logical date.
Common mistakes and how to self-check
- No idempotency: Reruns duplicate rows or retrain on wrong data. Self-check: Can I safely re-run yesterday’s job without manual cleanup?
- Scheduling before upstream readiness: Runs succeed with incomplete data. Self-check: Do I have a readiness wait or a schedule aligned to source completion?
- Forgetting timeouts: Tasks hang and block workers. Self-check: Does each task have a reasonable execution timeout?
- Over-broad retries: Retrying non-transient errors. Self-check: Do logs indicate transient vs. permanent failure? Consider circuit breakers or fail-fast for bad config.
- Ignoring backfills: Code works only for “today.” Self-check: Can my pipeline run for any historical date parameter?
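The "over-broad retries" point can be illustrated with a wrapper that retries only errors marked as transient and fails fast on everything else. This is a sketch under assumptions: TransientError and run_with_retries are hypothetical names, and in practice you would map your own timeout or throttling exceptions onto the transient category.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def run_with_retries(task, retries: int = 2, retry_delay: float = 0.0, sleep=time.sleep):
    """Call task(); retry only on TransientError, up to `retries` extra attempts."""
    attempt = 0
    while True:
        try:
            return task()
        except TransientError:
            if attempt >= retries:
                raise  # retries exhausted: surface the failure
            attempt += 1
            sleep(retry_delay)
        # Any other exception (bad config, schema mismatch) propagates
        # immediately, because retrying would not change the outcome.


if __name__ == "__main__":
    calls = []

    def flaky():
        calls.append(1)
        if len(calls) < 3:
            raise TransientError("temporary network issue")
        return "ok"

    print(run_with_retries(flaky, retries=2))  # ok
    print(len(calls))  # 3
```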
Quick self-audit
- Can I describe the DAG in one sentence (inputs, order, outputs)?
- Could I re-run the last 7 days with one command and get correct results?
- Do I know exactly when (UTC) each task starts and the expected finish time?
Practical projects
- Daily features pipeline: Extract, transform, validate, and publish features to a partitioned store; include a simple data quality gate.
- Weekly retrain and evaluation: Retrain a model on the last 90 days, compute metrics, and only promote if metrics improve.
- Backfill tool: A small script to enqueue backfills for a date range while respecting concurrency limits.
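As a starting point for the backfill tool, the sketch below runs a per-date function over an inclusive date range with a cap on concurrent runs. It uses a thread pool as a stand-in for enqueueing into a real orchestrator; backfill, date_range, and the max_concurrency parameter are hypothetical names, and the per-date function is assumed to be idempotent.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def date_range(start: date, end: date):
    """Yield each date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(run_for_date, start: date, end: date, max_concurrency: int = 2) -> dict:
    """Run run_for_date for every date in the range, at most max_concurrency at a time."""
    days = list(date_range(start, end))
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() preserves input order and re-raises the first task exception.
        results = list(pool.map(run_for_date, days))
    return dict(zip((d.isoformat() for d in days), results))


if __name__ == "__main__":
    def fake_job(day: date) -> str:
        # Stand-in for the real pipeline run for one logical date.
        return f"features/date={day.isoformat()}"

    for day, output in backfill(fake_job, date(2024, 1, 1), date(2024, 1, 3)).items():
        print(day, output)
```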
Learning path
- Start: Simple cron jobs for one task.
- Move to DAGs with dependencies and retries.
- Add data quality checks and SLAs.
- Introduce event-driven triggers and backfills.
- Harden with idempotency, observability, and resource limits.
Mini challenge
Design a weekly retraining pipeline that runs Sunday at 23:30 Europe/London time, waits for all weekly partitions, retrains, evaluates, and promotes if metrics improve. Specify:
- Schedule (describe timezone behavior).
- Task graph and dependencies.
- Idempotency approach for partitions and model artifacts.
- Retry, timeout, and alert policies.
- How you would backfill the last 8 weeks safely.
Checklist to compare
- Uses an explicit time zone or UTC with conversion.
- Partitions by logical week, not wall-clock.
- Promotion is conditional on metrics.
- Backfill does not overwrite promoted models without review.
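Two of these checklist items can be sketched in a few lines: deriving a partition key from the logical week rather than the wall clock, and gating promotion on a metric comparison. The function names, the ISO-week label format, and the assumption that a higher metric is better are all illustrative choices, not requirements of the challenge.

```python
from datetime import date


def logical_week(run_date: date) -> str:
    """Return an ISO week label such as '2024-W02' for the week containing run_date."""
    iso = run_date.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def should_promote(candidate_metric: float, current_metric: float, min_gain: float = 0.0) -> bool:
    """Promote only if the candidate beats the current model by more than min_gain.

    Assumes a higher metric is better (e.g., AUC); flip the comparison for losses.
    """
    return candidate_metric > current_metric + min_gain


if __name__ == "__main__":
    print(logical_week(date(2024, 1, 7)))   # 2024-W01 (a Sunday, end of ISO week 1)
    print(logical_week(date(2024, 1, 8)))   # 2024-W02
    print(should_promote(0.91, 0.90))        # True
    print(should_promote(0.90, 0.90))        # False
```

Keying artifacts by the logical week means a rerun or backfill for the same week targets the same location, regardless of when it actually executes.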
Next steps
- Implement the exercises in your environment.
- Add a lightweight data validation step before training.
- Practice a 7-day backfill and confirm results match a fresh run.