Why this skill matters for MLOps Engineers
Orchestration and workflow engines coordinate the end-to-end lifecycle of ML systems: data ingestion, feature computation, training, validation, model registry updates, and batch/stream scoring. As an MLOps Engineer, you use them to make pipelines reliable, observable, and repeatable across dev/staging/prod. Mastering them unlocks on-time model retraining, safe backfills over partitioned data, event-driven triggering, and enforceable SLAs.
Typical responsibilities unlocked by this skill
- Design DAGs/flows for feature and model pipelines
- Set schedules, parameters, retries, timeouts, and SLAs
- Run backfills and manage partitions for historical consistency
- Trigger pipelines on data or model registry events
- Add observability: logs, metrics, alerts, lineage
- Ship the same workflow across dev/staging/prod safely
What you'll learn and build
- Design production-ready DAGs/flows with clear dependencies
- Schedule jobs and execute safe backfills with partitions
- Harden tasks with retries, timeouts, and SLAs
- Parameterize runs and enable event-driven triggers
- Instrument workflows for observability and debugging
- Promote deployments across multiple environments
Practical roadmap
- Pick an engine and set it up: Airflow, Prefect, or Dagster. Run locally, then containerize.
- Model your pipeline as a DAG/flow: Declare tasks, inputs/outputs, and dependencies. Ensure idempotence.
- Scheduling: Add a cron or interval schedule. Decide on catchup, concurrency, and resource limits (see the sketch after this list).
- Safety rails: Configure retries, retry delays, timeouts, and SLAs.
- Parameters: Pass runtime parameters (e.g., run date, model version) and validate them.
- Event triggers: Add sensors or subscriptions for file/object, table, or message events.
- Partitions and backfills: Partition by date or entity; practice backfilling historical ranges.
- Observability: Add structured logs, metrics, alerts, and lineage. Define clear failure messages.
- Multi-environment promotion: Externalize configuration; promote from dev → staging → prod with the same code.
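As flagged in the scheduling step above, catchup and concurrency are easiest to reason about when set explicitly on the DAG. A minimal Airflow 2.x sketch; the DAG id, pool name, and limits are illustrative placeholders, not part of the examples below:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="feature_build",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",
    catchup=False,        # do not auto-run missed intervals on first deploy
    max_active_runs=1,    # one run of this DAG at a time
    max_active_tasks=4,   # cap parallel tasks within a run
) as dag:
    build = BashOperator(
        task_id="build_features",
        bash_command="python scripts/build_features.py --date {{ ds }}",
        pool="heavy_compute",  # pool must exist first (UI or `airflow pools set`)
    )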
Cheat sheet: key defaults to set early
- Catchup behavior (enabled/disabled)
- Max concurrent runs per DAG/flow
- Task-level retries and timeouts
- Default parameter values and validation
- Logging format and destination
- Alerting channels and thresholds
Worked examples
1) Airflow DAG with retries, timeouts, and SLA
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
default_args = {
    "owner": "mlops",
    "retries": 2,
    "retry_delay": timedelta(minutes=2),
    "execution_timeout": timedelta(minutes=20),
    "sla": timedelta(minutes=30),
}
dag = DAG(
    dag_id="daily_model_training",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",  # daily at 03:00
    catchup=False,
    default_args=default_args,
)
extract = BashOperator(
    task_id="extract_features",
    bash_command="python scripts/extract.py --date {{ ds }}",
    dag=dag,
)
train = BashOperator(
    task_id="train_model",
    bash_command="python scripts/train.py --date {{ ds }}",
    dag=dag,
)
evaluate = BashOperator(  # avoid naming this `eval`, which shadows the Python builtin
    task_id="evaluate_model",
    bash_command="python scripts/evaluate.py --date {{ ds }}",
    dag=dag,
)
extract >> train >> evaluate
Why this works
Retries and timeouts prevent stuck tasks. The SLA raises an alert if a task has not completed within 30 minutes of the scheduled run start; it does not kill or retry anything. The {{ ds }} Jinja template passes the logical run date into each script, so a rerun for the same date processes the same slice of data.
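For that idempotence to hold, the called scripts should treat the passed date as the partition to rebuild and overwrite it wholesale. A minimal sketch of what a script like scripts/extract.py might look like; the bucket, paths, and pandas/s3fs dependencies are assumptions, not part of the original example:

import argparse

import pandas as pd  # assumed available in the task image, along with s3fs for s3:// paths


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--date", required=True)  # receives {{ ds }} from the DAG
    args = parser.parse_args()

    # Read only the slice of raw data belonging to this date (hypothetical layout).
    df = pd.read_parquet(f"s3://my-ml-bucket/raw/date={args.date}/")

    # Overwrite the whole output partition: rerunning the task for the same date
    # yields the same result instead of appending duplicates.
    df.to_parquet(f"s3://my-ml-bucket/features/date={args.date}/part-0.parquet")


if __name__ == "__main__":
    main()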
2) Prefect flow with parameters and resilience
from prefect import flow, task, get_run_logger
@task(retries=2, retry_delay_seconds=60, timeout_seconds=300)
def load_data(ds: str):
    logger = get_run_logger()
    logger.info(f"Loading data for {ds}")
    return {"rows": 1000}

@task
def train_model(stats):
    return {"model_version": "v1", "rows": stats["rows"]}

@flow
def training_flow(run_date: str = "2024-01-01"):
    stats = load_data(run_date)
    model = train_model(stats)
    return model

if __name__ == "__main__":
    training_flow()
Notes
- Parameters make runs reproducible.
- In Prefect 2, schedules and triggers are attached via Deployments. Keep code parametric and stateless.
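A minimal sketch of attaching a schedule, assuming a recent Prefect 2 release (2.10+) where flows expose .serve(); the deployment name is a placeholder:

# Replace the direct call in the __main__ block above with .serve() to register
# a deployment that Prefect runs on a schedule.
training_flow.serve(
    name="daily-training",                   # hypothetical deployment name
    cron="0 3 * * *",                        # daily at 03:00
    parameters={"run_date": "2024-01-01"},   # defaults; can be overridden per run
)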
3) Dagster partitioned asset with daily backfills
from dagster import asset, DailyPartitionsDefinition, AssetExecutionContext
partitions_def = DailyPartitionsDefinition(start_date="2024-01-01")
@asset(partitions_def=partitions_def)
def features(context: AssetExecutionContext):
    ds = context.partition_key
    context.log.info(f"Building features for {ds}")
    # write to partitioned storage, e.g., /features/date={ds}/
    # Backfill a range of partitions from the Dagster UI (Backfills) or with the
    # `dagster job backfill` CLI against a partitioned job (see the sketch below).
Why partitions
Partitions isolate runs by date, making backfills and partial reruns safer and faster.
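A hedged sketch of wiring the partitioned asset above into a job and a daily schedule; the job name is illustrative and the API assumes a recent Dagster 1.x release:

from dagster import Definitions, build_schedule_from_partitioned_job, define_asset_job

# Job that materializes the partitioned `features` asset defined above.
features_job = define_asset_job(
    name="features_job",
    selection=["features"],
    partitions_def=partitions_def,
)

# Schedule that runs each daily partition once its day has passed.
features_schedule = build_schedule_from_partitioned_job(features_job)

defs = Definitions(
    assets=[features],
    jobs=[features_job],
    schedules=[features_schedule],
)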
4) Safe Airflow backfill
# Ensure your DAG is idempotent and depends_on_past is set only if needed.
# Example command:
airflow dags backfill -s 2024-01-01 -e 2024-01-05 daily_model_training
# Tips:
# - Disable side effects on rerun (e.g., use overwrite or upsert)
# - Control concurrency with pools and the DAG's max_active_runs so the backfill doesn't overwhelm workers
# - Use --reset-dagruns to clear existing run states before rerunning a range
5) Event-driven: trigger on new object in storage (Airflow)
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
wait_for_data = S3KeySensor(
    task_id="wait_for_data",
    bucket_key="incoming/{{ ds }}/data.parquet",
    bucket_name="my-ml-bucket",
    poke_interval=60,   # check for the key once a minute
    timeout=60 * 30,    # give up (and fail) after 30 minutes
    soft_fail=False,    # a timeout marks the task failed, not skipped
    dag=dag,
)
wait_for_data >> extract
When to use sensors
Use event sensors when you must wait for external data arrival. Prefer deferrable sensors or asynchronous patterns to avoid resource waste.
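For example, a deferrable variant of the sensor above, assuming a version of the Amazon provider that supports deferrable mode and a running triggerer process:

# Deferrable: the sensor frees its worker slot while waiting and is resumed
# by the triggerer when the key appears.
wait_for_data = S3KeySensor(
    task_id="wait_for_data",
    bucket_key="incoming/{{ ds }}/data.parquet",
    bucket_name="my-ml-bucket",
    deferrable=True,
    timeout=60 * 30,
    dag=dag,
)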
6) Environment-aware config
import os
ENV = os.getenv("ENV", "dev")
DATA_BUCKET = {
    "dev": "ml-dev-bucket",
    "staging": "ml-staging-bucket",
    "prod": "ml-prod-bucket",
}[ENV]
# Use DATA_BUCKET inside tasks/operators to route IO safely per environment.
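A minimal sketch of routing IO through that value in an Airflow task; the script path is a placeholder and `dag` is whatever DAG object the task belongs to:

from airflow.operators.bash import BashOperator

# The bucket is resolved from ENV when the DAG file is parsed; the script stays
# environment-agnostic and simply reads $DATA_BUCKET at runtime.
score = BashOperator(
    task_id="batch_score",
    bash_command="python scripts/score.py --date {{ ds }}",
    env={"DATA_BUCKET": DATA_BUCKET, "ENV": ENV},
    append_env=True,  # Airflow 2.3+: keep the rest of the worker's environment
    dag=dag,
)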
Drills and exercises
- Create a simple 3-task DAG/flow with explicit dependencies
- Add retries=2, timeout=5m, and an SLA of 15m
- Parameterize the run date and pass it to all tasks
- Add a daily schedule at 03:00 and disable catchup
- Implement a daily partition and backfill the last 7 days
- Emit structured logs including run_id and partition key (see the sketch after this list)
- Make your tasks idempotent (safe reruns)
- Externalize config for dev/staging/prod
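For the structured-logging drill, one engine-agnostic sketch is to emit a single JSON object per event; the field names are just a suggested convention:

import json
import logging

logger = logging.getLogger("pipeline")

def log_event(event: str, run_id: str, partition: str, **fields):
    # One JSON object per line keeps logs grep-able and easy to ship to a log
    # aggregator for alerting, dashboards, and lineage queries.
    logger.info(json.dumps({"event": event, "run_id": run_id, "partition": partition, **fields}))

# Usage: log_event("train_finished", run_id="manual__2024-01-05", partition="2024-01-05", auc=0.91)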
Common mistakes
- Non-idempotent tasks causing duplicated data on retries/backfills
- Overusing depends_on_past, causing deadlocks after failures
- Forgetting timeouts, resulting in stuck workers
- Partitions that are too coarse (long, expensive reruns) or too fine-grained (so many runs that the scheduler is overwhelmed)
- Hardcoding environment-specific paths/keys
- Lack of alerts: failures discovered late
Debugging tips
Fast triage checklist
- Check logs for the first failing task; fix root cause before rerunning downstream
- Verify parameters and dates passed to tasks
- Reproduce locally with the same inputs
- Clear failed task state only if the task is idempotent
- For backfills: limit concurrency and validate partitions exist
SLA vs timeout
A timeout kills a task run that exceeds its limit (the task can then retry, if retries are configured). An SLA measures task or end-to-end latency against a target and alerts when breached; it does not kill or retry anything.
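In Airflow, both signals can feed alerts through callbacks. A minimal sketch, assuming notify() is your own hook into a channel such as Slack, PagerDuty, or email:

def notify(message: str):
    print(message)  # placeholder: send to your alerting channel of choice

def on_failure(context):
    ti = context["task_instance"]
    notify(f"Task failed: {ti.dag_id}.{ti.task_id} run={context['run_id']}")

def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    notify(f"SLA missed in {dag.dag_id}: {task_list}")

# Wire them up on the DAG:
# DAG(..., default_args={"on_failure_callback": on_failure}, sla_miss_callback=on_sla_miss)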
Mini project: Daily retraining with safe backfills
Goal: Build a daily training pipeline that ingests features, trains a model, evaluates it, and writes metrics and the model artifact, supporting backfills and environment promotion.
- Create a DAG/flow with extract → train → evaluate → register steps
- Add parameters: run_date, model_name
- Set retries (2), timeouts (per task), and SLA (end-to-end)
- Implement daily partitions and a 14-day backfill
- Add logs with run_id, partition, and model metrics
- Externalize config for dev/staging/prod (buckets/paths)
Acceptance criteria
- Rerunning any partition is safe (no duplicates)
- Backfill 7 days completes within your configured concurrency
- Failure alerts include partition and task name
- Changing ENV switches IO locations without code changes
Subskills
This skill is composed of the following subskills:
- DAG Design And Dependencies
- Scheduling Backfills Partitions
- Retries Timeouts SLAs
- Parameterized Runs
- Event Driven Triggers Basics
- Observability For Workflows
- Managing Multi Environment Deployments
Who this is for
MLOps Engineers, ML Engineers, and Data Engineers who need reliable, observable, and maintainable ML pipelines across multiple environments.
Prerequisites
- Comfortable with Python scripting
- Basic understanding of ML pipelines (data → features → train → evaluate → deploy)
- Familiarity with containers and environment variables
Learning path
- Model a simple ETL DAG/flow and run it on a schedule
- Add parameters, retries, timeouts, and SLAs
- Introduce partitions and practice backfills
- Add event-driven triggers for new data
- Instrument logs/metrics and set alerts
- Promote the same workflow across dev/staging/prod
Practical projects
- Batch scoring pipeline with daily partitions and SLA
- Feature build pipeline triggered by upstream data arrival
- Weekly retraining with evaluation gates and rollback
Next steps
Complete the drills, build the mini project, then take the skill exam. Everyone can take the exam for free; only logged-in users will have their progress saved.