Who this is for
MLOps engineers who want to reuse the same training or batch inference pipeline across datasets, environments, and schedules without rewriting code.
- You build recurring training or batch jobs (nightly, weekly, ad-hoc).
- You need consistent runs across dev/staging/prod with minimal changes.
- You care about reproducibility, drift control, and cost-safe runs.
Prerequisites
- Basic Python for scripting pipelines.
- Familiarity with a workflow tool (e.g., Airflow, Prefect, or Kubeflow Pipelines).
- Comfort with YAML/JSON configuration files.
Why this matters
Real MLOps tasks rely on templates and parameters to keep pipelines maintainable:
- Spin up a training job for multiple regions or customers by changing only a config file.
- Promote the same pipeline from dev to prod with proven defaults and safe overrides.
- Create reproducible model builds using locked params (seeds, image tags, dataset versions).
- Run large backfills in batches by varying time windows without editing code.
Concept explained simply
A pipeline template is a reusable recipe for your ML workflow (steps, wiring, I/O). Parameterization gives that template adjustable knobs (e.g., dataset path, hyperparameters, compute size). You keep logic in one place and change behavior via parameters.
Mental model
Imagine a cookie cutter (template) and different doughs (parameters). The cutter defines the shape (steps and order). The dough defines taste and texture (data paths, seeds, resources). You get consistent shape with flexible outcomes.
Quick glossary
- Template: Fixed structure of tasks and data flow.
- Parameters: Named values injected at run time or deploy time.
- Defaults: Safe baseline values when nothing is provided.
- Overrides: Environment- or run-specific values that replace defaults.
- Artifacts: Outputs (metrics, model files) tagged by parameter values.
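In plain Python, the idea looks like this (a toy illustration, not tied to any specific orchestrator): the function body is the template, the arguments are the parameters, and the keyword defaults are the safe baseline.

# Toy illustration: the function body is the fixed template,
# the keyword arguments are the adjustable knobs.
def run_training(dataset_uri: str, epochs: int = 5, seed: int = 42) -> str:
    print(f"Training on {dataset_uri} for {epochs} epochs (seed={seed})")
    return f"model_seed{seed}_ep{epochs}.bin"

run_training("s3://ml/dev/data.csv")                  # defaults only
run_training("s3://ml/prod/data.parquet", epochs=20)  # explicit override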
Worked examples
Example 1: Airflow DAG factory with config
Pattern: one Python factory that builds a DAG from a dictionary/YAML.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
DEFAULTS = {
    "dataset_uri": "s3://ml/dev/data.csv",
    "epochs": 5,
    "seed": 42,
    "resources": {"cpu": 2, "mem_gb": 4},
}

def make_dag(name, params):
    # Merge run-specific overrides on top of safe defaults.
    cfg = {**DEFAULTS, **(params or {})}

    def preprocess(**context):
        # read cfg['dataset_uri'], write to staging
        pass

    def train(**context):
        # use cfg['epochs'], cfg['seed']
        pass

    with DAG(
        dag_id=name,
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
        params=cfg,  # attach merged config for visibility in the UI
    ) as dag:
        t1 = PythonOperator(task_id="preprocess", python_callable=preprocess)
        t2 = PythonOperator(task_id="train", python_callable=train)
        t1 >> t2

    return dag

# Instantiate DAGs (dev/prod) without duplicating code
DEV_DAG = make_dag("train_dev", {"dataset_uri": "s3://ml/dev/data.csv"})
PROD_DAG = make_dag("train_prod", {"dataset_uri": "s3://ml/prod/data.parquet", "epochs": 20})
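The dictionaries above can just as easily come from YAML files. A minimal sketch of a YAML-driven variant that would live in the same DAG file (the configs/ directory layout and file names are assumptions, and PyYAML is assumed to be installed):

import glob
import os

import yaml  # assumes PyYAML is installed

# Hypothetical layout: configs/train_dev.yaml, configs/train_prod.yaml
for path in glob.glob("configs/train_*.yaml"):
    with open(path) as f:
        overrides = yaml.safe_load(f) or {}
    dag_id = os.path.splitext(os.path.basename(path))[0]
    # Register each generated DAG at module level so Airflow discovers it.
    globals()[dag_id] = make_dag(dag_id, overrides)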
What to check
- Parameters live outside task functions (pass via cfg or DAG params).
- Dev and prod DAGs differ only by params.
- No hard-coded paths inside tasks.
Example 2: Kubeflow Pipelines with typed pipeline parameters
Pattern: define a pipeline once; pass parameters per run or per recurring schedule.
from kfp import dsl

@dsl.component
def preprocess(dataset_uri: str) -> str:
    # Download/clean and return path to processed data
    return dataset_uri + ":processed"

@dsl.component
def train(data_path: str, epochs: int, seed: int) -> str:
    # Train and return model URI
    return f"gs://models/model_ep{epochs}_seed{seed}.bin"

@dsl.pipeline(name="training-template")
def training_pipeline(dataset_uri: str, epochs: int = 5, seed: int = 42):
    # KFP v2 components must be called with keyword arguments.
    p = preprocess(dataset_uri=dataset_uri)
    m = train(data_path=p.output, epochs=epochs, seed=seed)
    # At run time (UI/CLI), set dataset_uri, epochs, seed as needed.
What to check
- All variability is expressed as pipeline parameters.
- Reasonable defaults make ad-hoc runs easy and safe.
- Outputs include parameter values in names/tags for traceability.
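Per-run parameters can be set in the UI, but they can also be passed programmatically. A minimal sketch, assuming KFP v2 and a placeholder host URL (client method names can differ slightly across KFP releases):

from kfp import Client, compiler

# Compile the template once into a reusable package.
compiler.Compiler().compile(
    pipeline_func=training_pipeline,
    package_path="training_template.yaml",
)

# Launch runs of the same package with different parameter values.
client = Client(host="https://kfp.example.com")  # placeholder host
client.create_run_from_pipeline_package(
    "training_template.yaml",
    arguments={"dataset_uri": "gs://ml/prod/data.parquet", "epochs": 20, "seed": 7},
)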
Example 3: Prefect deployment with parameters
Pattern: one flow; different deployments or runs with params.
from prefect import flow, task

@task
def preprocess(dataset_uri: str) -> str:
    return dataset_uri + ":processed"

@task
def train(data_path: str, epochs: int, seed: int) -> str:
    return f"file:///models/model_{epochs}_{seed}.pt"

@flow
def training_flow(dataset_uri: str, epochs: int = 5, seed: int = 42):
    p = preprocess(dataset_uri)
    return train(p, epochs, seed)

# Run with different parameters
if __name__ == "__main__":
    training_flow(dataset_uri="s3://ml/dev/data.csv")
    training_flow(dataset_uri="s3://ml/prod/data.parquet", epochs=20, seed=2024)
What to check
- Flow code is unchanged between runs.
- Parameters are explicit and typed.
- Seeds are parameters for reproducible experiments.
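Beyond ad-hoc runs, recent Prefect releases let the same flow back multiple deployments that differ only in their default parameters. A sketch, assuming Prefect 2.13+ or 3.x (schedules and work pools omitted):

from prefect import serve

if __name__ == "__main__":
    dev = training_flow.to_deployment(
        name="train-dev",
        parameters={"dataset_uri": "s3://ml/dev/data.csv"},
    )
    prod = training_flow.to_deployment(
        name="train-prod",
        parameters={"dataset_uri": "s3://ml/prod/data.parquet", "epochs": 20},
    )
    serve(dev, prod)  # one flow definition, two parameterized deployments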
Step-by-step: Template a training pipeline
- List variability: data locations, time windows, seeds, hyperparameters, resources, image/version.
- Create defaults: choose safe values that run quickly and cheaply (e.g., small epochs, sample dataset).
- Expose parameters: add function args or pipeline params for every variable item.
- Separate config: keep env-specific overrides in YAML/JSON files (dev.yaml, prod.yaml).
- Inject params: use a loader/merger to combine defaults with environment overrides at runtime (a sketch of this step and the guardrails step follows this list).
- Tag artifacts: include key params (dataset version, seed, code hash) in model and metric metadata.
- Dry-run and guardrails: validate parameters (ranges, existence of paths) before execution.
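A minimal sketch of the "separate config", "inject params", and "dry-run and guardrails" steps, assuming PyYAML and hypothetical dev.yaml/prod.yaml files (the validation rules are examples, not a standard):

import argparse

import yaml  # assumes PyYAML is installed

DEFAULTS = {"dataset_uri": "s3://ml/dev/sample.csv", "epochs": 5, "seed": 42}

def validate(cfg: dict) -> None:
    # Fail fast, before any expensive step.
    assert 1 <= cfg["epochs"] <= 200, "epochs must be in [1, 200]"
    assert str(cfg["dataset_uri"]).startswith(("s3://", "gs://")), "unexpected dataset URI"

def load_config(env: str) -> dict:
    # Merge environment overrides (e.g., dev.yaml, prod.yaml) on top of defaults.
    with open(f"{env}.yaml") as f:
        overrides = yaml.safe_load(f) or {}
    cfg = {**DEFAULTS, **overrides}
    validate(cfg)
    return cfg

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", default="dev")
    args = parser.parse_args()
    cfg = load_config(args.env)
    print("Final merged config:", cfg)  # log the effective configuration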
Validation checklist
- All hard-coded values reviewed and replaced by parameters or constants on purpose.
- Sensible defaults exist; production overrides are explicit.
- Parameters validated before any expensive step.
- Artifacts and logs include parameter values.
Exercises
Do these now. They mirror the patterns shown above, and you can check the solutions if you get stuck.
Exercise 1: Parameterize a training job spec
Refactor an unparameterized YAML job into a template with defaults and per-environment overrides.
- ID: ex1
- Goal: Replace hard-coded dataset path, epochs, seed, and resources with parameters.
Starter YAML:
job:
  name: nightly-train
  image: ml/train:1.0.0
  command: ["python", "train.py"]
  env:
    DATASET_URI: s3://company/prod/data.parquet
    EPOCHS: "20"
    SEED: "123"
  resources:
    cpu: 8
    mem_gb: 32
Deliverables:
- A template YAML with placeholders and safe defaults.
- An example override snippet for prod.
Exercise 2: Write a pipeline factory function
Create a Python function make_training_pipeline(config) that builds a pipeline/flow using parameters from a dict. Demonstrate two runs (dev and prod) with different inputs but the same code.
- ID: ex2
- Goal: Prove that logic stays constant while behavior changes via config.
Submission checklist
- Parameters are explicit and validated.
- Defaults exist and are conservative.
- No hard-coded environment paths remain.
- Output names or metadata include key parameter values.
Common mistakes and self-check
- Hard-coding paths or secrets inside code. Fix: load from parameters or secret stores; never bake secrets in templates.
- Missing defaults. Fix: add safe defaults that enable quick local runs.
- Unvalidated parameters. Fix: assert ranges/types; fail fast before costly steps.
- Non-reproducible randomness. Fix: expose seed as a parameter and log it.
- Silent overrides. Fix: log the final merged config; print a diff vs defaults (see the sketch after this list).
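For the last point, a small illustrative helper (names are hypothetical) that logs only the values that differ from the defaults:

def config_diff(defaults: dict, final: dict) -> dict:
    # Return only the keys whose final value differs from the default.
    return {k: v for k, v in final.items() if defaults.get(k) != v}

defaults = {"dataset_uri": "s3://ml/dev/sample.csv", "epochs": 5, "seed": 42}
final = {**defaults, "dataset_uri": "s3://ml/prod/data.parquet", "epochs": 20}
print("Overrides vs defaults:", config_diff(defaults, final))
# -> {'dataset_uri': 's3://ml/prod/data.parquet', 'epochs': 20}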
Self-check prompts
- Can you run the same template in dev and prod by changing only a config file?
- Does the pipeline print a single, merged, final config at start?
- Can you re-run a previous job exactly by reusing the same parameters?
Practical projects
- Multi-tenant training: One template trains models for three clients by swapping dataset prefixes and output locations.
- Backfill pipeline: Run the same batch inference for the last 12 months by parameterizing a date window.
- Promotion flow: A single template that can run with dev defaults and a prod override file; includes a dry-run parameter that skips training and validates inputs only.
Learning path
- Start: Pipeline templates and parameterization (this page).
- Next: Scheduling and backfills; artifact versioning; feature store integration.
- Then: CI/CD for pipelines; canary training; governance and approvals.
Next steps
- Complete the exercises above and compare with the solutions.
- Take the quick test below to lock in the concepts.
- Apply the template pattern to one of your existing jobs this week.
Mini challenge
Design a single pipeline template that can: (1) train with a sampled dataset in dev, (2) train full-scale in prod, and (3) run batch inference for a custom date range. List the parameters you would add, their defaults, and one guardrail validation for each (e.g., epochs must be 1–200, date range max 31 days). Keep it to 10 parameters or fewer.