Who this is for
- MLOps Engineers who deploy the same pipeline across datasets, environments, or model versions.
- Data Scientists who want reproducible experiments with different hyperparameters.
- Data/Platform Engineers standardizing pipelines across teams.
Prerequisites
- Basic Python and familiarity with at least one workflow engine (e.g., Airflow, Prefect, Dagster, Kubeflow).
- Comfort with YAML/JSON for configs.
- Understanding of environment separation (dev/stage/prod) and basic secrets handling.
Why this matters
In real MLOps work, you rarely write a brand-new pipeline for each case. Instead, you reuse one pipeline and pass parameters to change inputs, dates, hyperparameters, environments, or output locations. Parameterized runs:
- Eliminate copy-paste pipelines and reduce maintenance.
- Enable safe promotion from dev to prod by flipping parameters (e.g., connections, buckets, feature toggles).
- Power backfills, A/B evaluations, and hyperparameter sweeps.
- Improve reproducibility: every run records the exact parameters that produced the result.
Concept explained simply
A parameterized run is a single pipeline that behaves differently based on inputs you pass at runtime, like dataset=orders, date=2025-12-01, or n_estimators=200. The code stays the same; the behavior changes.
Mental model
- Inputs: parameters come from CLI, UI, API, schedules, or files.
- Defaults: reasonable defaults make ad-hoc runs easy.
- Validation: reject bad parameters early (types, ranges, enums).
- Idempotency: same parameters → same outputs. Include parameters in artifact paths and run IDs.
- Lineage: log parameters with the run so results are traceable.
- Security: secrets are not regular parameters—use secret stores/engine-native secret management.
Core patterns you should know
Pattern 1 — Typed schema and validation
Define expected types, enums, and ranges. Fail fast with meaningful errors if parameters are invalid.
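A minimal, engine-agnostic sketch of this pattern using only the standard library; the RunParams class, field names, and ranges are illustrative assumptions, not any engine's API:

from dataclasses import dataclass
from datetime import date

ALLOWED_DATASETS = {"orders", "customers"}

@dataclass(frozen=True)
class RunParams:
    dataset: str = "orders"
    as_of: str = "2025-12-01"   # ISO date string
    n_estimators: int = 200

    def validate(self) -> "RunParams":
        # Fail fast with a message that names the bad parameter and the allowed values.
        if self.dataset not in ALLOWED_DATASETS:
            raise ValueError(f"dataset must be one of {sorted(ALLOWED_DATASETS)}, got {self.dataset!r}")
        date.fromisoformat(self.as_of)  # raises ValueError if not YYYY-MM-DD
        if not (1 <= self.n_estimators <= 10_000):
            raise ValueError(f"n_estimators out of range: {self.n_estimators}")
        return self

params = RunParams(dataset="customers", as_of="2025-12-02").validate()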
Pattern 2 — Idempotent run IDs and artifact names
Derive a run_id from a stable hash of parameters. Include parameters in output paths like s3://bucket/model={model}/date={date}/.
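A short sketch of deriving a stable run_id and embedding parameters in the output path; the bucket name and file name are placeholders:

import hashlib, json

def run_id(params: dict) -> str:
    # Canonical JSON with sorted keys so the same parameters always hash to the same ID.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

params = {"model": "v1", "dataset": "orders", "date": "2025-12-01"}
rid = run_id(params)
# Parameters and run_id in the path keep reruns idempotent and traceable.
artifact_path = f"s3://bucket/model={params['model']}/date={params['date']}/run={rid}/model.pkl"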
Pattern 3 — Fan-out (matrix) runs
Generate multiple child runs for a grid of parameters (e.g., hyperparameter sweep). Keep per-child run IDs and aggregate results at the end.
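One way a parameter grid could be expanded into per-child parameter sets; triggering the child runs and scoring them are stubbed out here:

import itertools

grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 8],
}

# One child run per combination; each child gets its own parameter dict (and run_id).
child_runs = [dict(zip(grid.keys(), values)) for values in itertools.product(*grid.values())]

results = []
for params in child_runs:
    # In a real engine this would trigger a child pipeline run and collect its metrics.
    results.append({**params, "score": None})

# Aggregate at the end, e.g. rank by score and write a leaderboard artifact.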
Pattern 4 — Parameter sources
- CLI flags (local dev), UI forms (ad-hoc runs), APIs (automation), schedules (recurring runs with default parameters), and config files.
- Never hardcode environment-specific values in code; pass them as parameters or resolve via environment configs.
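A sketch of one possible precedence order (code defaults, then config file, then environment variables, then CLI flags); the PIPELINE_ prefix and JSON config format are assumptions for illustration:

import argparse, json, os
from pathlib import Path

DEFAULTS = {"dataset": "orders", "as_of": "2025-12-01", "env": "dev"}

def resolve_params(config_file: str | None = None) -> dict:
    params = dict(DEFAULTS)                                        # 1. code defaults
    if config_file and Path(config_file).exists():
        params.update(json.loads(Path(config_file).read_text()))   # 2. config file
    for key in params:                                             # 3. environment variables
        env_val = os.environ.get(f"PIPELINE_{key.upper()}")
        if env_val is not None:
            params[key] = env_val
    parser = argparse.ArgumentParser()
    for key, default in params.items():                            # 4. CLI flags win last
        parser.add_argument(f"--{key}", default=default)
    return vars(parser.parse_args())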
Worked examples
Example 1 — Airflow DAG with params and dag_run.conf
from airflow.decorators import dag, task
from airflow.models.param import Param
from datetime import datetime
import hashlib

@dag(
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
    params={
        "dataset": Param("orders", enum=["orders", "customers"]),
        "as_of": Param("2025-12-01", type="string"),
        "model_version": Param("v1", type="string"),
    },
)
def train_pipeline():
    @task
    def resolve_params(**context):
        # Prefer values passed via dag_run.conf; fall back to the declared param defaults.
        conf = context["dag_run"].conf or {}
        p = {k: conf.get(k, context["params"][k]) for k in ["dataset", "as_of", "model_version"]}
        # Short, stable hash of the parameters doubles as an idempotent run ID.
        rid = hashlib.sha1(str(sorted(p.items())).encode()).hexdigest()[:10]
        return {**p, "run_id": rid}

    @task
    def load_data(p):
        print(f"Loading {p['dataset']} as_of={p['as_of']}")
        return {"rows": 123}

    @task
    def train(p, data):
        # Parameters and run_id in the artifact path keep reruns traceable and idempotent.
        path = f"s3://ml/artifacts/{p['model_version']}/dataset={p['dataset']}/date={p['as_of']}/run={p['run_id']}"
        print(f"Training and saving to {path}")
        return {"model_path": path}

    p = resolve_params()
    d = load_data(p)
    _ = train(p, d)

pipeline = train_pipeline()
Trigger via UI or API; pass overrides in dag_run.conf to reuse the same DAG.
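For example, a sketch of triggering the same DAG with overrides through Airflow's stable REST API, assuming the API and basic auth are enabled; the host, credentials, and values below are placeholders:

import requests

resp = requests.post(
    "http://localhost:8080/api/v1/dags/train_pipeline/dagRuns",
    auth=("admin", "admin"),  # placeholder credentials
    json={"conf": {"dataset": "customers", "as_of": "2025-12-02", "model_version": "v2"}},
)
resp.raise_for_status()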
Example 2 — Prefect flow with deployment parameters
from prefect import flow, task
from datetime import date
import hashlib

@task
def load(dataset: str, as_of: str):
    print(f"Load {dataset} as_of={as_of}")
    return [1, 2, 3]

@task
def train(model_version: str, data):
    print(f"Train {model_version} on {len(data)} rows")

@flow
def ml_pipeline(dataset: str = "orders", as_of: str | None = None, model_version: str = "v1"):
    # Default as_of to today so ad-hoc runs need no arguments.
    as_of = as_of or date.today().isoformat()
    # Stable short hash of the parameters for idempotent artifact naming and lineage.
    rid = hashlib.sha1(f"{dataset}-{as_of}-{model_version}".encode()).hexdigest()[:8]
    data = load(dataset, as_of)
    train(model_version, data)
    print(f"run_id={rid}")

if __name__ == "__main__":
    ml_pipeline(dataset="customers", as_of="2025-12-01", model_version="v2")
In a Prefect Deployment, set default parameters per environment; override at run time via UI/API.
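A sketch of an API-style override using Prefect's run_deployment helper, assuming a deployment for this flow already exists; the "ml-pipeline/prod" name is a placeholder:

from prefect.deployments import run_deployment

run_deployment(
    name="ml-pipeline/prod",  # placeholder deployment name
    parameters={"dataset": "customers", "as_of": "2025-12-02", "model_version": "v2"},
)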
Example 3 — Dagster job with config and partitioned runs
from dagster import Config, job, op

class TrainConfig(Config):
    dataset: str = "orders"
    as_of: str = "2025-12-01"
    model_version: str = "v1"

@op
def load_op(config: TrainConfig):
    print(f"Load {config.dataset} @ {config.as_of}")
    return {"rows": 111}

@op
def train_op(config: TrainConfig, data):
    print(f"Train {config.model_version} on {data['rows']} rows")

@job
def train_job():
    d = load_op()
    train_op(d)

# Provide run config at execution time, e.g.:
# run_config = {"ops": {
#     "load_op": {"config": {"dataset": "customers", "as_of": "2025-12-02", "model_version": "v2"}},
#     "train_op": {"config": {"dataset": "customers", "as_of": "2025-12-02", "model_version": "v2"}},
# }}
# train_job.execute_in_process(run_config=run_config)
Use partitioned configs for backfills (e.g., per-date partitions) to fan out runs automatically.
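A sketch of a daily partitioned config, assuming Dagster's daily_partitioned_config decorator and reusing load_op and train_op from Example 3; the start date is illustrative. Each partition's start date becomes the as_of parameter for that run:

from dagster import daily_partitioned_config, job
from datetime import datetime

@daily_partitioned_config(start_date=datetime(2025, 11, 1))
def daily_train_config(start: datetime, _end: datetime):
    # One partition per day; the partition's date drives the as_of parameter.
    cfg = {"dataset": "orders", "as_of": start.strftime("%Y-%m-%d"), "model_version": "v1"}
    return {"ops": {"load_op": {"config": cfg}, "train_op": {"config": cfg}}}

@job(config=daily_train_config)
def daily_train_job():
    d = load_op()
    train_op(d)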
Set up your first parameterized run (step-by-step)
- List your variable inputs: data source, date, environment, hyperparameters, output paths.
- Define defaults and a validation schema (types, ranges, enums).
- Propagate parameters to every task that needs them—avoid global state.
- Derive an idempotent run_id from parameters; include it in artifact paths.
- Log parameters in your run metadata and in model cards/metrics.
- Test with both defaults and overrides (CLI/UI/API).
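A compact, engine-agnostic sketch of those steps strung together; the output root and the print-based metadata logging are stand-ins for your storage and experiment tracker:

import hashlib, json

def run(dataset: str = "orders", as_of: str = "2025-12-01", n_estimators: int = 200):
    params = {"dataset": dataset, "as_of": as_of, "n_estimators": n_estimators}
    # Validate early with a clear error.
    if dataset not in {"orders", "customers"}:
        raise ValueError(f"unknown dataset: {dataset!r}")
    # Idempotent run_id derived from the full parameter set.
    rid = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()[:12]
    # Parameters and run_id in the artifact path; "artifacts/" is a placeholder root.
    out = f"artifacts/dataset={dataset}/date={as_of}/run={rid}/model.bin"
    # Log the full parameter set alongside the run (stand-in for run metadata / lineage).
    print(json.dumps({"run_id": rid, "params": params, "artifact": out}))
    return out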
Common mistakes and how to self-check
- Missing validation: Add clear errors when values are out of range or wrong type.
- Implicit globals: Ensure tasks read from passed parameters, not module-level variables.
- Non-idempotent outputs: Include parameters in output paths; rerunning should not overwrite unrelated artifacts.
- Leaking secrets: Pass secret references via your engine's secret store; never as plain-text parameters.
- Poor lineage: Always log the full parameter set with the run results.
Self-check: Given the same parameters, do you get the same outputs and paths? Can you reconstruct which parameters produced a model from your metadata alone?
Hands-on exercises
Work through these in your preferred orchestration engine.
Exercise 1: Single-run parameterization
- Create a pipeline with parameters: dataset (orders/customers), as_of (YYYY-MM-DD), model_version (v1/v2).
- Print all parameters and create a run_id from them (e.g., short hash).
- Write outputs to a path that includes model_version, dataset, as_of, and run_id.
Exercise 2: Validation and idempotency
- Add validation: dataset must be one of [orders, customers], as_of must parse as a date, model_version not empty.
- Fail fast with a clear message if invalid.
- Prove idempotency: running the same parameters twice yields the same run_id and same artifact path.
Checklist
- Parameters have defaults and validation.
- Artifacts include parameters and run_id in their paths.
- Runs log full parameter sets in metadata.
- No secrets are passed as plain parameters.
Practical projects
- Backfill project: Fan-out runs for the last 14 days using a date parameter; aggregate success metrics at the end.
- Hyperparameter sweep: Run grid search over two parameters and produce a leaderboard artifact.
- Blue/green deploy: Use an environment parameter to write to separate buckets and compare model KPIs.
Learning path
- Define a simple pipeline with hardcoded values.
- Introduce parameters with defaults and validation.
- Add idempotent run IDs and lineage logging.
- Implement fan-out (matrix) runs and result aggregation.
- Parameterize environment-specific configs (dev/stage/prod).
Next steps
- Integrate parameters with your experiment tracker to log configs alongside metrics.
- Create a library module to standardize parameter parsing, validation, and run_id generation across pipelines.
- Automate backfills via schedules that set default parameter windows.
Mini challenge
Design a parameter set to safely promote a model from staging to production without code changes. Include parameters for input source, output path, model_version, and any feature flag needed. Describe how you would validate, log, and derive run_id.