Why this matters
- [ ] Define IDs for source snapshot, feature set, and model.
- [ ] Include a stable business key + event_time for joining.
- [ ] Show an example row with fields needed for traceability.
Solution

```json
{
  "ids": {
    "source_snapshot_id": "src_txn_2025_12_15",
    "feature_set_id": "fs_txn_roll7_v3_2025_12_15",
    "model_id": "mdl_fraud_lgbm_v4"
  },
  "join_keys": ["account_id", "event_time"],
  "prediction_row": {
    "account_id": "A12345",
    "event_time": "2025-12-16T10:00:00Z",
    "feature_set_id": "fs_txn_roll7_v3_2025_12_15",
    "model_id": "mdl_fraud_lgbm_v4",
    "model_version_hash": "h_model_2233",
    "source_snapshot_id": "src_txn_2025_12_15"
  }
}
```

Exercise checklist
- [ ] Every artifact has an ID.
- [ ] Parents are explicitly listed.
- [ ] Code commit/environment recorded.
- [ ] Time windows and seeds captured.
- [ ] Data/model hashes present.
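The checklist above can be sketched as a small logging helper. Everything here — the function name, field layout, and the git/`sys.version` capture — is illustrative, not a fixed schema; adapt it to your own stack:

```python
import json
import subprocess
import sys
from datetime import datetime, timezone

def build_lineage_record(artifact_id, parents, seed, window_start, window_end):
    """Assemble a lineage record covering each checklist item."""
    try:
        # Code commit recorded; falls back gracefully outside a git checkout.
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"
    return {
        "artifact_id": artifact_id,                 # every artifact has an ID
        "parents": parents,                         # parents explicitly listed
        "code_commit": commit,                      # code version recorded
        "python_version": sys.version.split()[0],   # minimal environment pin
        "time_window": {"start": window_start, "end": window_end},
        "random_seed": seed,                        # seed captured
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

record = build_lineage_record(
    "fs_txn_roll7_v3_2025_12_15",
    parents=["src_txn_2025_12_15"],
    seed=42,
    window_start="2025-12-08T00:00:00Z",
    window_end="2025-12-15T00:00:00Z",
)
print(json.dumps(record, indent=2))
```

In practice you would replace the `python_version` field with a full lockfile or container digest, but even this minimal record lets you answer "which parents, which commit, which window, which seed" for any artifact.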
Common mistakes and self-check
- Only versioning code, not data snapshots. Self-check: Can you point to the exact rows used? If not, the snapshot is missing.
- Relying on table name without date/version. Self-check: If the table changed today, could you still rebuild yesterday's features?
- Not recording label lineage. Self-check: Does your model lineage include label source ID and definition?
- Skipping environment pinning. Self-check: Can you rebuild the env from a lockfile? If not, pin it.
- Using unstable joins (e.g., surrogate row_number). Self-check: Do you use business keys and event_time?
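One way to pass both the hash self-check and the stable-join self-check at once is a deterministic content hash that sorts rows by business keys before hashing. This is a minimal sketch with hypothetical field names; real snapshots would hash a serialized file or table export:

```python
import hashlib
import json

def snapshot_hash(rows, key_fields=("account_id", "event_time")):
    """Deterministic content hash of a data snapshot.

    Rows are sorted by stable business keys (never row_number), so the
    same data always yields the same hash regardless of load order.
    """
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in key_fields))
    payload = json.dumps(ordered, sort_keys=True).encode("utf-8")
    return "h_" + hashlib.sha256(payload).hexdigest()[:12]

rows_a = [
    {"account_id": "A12345", "event_time": "2025-12-16T10:00:00Z", "amount": 10.0},
    {"account_id": "A00001", "event_time": "2025-12-16T09:00:00Z", "amount": 5.0},
]
rows_b = list(reversed(rows_a))  # same data, different load order
assert snapshot_hash(rows_a) == snapshot_hash(rows_b)
```

If the hash recorded in a lineage record no longer matches a recomputed hash, the underlying table changed — exactly the "could you still rebuild yesterday's features?" failure above.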
Practical projects
- Personal lineage logger: Instrument a small pipeline to emit lineage JSON files per job. Store them in a folder and query with simple scripts.
- Drift investigation drill: Intentionally change a feature parameter and use lineage to identify the exact change that altered predictions.
- Rollback dry-run: Given a model_id, follow parents to rebuild the training dataset and retrain. Verify hashes match.
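For the rollback dry-run, "follow parents" can be as simple as a transitive walk over stored records. The in-memory index below is a stand-in, under assumed IDs, for a folder of lineage JSON files loaded into a dict:

```python
# Hypothetical lineage index: artifact_id -> record with a "parents" list.
lineage = {
    "mdl_fraud_lgbm_v4": {"parents": ["fs_txn_roll7_v3_2025_12_15",
                                      "lbl_fraud_2025_12_15"]},
    "fs_txn_roll7_v3_2025_12_15": {"parents": ["src_txn_2025_12_15"]},
    "lbl_fraud_2025_12_15": {"parents": ["src_txn_2025_12_15"]},
    "src_txn_2025_12_15": {"parents": []},
}

def ancestors(artifact_id, index):
    """Walk parents transitively; returns every upstream artifact ID."""
    seen, stack = set(), [artifact_id]
    while stack:
        for parent in index[stack.pop()]["parents"]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Everything needed to rebuild the training set for this model:
print(sorted(ancestors("mdl_fraud_lgbm_v4", lineage)))
```

The same traversal doubles as the core of a simple lineage viewer: start from any model_id and print the chain back to its source snapshots.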
Mini challenge
Given a production alert about prediction shift on 2026-01-01, list the minimal lineage fields you would inspect first and the likely root causes they reveal. Keep your answer under 7 bullet points.
Who this is for
- MLOps Engineers ensuring reproducibility and governance.
- Data Scientists who need reliable experiments and audits.
- Data Engineers maintaining feature pipelines.
Prerequisites
- Basic Git usage and environment pinning (e.g., lockfiles or containers).
- Understanding of feature engineering and training workflows.
- Ability to write/read JSON and logs.
Learning path
- Versioned data snapshots.
- Feature set versioning and definitions.
- Source-to-feature-to-model lineage (this lesson).
- Model registry and deployment lineage.
- Monitoring and drift with lineage-based root cause analysis.
Next steps
- [ ] Add lineage logging to one job this week (source snapshot ID, feature_set_id, parents).
- [ ] Extend to labels and training datasets.
- [ ] Build a simple lineage viewer from your JSON records.