Why this matters
- [ ] Define IDs for source snapshot, feature set, and model.
- [ ] Include a stable business key + event_time for joining.
- [ ] Show an example row with fields needed for traceability.
Solution

```json
{
  "ids": {
    "source_snapshot_id": "src_txn_2025_12_15",
    "feature_set_id": "fs_txn_roll7_v3_2025_12_15",
    "model_id": "mdl_fraud_lgbm_v4"
  },
  "join_keys": ["account_id", "event_time"],
  "prediction_row": {
    "account_id": "A12345",
    "event_time": "2025-12-16T10:00:00Z",
    "feature_set_id": "fs_txn_roll7_v3_2025_12_15",
    "model_id": "mdl_fraud_lgbm_v4",
    "model_version_hash": "h_model_2233",
    "source_snapshot_id": "src_txn_2025_12_15"
  }
}
```

Exercise checklist
- [ ] Every artifact has an ID.
- [ ] Parents are explicitly listed.
- [ ] Code commit/environment recorded.
- [ ] Time windows and seeds captured.
- [ ] Data/model hashes present.
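The checklist above can be sketched as a small logging helper. Everything here — the function name, field layout, and the git/`sys.version` capture — is illustrative, not a fixed schema; adapt it to your own stack:

```python
import json
import subprocess
import sys
from datetime import datetime, timezone

def build_lineage_record(artifact_id, parents, seed, window_start, window_end):
    """Assemble a lineage record covering each checklist item."""
    try:
        # Code commit recorded; falls back gracefully outside a git checkout.
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"
    return {
        "artifact_id": artifact_id,                 # every artifact has an ID
        "parents": parents,                         # parents explicitly listed
        "code_commit": commit,                      # code version recorded
        "python_version": sys.version.split()[0],   # minimal environment pin
        "time_window": {"start": window_start, "end": window_end},
        "random_seed": seed,                        # seed captured
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

record = build_lineage_record(
    "fs_txn_roll7_v3_2025_12_15",
    parents=["src_txn_2025_12_15"],
    seed=42,
    window_start="2025-12-08T00:00:00Z",
    window_end="2025-12-15T00:00:00Z",
)
print(json.dumps(record, indent=2))
```

In practice you would replace the `python_version` field with a full lockfile or container digest, but even this minimal record lets you answer "which parents, which commit, which window, which seed" for any artifact.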
Common mistakes and self-check
- Only versioning code, not data snapshots. Self-check: Can you point to the exact rows used? If not, the snapshot is missing.
- Relying on table name without date/version. Self-check: If the table changed today, could you still rebuild yesterday's features?
- Not recording label lineage. Self-check: Does your model lineage include label source ID and definition?
- Skipping environment pinning. Self-check: Can you rebuild the env from a lockfile? If not, pin it.
- Using unstable joins (e.g., surrogate row_number). Self-check: Do you use business keys and event_time?
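One way to pass both the hash self-check and the stable-join self-check at once is a deterministic content hash that sorts rows by business keys before hashing. This is a minimal sketch with hypothetical field names; real snapshots would hash a serialized file or table export:

```python
import hashlib
import json

def snapshot_hash(rows, key_fields=("account_id", "event_time")):
    """Deterministic content hash of a data snapshot.

    Rows are sorted by stable business keys (never row_number), so the
    same data always yields the same hash regardless of load order.
    """
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in key_fields))
    payload = json.dumps(ordered, sort_keys=True).encode("utf-8")
    return "h_" + hashlib.sha256(payload).hexdigest()[:12]

rows_a = [
    {"account_id": "A12345", "event_time": "2025-12-16T10:00:00Z", "amount": 10.0},
    {"account_id": "A00001", "event_time": "2025-12-16T09:00:00Z", "amount": 5.0},
]
rows_b = list(reversed(rows_a))  # same data, different load order
assert snapshot_hash(rows_a) == snapshot_hash(rows_b)
```

If the hash recorded in a lineage record no longer matches a recomputed hash, the underlying table changed — exactly the "could you still rebuild yesterday's features?" failure above.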
Practical projects
- Personal lineage logger: Instrument a small pipeline to emit lineage JSON files per job. Store them in a folder and query with simple scripts.
- Drift investigation drill: Intentionally change a feature parameter and use lineage to identify the exact change that altered predictions.
- Rollback dry-run: Given a model_id, follow parents to rebuild the training dataset and retrain. Verify hashes match.
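For the rollback dry-run, "follow parents" can be as simple as a transitive walk over stored records. The in-memory index below is a stand-in, under assumed IDs, for a folder of lineage JSON files loaded into a dict:

```python
# Hypothetical lineage index: artifact_id -> record with a "parents" list.
lineage = {
    "mdl_fraud_lgbm_v4": {"parents": ["fs_txn_roll7_v3_2025_12_15",
                                      "lbl_fraud_2025_12_15"]},
    "fs_txn_roll7_v3_2025_12_15": {"parents": ["src_txn_2025_12_15"]},
    "lbl_fraud_2025_12_15": {"parents": ["src_txn_2025_12_15"]},
    "src_txn_2025_12_15": {"parents": []},
}

def ancestors(artifact_id, index):
    """Walk parents transitively; returns every upstream artifact ID."""
    seen, stack = set(), [artifact_id]
    while stack:
        for parent in index[stack.pop()]["parents"]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Everything needed to rebuild the training set for this model:
print(sorted(ancestors("mdl_fraud_lgbm_v4", lineage)))
```

The same traversal doubles as the core of a simple lineage viewer: start from any model_id and print the chain back to its source snapshots.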
Mini challenge
Given a production alert about prediction shift on 2026-01-01, list the minimal lineage fields you would inspect first and the likely root causes they reveal. Keep your answer under 7 bullet points.
Who this is for
- MLOps Engineers ensuring reproducibility and governance.
- Data Scientists who need reliable experiments and audits.
- Data Engineers maintaining feature pipelines.
Prerequisites
- Basic Git usage and environment pinning (e.g., lockfiles or containers).
- Understanding of feature engineering and training workflows.
- Ability to write/read JSON and logs.
Learning path
- Versioned data snapshots.
- Feature set versioning and definitions.
- Source-to-feature-to-model lineage (this lesson).
- Model registry and deployment lineage.
- Monitoring and drift with lineage-based root cause analysis.
Next steps
- [ ] Add lineage logging to one job this week (source snapshot ID, feature_set_id, parents).
- [ ] Extend to labels and training datasets.
- [ ] Build a simple lineage viewer from your JSON records.