Why this matters
Data lineage shows where data comes from, how it changes, and where it goes. As a Data Engineer, you use lineage to:
- Debug broken dashboards by tracing the exact upstream table/column that changed.
- Do impact analysis for schema changes before you deploy.
- Prove compliance: show sources and transformations for sensitive fields.
- Reduce cost and risk by removing duplicate pipelines and orphaned tables.
- Speed up onboarding: new teammates see clear paths from sources to products.
Concept explained simply
Data lineage is the map of your data’s journey. It connects sources, transformations, and outputs as a directed graph.
Mental model
Think of a subway map:
- Stations = datasets/tables/views/files.
- Lines = transformations (SQL, code, ETL) that move and reshape data.
- Trains = rows/columns flowing through. Colors can represent columns or data classes (like PII).
Levels of lineage
- Asset-level: table/view/job relationships.
- Column-level: how each column is derived or mapped.
- Run-level: which job run created which version of a dataset.
- Business-level: semantic meaning ("Customer" = people who made a purchase in the last 12 months).
Directions and types
- Backward (upstream) lineage: from a report back to original sources.
- Forward (downstream) lineage: from a source out to all consumers.
- Technical lineage: extracted from code/logs (SQL, jobs, orchestrators).
- Business lineage: described in plain language for stakeholders.
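To make backward and forward lineage concrete: if edges are stored one row per source-to-target pair (a fuller table sketch appears under "Minimal metadata to capture" below), upstream traversal becomes a recursive query. This is a minimal sketch; the lineage_edges table and the asset name are assumptions for illustration, and the recursive CTE syntax is Postgres-style.
SQL:
-- Backward (upstream) lineage: every asset that feeds mart.daily_sales,
-- directly or transitively. lineage_edges(source_asset, target_asset) is
-- a hypothetical table holding one row per edge.
WITH RECURSIVE upstream AS (
    SELECT source_asset, target_asset
    FROM lineage_edges
    WHERE target_asset = 'mart.daily_sales'
    UNION ALL
    SELECT e.source_asset, e.target_asset
    FROM lineage_edges e
    JOIN upstream u ON e.target_asset = u.source_asset
)
SELECT DISTINCT source_asset FROM upstream;
Forward (downstream) lineage is the same idea with the filter placed on source_asset and the join direction reversed.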
Minimal metadata to capture
- Source asset, operation (e.g., SELECT/JOIN/AGGREGATE), target asset.
- Timestamp and job run ID/version.
- Column mappings (when feasible).
- Owner/team and tool that performed the change.
- Hash or reference to code/config used.
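One way to hold this metadata is a pair of tables: one row per asset-level edge, plus optional column mappings keyed to that edge. This is a minimal sketch under assumed names; real catalogs and lineage tools use richer models.
SQL:
-- Hypothetical storage for asset-level edges and column-level mappings.
CREATE TABLE lineage_edges (
    edge_id        BIGINT PRIMARY KEY,
    source_asset   TEXT NOT NULL,       -- e.g. 'staging.orders'
    operation      TEXT NOT NULL,       -- e.g. 'AGGREGATE', 'JOIN'
    target_asset   TEXT NOT NULL,       -- e.g. 'mart.daily_sales'
    event_time     TIMESTAMP NOT NULL,
    run_id         TEXT NOT NULL,       -- orchestrator/job run ID
    tool           TEXT,                -- e.g. 'dbt', 'Airflow', 'BI refresh'
    owner_team     TEXT,
    code_ref       TEXT                 -- commit hash or config reference
);
CREATE TABLE lineage_column_mappings (
    edge_id        BIGINT REFERENCES lineage_edges (edge_id),
    source_asset   TEXT NOT NULL,       -- needed when an edge has several sources
    source_column  TEXT NOT NULL,
    target_column  TEXT NOT NULL,
    transform      TEXT                 -- e.g. 'SUM', 'CAST to date'
);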
Worked examples
Example 1: Simple SQL transformation
SQL:
CREATE TABLE mart.daily_sales AS
SELECT o.order_date::date AS order_day,
       SUM(o.order_total) AS gross_sales
FROM staging.orders o
GROUP BY 1;
- Asset-level lineage: staging.orders -[aggregate]-> mart.daily_sales
- Column-level:
- mart.daily_sales.order_day <= staging.orders.order_date (cast to date)
- mart.daily_sales.gross_sales <= SUM(staging.orders.order_total)
- Run-level: job run 2026-01-08T02:15Z produced mart.daily_sales v12
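Captured in the hypothetical tables from "Minimal metadata to capture", Example 1 becomes one edge row plus two column-mapping rows. The run ID, tool, team, and code ref below are illustrative values, not part of the original job.
SQL:
INSERT INTO lineage_edges
    (edge_id, source_asset, operation, target_asset, event_time, run_id, tool, owner_team, code_ref)
VALUES
    (1, 'staging.orders', 'AGGREGATE', 'mart.daily_sales',
     TIMESTAMP '2026-01-08 02:15:00', 'run_0042', 'dbt', 'sales-data', 'abc123f');
INSERT INTO lineage_column_mappings (edge_id, source_asset, source_column, target_column, transform)
VALUES
    (1, 'staging.orders', 'order_date',  'order_day',   'CAST to date'),
    (1, 'staging.orders', 'order_total', 'gross_sales', 'SUM');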
Example 2: Join with column derivations
SQL:
CREATE VIEW mart.customer_region_sales AS
SELECT c.customer_id,
       r.region_name,
       SUM(o.order_total) AS total
FROM dim.customers c
JOIN dim.regions r ON c.region_id = r.id
JOIN fact.orders o ON o.customer_id = c.customer_id
GROUP BY 1, 2;
- Assets: dim.customers + dim.regions + fact.orders -[join/aggregate]-> mart.customer_region_sales
- Column-level mappings:
- customer_id <= dim.customers.customer_id
- region_name <= dim.regions.region_name (via c.region_id = r.id)
- total <= SUM(fact.orders.order_total)
- Sensitive data note: if customer_id is PII, tag the downstream column accordingly (see the sketch below).
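A quick way to find the columns that inherit that tag, assuming the hypothetical lineage tables sketched earlier are populated:
SQL:
-- Downstream columns derived from dim.customers.customer_id (one hop).
SELECT e.target_asset, m.target_column
FROM lineage_column_mappings m
JOIN lineage_edges e ON e.edge_id = m.edge_id
WHERE m.source_asset = 'dim.customers'
  AND m.source_column = 'customer_id';
Repeat the query (or make it recursive) to follow the tag through further hops.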
Example 3: Orchestrated pipeline across tools
Flow:
- Ingest: S3 raw/orders/*.json -[ingestion job]-> bronze.orders_raw
- Clean: bronze.orders_raw -[dbt model]-> silver.orders_clean
- Model: silver.orders_clean + dim.calendar -[dbt model]-> gold.daily_revenue
- Serve: gold.daily_revenue -[BI refresh]-> Dashboard "Daily Revenue"
The lineage graph connects cloud storage, warehouse layers, and BI. Each edge stores the tool (Airflow/dbt/BI), run ID, and code ref, as shown below.
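Recorded in the same hypothetical edge table, the four hops differ mainly in their tool values. All run IDs, timestamps, and the dashboard asset name below are illustrative.
SQL:
INSERT INTO lineage_edges
    (edge_id, source_asset, operation, target_asset, event_time, run_id, tool)
VALUES
    (10, 's3://raw/orders/*.json',  'INGEST',         'bronze.orders_raw',
         TIMESTAMP '2026-01-08 01:00:00', 'airflow_run_981', 'Airflow'),
    (11, 'bronze.orders_raw',       'CLEAN',          'silver.orders_clean',
         TIMESTAMP '2026-01-08 01:20:00', 'dbt_run_314',     'dbt'),
    (12, 'silver.orders_clean',     'JOIN/AGGREGATE', 'gold.daily_revenue',
         TIMESTAMP '2026-01-08 01:40:00', 'dbt_run_314',     'dbt'),
    (13, 'dim.calendar',            'JOIN/AGGREGATE', 'gold.daily_revenue',
         TIMESTAMP '2026-01-08 01:40:00', 'dbt_run_314',     'dbt'),
    (14, 'gold.daily_revenue',      'SERVE',          'bi.daily_revenue_dashboard',
         TIMESTAMP '2026-01-08 02:00:00', 'bi_refresh_55',   'BI');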
How to capture lineage (practical steps)
- Inventory assets: list sources, staging, core models, marts, and BI datasets.
- Collect edges automatically when possible: parse SQL, job configs, and execution logs.
- Normalize names: use consistent qualifiers (catalog.schema.table) and stable IDs.
- Record minimal metadata per edge: source, operation, target, timestamp, run ID, tool, code hash.
- Capture column mappings for key models: especially for regulated or high-value tables.
- Validate: spot-check with sample queries; compare expected vs observed read/write sets.
- Visualize: render a graph view filtered by system, domain, or sensitivity.
- Automate updates: run lineage extraction on each deploy or scheduled interval.
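One way to implement the validation step: compare stored edges against what a run actually read, using the warehouse's access or query history. The observed_reads table below is an assumption standing in for that history; every platform exposes it differently, and the run ID is illustrative.
SQL:
-- Upstream assets a run actually read but that are missing from stored lineage.
SELECT o.asset AS unrecorded_source
FROM observed_reads o
LEFT JOIN lineage_edges e
       ON e.run_id = o.run_id
      AND e.source_asset = o.asset
WHERE o.run_id = 'dbt_run_314'
  AND e.source_asset IS NULL;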
Who this is for
- Data Engineers who ship and maintain pipelines.
- Analytics Engineers who own models and BI datasets.
- Platform Engineers enabling governance and metadata.
Prerequisites
- Comfort with SQL (SELECT, JOIN, GROUP BY, CTEs).
- Basic understanding of your data platform (warehouse/lake, orchestrator).
- Familiarity with dataset naming conventions and ownership.
Learning path
- Access and data classification basics (sensitivity, ownership).
- Data lineage concepts (this lesson): assets, edges, directions, granularity.
- Provenance and audit trails (runs, versions, who changed what).
- Controls and monitoring (policy checks, freshness, incident playbooks).
- Lifecycle and cleanup (deprecation with impact analysis).
Exercises
Try these practical tasks; the worked examples above cover the same patterns.
- Exercise 1: Trace lineage for a daily sales table. Given a transform that creates mart.daily_sales from staging.orders, describe asset-level and column-level lineage, including operation types and any casts.
- Exercise 2: Write a lineage statement for a join. Given a view joining dim.customers, dim.regions, and fact.orders, document source assets, key joins, and column mappings.
- Exercise 3: Impact analysis (backward vs forward). If dim.regions adds a new column region_code and renames region_name to name, identify the backward and forward lineage paths and list the impacted downstream assets and columns.
Common mistakes and how to self-check
Ignoring column-level lineage
Risk: hidden PII propagation and broken reports. Self-check: for the top 10 critical tables, map the columns that feed KPIs or sensitive fields.
Stale lineage (not tied to runs/versions)
Risk: false confidence during incidents. Self-check: can you answer "which run created this table version?" If not, add run IDs and timestamps.
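With the edge table sketched earlier, that self-check becomes a short query; the asset name is illustrative, and LIMIT syntax varies by platform.
SQL:
-- Most recent run that wrote mart.daily_sales.
SELECT run_id, event_time, tool
FROM lineage_edges
WHERE target_asset = 'mart.daily_sales'
ORDER BY event_time DESC
LIMIT 1;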
Inconsistent naming and owners
Risk: fragmented graphs. Self-check: enforce a naming convention and ensure each asset has an owner/team.
Missing external edges
Risk: blind spots with files/BI extracts. Self-check: add ingestion edges for object storage, and serving edges into BI tools.
No validation
Risk: incorrect graphs from parser errors. Self-check: sample-run comparisons between actual read/write sets and stored lineage.
Practical projects
- Build a lineage map for one domain (e.g., Sales): sources -> staging -> core -> marts -> dashboards. Include at least 3 column-level mappings.
- Create an impact analysis playbook: steps to check before renaming/dropping a column, plus communication template to affected teams.
- Add lineage requirements to code reviews: any SQL change includes a brief lineage diff (assets and key column mappings).
Mini challenge
Pick a KPI dashboard and, in 10 minutes, sketch its upstream lineage to raw sources and the top three transformations. Mark where data types change or where PII enters/leaves. What’s the riskiest edge? How would you validate it?
Next steps
- Document lineage for one critical job today; automate extraction this week.
- Tag sensitive columns and verify their downstream lineage.
- Take the quick test to check your understanding. Anyone can take it; sign in to save your progress.