
Lineage Source To Report

Learn source-to-report lineage for free with explanations, exercises, and a quick test (for Data Architects).

Published: January 18, 2026 | Updated: January 18, 2026

Source-to-Report Lineage: What It Is

Source-to-report lineage is the end-to-end map showing how data moves from original sources through ingestion, transformations, semantic models, and finally into dashboards and reports. It connects columns, datasets, and calculations across every step so you can answer: Where did this metric come from? What breaks if we change this field? Who owns each step?

Why this matters

  • Root-cause analysis: Trace incorrect numbers back to the exact transformation or source change.
  • Impact analysis: Predict which dashboards break if a column is renamed or logic changes.
  • Governance and trust: Show auditors and stakeholders how metrics are calculated.
  • Developer velocity: Reduce guesswork when refactoring pipelines or adding new metrics.

Concept explained simply

Think of lineage like a subway map:

  • Stations = datasets (tables, views, files) and reports.
  • Tracks = transformation steps (SQL, code, or business rules).
  • Passengers = columns traveling through the system.

Two useful views:

  • Business lineage: High-level flow (Source → Data Warehouse → Semantic Model → Report). Great for stakeholders.
  • Technical lineage: Column-level dependencies and exact transformations (expressions, joins, filters). Essential for engineers and architects.

Mental model

Model your estate as a directed graph:

  • Nodes: sources, staging, transforms, facts/dimensions, semantic models, reports.
  • Edges: derivations (e.g., target.column derived from source.column via expression).
  • Grain and keys: Keep track of natural keys, surrogate keys, and grain changes (row → aggregate).
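
A minimal sketch of this graph as two relational tables follows; every table, column, and example value is illustrative rather than a standard.

-- Hypothetical dataset-level lineage tables; all names are illustrative.
CREATE TABLE lineage_node (
  node_id     text PRIMARY KEY,   -- e.g. 'stg.customers_clean'
  node_type   text NOT NULL,      -- source | stage | transform | semantic | report
  owner_email text,
  grain       text                -- e.g. 'one row per customer'
);

CREATE TABLE lineage_edge (
  from_node_id text REFERENCES lineage_node (node_id),
  to_node_id   text REFERENCES lineage_node (node_id),
  job_id       text,              -- job or pipeline that materializes this hop
  PRIMARY KEY (from_node_id, to_node_id)
);

-- Nodes and edges for the customer flow used in the worked examples below
INSERT INTO lineage_node (node_id, node_type, owner_email, grain) VALUES
  ('crm.customers',       'source',    'crm-team@example.com', 'one row per customer'),
  ('stg.customers_clean', 'stage',     'data-eng@example.com', 'one row per customer'),
  ('dim_customer',        'transform', 'data-eng@example.com', 'one row per customer version (SCD2)');

INSERT INTO lineage_edge (from_node_id, to_node_id, job_id) VALUES
  ('crm.customers',       'stg.customers_clean', 'job_stg_customers'),
  ('stg.customers_clean', 'dim_customer',        'job_dim_customer');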

Core building blocks

  • Entities to represent:
    • Sources (CRM, ERP, logs)
    • Ingestion/staging layers
    • Transformations (SQL, notebooks, jobs)
    • Dimensional/fact models
    • Semantic models (BI datasets, metric layers)
    • Reports/dashboards and fields
  • Metadata to capture (minimum viable; a table sketch follows this list):
    • Dataset name, owner, schedule
    • Column names, data types, primary/foreign keys
    • Transformation expressions, filters, joins, aggregates
    • Load strategy (full, incremental), SCD type if used
    • Quality checks (nulls, uniqueness)
  • Levels of detail:
    • Dataset-level lineage (fast to start)
    • Column-level lineage (high value for impact analysis)
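
A sketch of the minimum-viable metadata above as a single queryable table; the column names and example values are illustrative, not a required schema.

-- Hypothetical operational-metadata table; adapt names to your catalog or warehouse.
CREATE TABLE dataset_metadata (
  dataset_name     text PRIMARY KEY,  -- e.g. 'dim_customer'
  owner_email      text,
  refresh_schedule text,              -- e.g. 'daily 02:00 UTC'
  load_strategy    text,              -- 'full' or 'incremental'
  scd_type         text,              -- NULL, 'SCD1', or 'SCD2'
  quality_checks   text               -- e.g. 'not_null(business_key); unique(surrogate_key)'
);

INSERT INTO dataset_metadata VALUES
  ('dim_customer', 'data-eng@example.com', 'daily 02:00 UTC', 'incremental', 'SCD2',
   'not_null(business_key); unique(surrogate_key)');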

Worked examples

Example 1: Customer dimension for an Active Customers report

Flow: crm.customers → stg.customers_clean → dim_customer (SCD2) → semantic model → report card "Active Customers YTD".

Key lineage points
  • stg.customers_clean.email_normalized ← lower(crm.customers.email)
  • dim_customer.surrogate_key ← generated key (row versioning)
  • dim_customer.business_key ← stg.customers_clean.customer_id
  • dim_customer.status ← CASE on stg.customers_clean.is_active
  • Report field Active Customers YTD ← COUNT DISTINCT dim_customer.business_key filtered by status = 'active' and date in YTD (sketched below)
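
As a sketch, the report field's SQL might look like the query below; is_current and created_date are assumed columns on dim_customer (the SCD2 current-row flag and the customer's signup date), and the exact definition normally lives in the semantic model.

-- Sketch of "Active Customers YTD"; is_current and created_date are assumed columns.
SELECT COUNT(DISTINCT d.business_key) AS active_customers_ytd
FROM dim_customer AS d
WHERE d.status = 'active'
  AND d.is_current = true
  AND d.created_date >= date_trunc('year', current_date);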

Example 2: Monthly revenue aggregation

Flow: orders_raw → stg_orders → fct_order_line → fct_revenue_monthly → metric Revenue (report tile).

Key lineage points
  • fct_order_line.revenue_local ← quantity * unit_price
  • fct_order_line.revenue_usd ← revenue_local * fx_rate_on_order_date
  • fct_revenue_monthly.revenue_usd ← SUM(fct_order_line.revenue_usd) GROUP BY month, product_id
  • Report KPI Revenue ← SUM over report date window; filter by region if selected

Impact insight: A change to the FX-rate logic propagates to fct_revenue_monthly and every report that uses it; a sketch of the monthly roll-up follows.
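
A sketch of the fct_order_line to fct_revenue_monthly hop; order_date is an assumed column on fct_order_line.

-- Monthly roll-up; grain changes from one row per order line to one row per month and product.
CREATE TABLE fct_revenue_monthly AS
SELECT
  date_trunc('month', l.order_date) AS revenue_month,  -- order_date is an assumed column
  l.product_id,
  SUM(l.revenue_usd)                AS revenue_usd
FROM fct_order_line AS l
GROUP BY 1, 2;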

Example 3: Gross Margin % metric

Flow: fct_order_line + cogs_by_sku → fct_margin_daily → semantic metric Margin% → Dashboard tile.

Key lineage points
  • fct_margin_daily.gross_margin ← revenue_usd - cogs_usd
  • fct_margin_daily.gross_margin_pct ← gross_margin / NULLIF(revenue_usd, 0)
  • Margin% tile ← simple average or revenue-weighted calculation, depending on the agreed definition; document the choice (a sketch of the daily margin step follows)
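
A sketch of the fct_margin_daily step; order_date on fct_order_line and cogs_usd_per_unit on cogs_by_sku are assumed columns, and the COGS join is simplified to a static per-SKU cost.

-- Daily margin at one row per day and product; note the grain change from order lines.
SELECT
  l.order_date                                                AS margin_date,
  l.product_id,
  SUM(l.revenue_usd)                                          AS revenue_usd,
  SUM(l.quantity * c.cogs_usd_per_unit)                       AS cogs_usd,
  SUM(l.revenue_usd) - SUM(l.quantity * c.cogs_usd_per_unit)  AS gross_margin,
  (SUM(l.revenue_usd) - SUM(l.quantity * c.cogs_usd_per_unit))
    / NULLIF(SUM(l.revenue_usd), 0)                           AS gross_margin_pct
FROM fct_order_line AS l
JOIN cogs_by_sku AS c ON c.product_id = l.product_id
GROUP BY 1, 2;

Computing gross_margin_pct from the summed components, rather than averaging per-row percentages, yields the revenue-weighted figure; whichever definition the Margin% tile uses, record it alongside the lineage.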

Designing your lineage model

Start simple and evolve:

  • Nodes (datasets): name, type (source/stage/transform/semantic/report), owner
  • Edges (dataset-level): from_dataset → to_dataset, job_id, schedule
  • Column lineage (optional next step): target_dataset, target_column, source_dataset, source_column, expression

Minimal column-lineage record

target_dataset: stg.customers_clean
target_column: email_normalized
source_dataset: crm.customers
source_column: email
expression: lower(email)
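
Stored as rows in a small table, such records become queryable for later impact analysis. A sketch, with an illustrative table name:

-- Hypothetical column-level lineage table mirroring the record above.
CREATE TABLE column_lineage (
  target_dataset text,
  target_column  text,
  source_dataset text,
  source_column  text,
  expression     text,
  PRIMARY KEY (target_dataset, target_column, source_dataset, source_column)
);

INSERT INTO column_lineage VALUES
  ('stg.customers_clean', 'email_normalized', 'crm.customers', 'email', 'lower(email)');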

Documenting grain and keys
  • Grain: e.g., fct_order_line is one row per order line
  • Keys: natural key (order_id), surrogate key (order_line_sk)
  • Note grain changes on aggregates (daily → monthly)

Step-by-step: implement lineage for a new domain

  1. Choose scope: One metric or dashboard with 5–10 datasets.
  2. Inventory nodes: List all sources, tables, views, semantic models, and report tiles.
  3. Draw dataset-level flow: Source → staging → transforms → semantic → report.
  4. Capture column mappings: For each target column, note source column(s) and the expression.
  5. Add keys and grain: Record primary keys, foreign keys, and grain at each step.
  6. Record operational metadata: Owner, schedule, load type, quality checks.
  7. Validate with users: Walk through lineage with analysts and report owners.

Pro tip: prioritize

Track column-level lineage for critical metrics first (executive KPIs, regulatory reports). Keep the rest at dataset-level until needed.

Governance and change management

  • Version lineage: Keep history when expressions or sources change.
  • Impact analysis workflow:
    • Proposed change → list affected downstream nodes and report fields
    • Plan updates and tests → implement → validate (a query sketch follows this list)
  • Ownership: Each node should have an accountable owner.
  • Quality gates: Add checks for nulls, duplicates, and referential integrity at key hops.
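
With dataset-level edges stored as in the earlier lineage_edge sketch, the "list affected downstream nodes" step can be a recursive query; this assumes that hypothetical table exists and the graph is acyclic.

-- Everything downstream of crm.customers, however many hops away.
WITH RECURSIVE downstream AS (
  SELECT to_node_id
  FROM lineage_edge
  WHERE from_node_id = 'crm.customers'
  UNION
  SELECT e.to_node_id
  FROM lineage_edge AS e
  JOIN downstream AS d ON e.from_node_id = d.to_node_id
)
SELECT to_node_id AS affected_node
FROM downstream;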

Who this is for

  • Data Architects and Senior Data Engineers designing data platforms
  • Analytics Engineers and BI Developers defining semantic layers
  • Data Governance practitioners documenting lineage

Prerequisites

  • Comfort with SQL (SELECT, JOIN, GROUP BY, expressions)
  • Basic data modeling (dimensions, facts, keys)
  • Understanding of ETL/ELT and scheduling

Learning path

  • Start: Dataset-level lineage for 1–2 dashboards
  • Next: Column-level lineage for key metrics
  • Then: Add grain/keys and SCD documentation
  • Later: Automate extraction from jobs and SQL; add tests and impact analysis

Common mistakes and how to self-check

  • Missing grain/keys. Self-check: For each dataset, can you state the row grain and primary key? (A check query follows this list.)
  • Ignoring filters. Self-check: Are WHERE clauses and security filters captured in lineage?
  • Only dataset-level lineage. Self-check: For top 5 metrics, do you have column-level expressions?
  • No owners. Self-check: Does every node list a responsible owner?
  • Not versioning changes. Self-check: Can you show yesterday's lineage vs today's after a refactor?
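
For the grain-and-keys self-check, a duplicate count on the declared key is usually enough; the sketch below uses the order-line fact and its surrogate key from the grain notes above.

-- Returns zero rows if order_line_sk is truly unique at the declared grain.
SELECT order_line_sk, COUNT(*) AS row_count
FROM fct_order_line
GROUP BY order_line_sk
HAVING COUNT(*) > 1;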

Practical projects

  • Document an executive KPI end-to-end (column-level), including expressions and grain changes.
  • Add impact analysis notes for a planned column rename in a source table.
  • Create a lineage "playbook" template your team can reuse.

Exercises

Exercise 1: Map column-level lineage for a cleaning step (ID: ex1)

You have source crm.customers with columns: customer_id, full_name, email, is_active, created_at. A staging transform produces:

SELECT
  c.customer_id AS customer_id,
  split_part(c.full_name, ' ', 1) AS first_name,
  split_part(c.full_name, ' ', 2) AS last_name,
  lower(c.email) AS email_normalized,
  CASE WHEN c.is_active THEN 'active' ELSE 'inactive' END AS status,
  c.created_at::date AS created_date
FROM crm.customers c
WHERE c.email IS NOT NULL;

Task: Write lineage mappings of the form target_column <- source_column(s) via expression. Include any filters that affect row lineage.

  • [ ] Listed a mapping line for each target column
  • [ ] Noted the WHERE filter in lineage notes
  • [ ] Captured expression details (functions, CASE)

Exercise 2: Do impact analysis for a column rename (ID: ex2)

Assume crm.customers.email is renamed to email_address and the upstream system now uppercases values. Identify which downstream objects are affected in this pipeline: crm.customers → stg.customers_clean → dim_customer → semantic model → Active Customers report. Provide concrete steps to update lineage and minimize downtime.

  • [ ] Listed all affected transforms and report fields
  • [ ] Specified code and lineage updates
  • [ ] Included validation steps and communication to stakeholders

Mini challenge

Pick one dashboard tile in your environment (e.g., "Revenue by Region MTD"). On a single page, list:

  • All upstream datasets back to sources
  • For the final metric, its exact expression and filters
  • Grain transitions and keys used at each hop
  • One potential change (e.g., filter tweak) and its downstream impact

Next steps

  • Automate lineage capture for your most critical jobs (start with SQL parsing or job metadata).
  • Add ownership and schedules to each node; create a monthly review of top metrics' lineage.
  • Pilot impact analysis: simulate a change and rehearse the rollout steps.


Practice Exercises

2 exercises to complete

Instructions

For Exercise 1 (crm.customers to stg.customers_clean), use the SELECT statement shown earlier and write lineage mappings of the form target_column <- source_column(s) via expression. Include filter details that affect row lineage.


Expected Output

customer_id <- crm.customers.customer_id
first_name <- crm.customers.full_name via split_part(full_name, ' ', 1)
last_name <- crm.customers.full_name via split_part(full_name, ' ', 2)
email_normalized <- crm.customers.email via lower(email)
status <- crm.customers.is_active via CASE WHEN is_active THEN 'active' ELSE 'inactive' END
created_date <- crm.customers.created_at via ::date
Row filter: WHERE crm.customers.email IS NOT NULL

Lineage Source To Report — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
