Source-to-Report Lineage: What It Is
Source-to-report lineage is the end-to-end map showing how data moves from original sources through ingestion, transformations, semantic models, and finally into dashboards and reports. It connects columns, datasets, and calculations across every step so you can answer: Where did this metric come from? What breaks if we change this field? Who owns each step?
Why this matters
- Root-cause analysis: Trace incorrect numbers back to the exact transformation or source change.
- Impact analysis: Predict which dashboards break if a column is renamed or logic changes.
- Governance and trust: Show auditors and stakeholders how metrics are calculated.
- Developer velocity: Reduce guesswork when refactoring pipelines or adding new metrics.
Concept explained simply
Think of lineage like a subway map:
- Stations = datasets (tables, views, files) and reports.
- Tracks = transformation steps (SQL, code, or business rules).
- Passengers = columns traveling through the system.
Two useful views:
- Business lineage: High-level flow (Source → Data Warehouse → Semantic Model → Report). Great for stakeholders.
- Technical lineage: Column-level dependencies and exact transformations (expressions, joins, filters). Essential for engineers and architects.
Mental model
Model your estate as a directed graph:
- Nodes: sources, staging, transforms, facts/dimensions, semantic models, reports.
- Edges: derivations (e.g., target.column derived from source.column via expression).
- Grain and keys: Keep track of natural keys, surrogate keys, and grain changes (row → aggregate).
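One way to make this graph concrete is a pair of metadata tables, one for nodes and one for edges. A minimal sketch follows; the schema and names (lineage_node, lineage_edge) are illustrative, not from any particular tool:

-- One row per dataset, semantic model, or report field
CREATE TABLE lineage_node (
  node_id   VARCHAR PRIMARY KEY,   -- e.g., 'crm.customers'
  node_name VARCHAR NOT NULL,
  node_type VARCHAR NOT NULL,      -- source | stage | transform | semantic | report
  owner     VARCHAR,
  schedule  VARCHAR,               -- e.g., 'daily 02:00 UTC'
  grain     VARCHAR                -- e.g., 'one row per order line'
);

-- One row per hop between nodes (a derivation)
CREATE TABLE lineage_edge (
  from_node_id VARCHAR NOT NULL REFERENCES lineage_node (node_id),
  to_node_id   VARCHAR NOT NULL REFERENCES lineage_node (node_id),
  job_id       VARCHAR,            -- the job or model that performs the hop
  PRIMARY KEY (from_node_id, to_node_id)
);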
Core building blocks
- Entities to represent:
- Sources (CRM, ERP, logs)
- Ingestion/staging layers
- Transformations (SQL, notebooks, jobs)
- Dimensional/fact models
- Semantic models (BI datasets, metric layers)
- Reports/dashboards and fields
- Metadata to capture (minimum viable; an example row follows below):
- Dataset name, owner, schedule
- Column names, data types, primary/foreign keys
- Transformation expressions, filters, joins, aggregates
- Load strategy (full, incremental), SCD type if used
- Quality checks (nulls, uniqueness)
- Levels of detail:
- Dataset-level lineage (fast to start)
- Column-level lineage (high value for impact analysis)
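Continuing the sketch above, the minimum viable metadata for one node could be captured as a single row (all values here are hypothetical):

INSERT INTO lineage_node (node_id, node_name, node_type, owner, schedule, grain)
VALUES ('dwh.dim_customer', 'dim_customer', 'transform',
        'data-eng@example.com', 'daily 02:00 UTC',
        'one row per customer per SCD2 version');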
Worked examples
Example 1: Customer dimension for an Active Customers report
Flow: crm.customers → stg.customers_clean → dim_customer (SCD2) → semantic model → report card "Active Customers YTD".
Key lineage points
- stg.customers_clean.email_normalized ← lower(crm.customers.email)
- dim_customer.surrogate_key ← generated key (row versioning)
- dim_customer.business_key ← stg.customers_clean.customer_id
- dim_customer.status ← CASE on stg.customers_clean.is_active
- Report field Active Customers YTD ← COUNT DISTINCT dim_customer.business_key filtered by status = 'active' and date in YTD
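In SQL, the staging derivation and the report measure might look roughly like this. This is a sketch: the SCD2 bookkeeping columns (is_current, activated_date) and the PostgreSQL-style date functions are assumptions, and real YTD logic depends on your date semantics.

-- Staging: normalize email and derive status
SELECT
  c.customer_id,
  lower(c.email) AS email_normalized,
  CASE WHEN c.is_active THEN 'active' ELSE 'inactive' END AS status
FROM crm.customers AS c;

-- Report measure: Active Customers YTD
SELECT COUNT(DISTINCT d.business_key) AS active_customers_ytd
FROM dim_customer AS d
WHERE d.status = 'active'
  AND d.is_current                                           -- SCD2: current version only
  AND d.activated_date >= DATE_TRUNC('year', CURRENT_DATE);  -- hypothetical YTD filter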
Example 2: Monthly revenue aggregation
Flow: orders_raw → stg_orders → fct_order_line → fct_revenue_monthly → metric Revenue (report tile).
Key lineage points
- fct_order_line.revenue_local ← quantity * unit_price
- fct_order_line.revenue_usd ← revenue_local * fx_rate_on_order_date
- fct_revenue_monthly.revenue_usd ← SUM(fct_order_line.revenue_usd) GROUP BY month, product_id
- Report KPI Revenue ← SUM over report date window; filter by region if selected
Impact insight: Change to fx rate logic propagates to monthly revenue and all reports using it.
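Those hops could read roughly as follows. The fx_rates table and its join keys are an assumption about how fx_rate_on_order_date is sourced; adapt to your model.

-- fct_order_line: revenue in local currency and USD
SELECT
  o.order_line_id,
  o.product_id,
  o.order_date,
  o.quantity * o.unit_price              AS revenue_local,
  o.quantity * o.unit_price * fx.fx_rate AS revenue_usd
FROM stg_orders AS o
JOIN fx_rates AS fx
  ON fx.currency = o.currency
 AND fx.rate_date = o.order_date;

-- fct_revenue_monthly: grain change from order line to (month, product)
SELECT
  DATE_TRUNC('month', order_date) AS revenue_month,
  product_id,
  SUM(revenue_usd)                AS revenue_usd
FROM fct_order_line
GROUP BY 1, 2;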
Example 3: Gross Margin % metric
Flow: fct_order_line + cogs_by_sku → fct_margin_daily → semantic metric Margin% → Dashboard tile.
Key lineage points
- fct_margin_daily.gross_margin ← revenue_usd - cogs_usd
- fct_margin_daily.gross_margin_pct ← gross_margin / NULLIF(revenue_usd, 0)
- Margin% tile ← AVG or weighted calculation depending on the definition; document the choice
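A sketch of the daily fact and two possible tile formulas; fct_order_line_with_cogs stands in for the join of fct_order_line and cogs_by_sku. Which formula the tile uses is exactly the choice lineage should record.

-- fct_margin_daily
SELECT
  order_date,
  SUM(revenue_usd)                                          AS revenue_usd,
  SUM(revenue_usd - cogs_usd)                               AS gross_margin,
  SUM(revenue_usd - cogs_usd) / NULLIF(SUM(revenue_usd), 0) AS gross_margin_pct
FROM fct_order_line_with_cogs
GROUP BY order_date;

-- Tile option A: simple average of daily percentages (weights every day equally)
SELECT AVG(gross_margin_pct) AS margin_pct FROM fct_margin_daily;

-- Tile option B: revenue-weighted margin (often the intended definition)
SELECT SUM(gross_margin) / NULLIF(SUM(revenue_usd), 0) AS margin_pct FROM fct_margin_daily;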
Designing your lineage model
Start simple and evolve:
- Nodes (datasets): name, type (source/stage/transform/semantic/report), owner
- Edges (dataset-level): from_dataset → to_dataset, job_id, schedule
- Column lineage (optional next step): target_dataset, target_column, source_dataset, source_column, expression
Minimal column-lineage record
target_dataset: stg.customers_clean
target_column: email_normalized
source_dataset: crm.customers
source_column: email
expression: lower(email)
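Stored as a table, the same record might look like this (the schema is a sketch, not a standard):

CREATE TABLE column_lineage (
  target_dataset VARCHAR NOT NULL,
  target_column  VARCHAR NOT NULL,
  source_dataset VARCHAR NOT NULL,
  source_column  VARCHAR NOT NULL,
  expression     VARCHAR,
  PRIMARY KEY (target_dataset, target_column, source_dataset, source_column)
);

INSERT INTO column_lineage
VALUES ('stg.customers_clean', 'email_normalized',
        'crm.customers', 'email', 'lower(email)');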
Documenting grain and keys
- Grain: e.g., fct_order_line is one row per order line
- Keys: natural key (order_id), surrogate key (order_line_sk)
- Note grain changes on aggregates (daily → monthly)
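A documented grain is cheap to verify with a duplicate check on the grain key; for fct_order_line, something like:

-- Returns zero rows if the grain (one row per order line) holds
SELECT order_line_sk, COUNT(*) AS row_count
FROM fct_order_line
GROUP BY order_line_sk
HAVING COUNT(*) > 1;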
Step-by-step: implement lineage for a new domain
- Choose scope: One metric or dashboard with 5–10 datasets.
- Inventory nodes: List all sources, tables, views, semantic models, and report tiles.
- Draw dataset-level flow: Source → staging → transforms → semantic → report.
- Capture column mappings: For each target column, note source column(s) and the expression.
- Add keys and grain: Record primary keys, foreign keys, and grain at each step.
- Record operational metadata: Owner, schedule, load type, quality checks.
- Validate with users: Walk through lineage with analysts and report owners.
Pro tip: prioritize
Track column-level lineage for critical metrics first (executive KPIs, regulatory reports). Keep the rest at dataset-level until needed.
Governance and change management
- Version lineage: Keep history when expressions or sources change.
- Impact analysis workflow:
- Proposed change → list affected downstream nodes and report fields (a query sketch follows this list)
- Plan updates and tests → implement → validate
- Ownership: Each node should have an accountable owner.
- Quality gates: Add checks for nulls, duplicates, and referential integrity at key hops.
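If dataset-level edges are stored as in the earlier lineage_edge sketch, the downstream blast radius of a node can be listed with a recursive query (PostgreSQL-style syntax; the node id is hypothetical):

WITH RECURSIVE downstream AS (
  SELECT to_node_id
  FROM lineage_edge
  WHERE from_node_id = 'crm.customers'   -- the node being changed
  UNION
  SELECT e.to_node_id
  FROM lineage_edge AS e
  JOIN downstream AS d ON e.from_node_id = d.to_node_id
)
SELECT n.node_name, n.node_type, n.owner
FROM downstream AS ds
JOIN lineage_node AS n ON n.node_id = ds.to_node_id;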
Who this is for
- Data Architects and Senior Data Engineers designing data platforms
- Analytics Engineers and BI Developers defining semantic layers
- Data Governance practitioners documenting lineage
Prerequisites
- Comfort with SQL (SELECT, JOIN, GROUP BY, expressions)
- Basic data modeling (dimensions, facts, keys)
- Understanding of ETL/ELT and scheduling
Learning path
- Start: Dataset-level lineage for 1–2 dashboards
- Next: Column-level lineage for key metrics
- Then: Add grain/keys and SCD documentation
- Later: Automate extraction from jobs and SQL; add tests and impact analysis
Common mistakes and how to self-check
- Missing grain/keys. Self-check: For each dataset, can you state the row grain and primary key?
- Ignoring filters. Self-check: Are WHERE clauses and security filters captured in lineage?
- Only dataset-level lineage. Self-check: For top 5 metrics, do you have column-level expressions?
- No owners. Self-check: Does every node list a responsible owner?
- Not versioning changes. Self-check: Can you show yesterday's lineage vs today's after a refactor?
Practical projects
- Document an executive KPI end-to-end (column-level), including expressions and grain changes.
- Add impact analysis notes for a planned column rename in a source table.
- Create a lineage "playbook" template your team can reuse.
Exercises
Exercise 1: Map column-level lineage for a cleaning step (ID: ex1)
You have source crm.customers with columns: customer_id, full_name, email, is_active, created_at. A staging transform produces:
SELECT
  c.customer_id AS customer_id,
  split_part(c.full_name, ' ', 1) AS first_name,
  split_part(c.full_name, ' ', 2) AS last_name,
  lower(c.email) AS email_normalized,
  CASE WHEN c.is_active THEN 'active' ELSE 'inactive' END AS status,
  c.created_at::date AS created_date
FROM crm.customers c
WHERE c.email IS NOT NULL;
Task: Write lineage mappings of the form target_column <- source_column(s) via expression. Include any filters that affect row lineage.
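For example, the pass-through column would read: customer_id <- crm.customers.customer_id via direct copy. The remaining mappings follow the same pattern.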
- [ ] Listed a mapping line for each target column
- [ ] Noted the WHERE filter in lineage notes
- [ ] Captured expression details (functions, CASE)
Exercise 2: Do impact analysis for a column rename (ID: ex2)
Assume crm.customers.email is renamed to email_address and the upstream system now uppercases values. Identify which downstream objects are affected in this pipeline: crm.customers → stg.customers_clean → dim_customer → semantic model → Active Customers report. Provide concrete steps to update lineage and minimize downtime.
- [ ] Listed all affected transforms and report fields
- [ ] Specified code and lineage updates
- [ ] Included validation steps and communication to stakeholders
Mini challenge
Pick one dashboard tile in your environment (e.g., "Revenue by Region MTD"). On a single page, list:
- All upstream datasets back to sources
- For the final metric, its exact expression and filters
- Grain transitions and keys used at each hop
- One potential change (e.g., filter tweak) and its downstream impact
Next steps
- Automate lineage capture for your most critical jobs (start with SQL parsing or job metadata).
- Add ownership and schedules to each node; create a monthly review of top metrics' lineage.
- Pilot impact analysis: simulate a change and rehearse the rollout steps.