Why this matters
Lineage from source to BI shows how data moves from original systems through transformations into dashboards. As an Analytics Engineer, you use lineage to troubleshoot data issues, plan changes safely, estimate effort, and communicate the impact of upstream changes to stakeholders.
- Real task: A sales field is renamed in the CRM. With lineage, you quickly find all models, metrics, and dashboards affected.
- Real task: A KPI looks off. Lineage helps you trace how it’s calculated and where it might have drifted.
- Real task: You’re adding a new dimension. Lineage tells you which models to update and tests to add.
Who this is for
Analytics Engineers, BI Developers, and Data Analysts who maintain pipelines or dashboards and need clear, reliable documentation of dependencies.
Prerequisites
- Comfort reading SQL (SELECT, JOIN, CTEs).
- Basic understanding of your data warehouse layers (staging, models, marts).
- Familiarity with your BI tool’s datasets/metrics or dashboards.
Concept explained simply
Data lineage is a map of where data comes from, how it changes, and where it ends up.
Mental model
Think of your data platform as a metro map:
- Stations = datasets (tables/views, BI datasets, dashboards)
- Lines = transformations (SQL models, joins, metrics logic)
- Direction = upstream to downstream (source → warehouse → BI)
When a station closes (a schema change), you can instantly see which lines and stations are impacted.
Levels of lineage
- System-level: Source systems (CRM, app events, ERP)
- Dataset-level: Raw/staging, intermediate models, marts
- Column-level: Which columns feed which calculations
- Metric-level: How metrics are defined and aggregated
- BI-asset-level: Datasets, dashboards, tiles/charts
Key terms
- Upstream: Inputs that feed a dataset or metric
- Downstream: Assets that depend on a dataset or metric
- Data contract: Agreement on the schema, types, and SLAs of a source (see the sketch after this list)
- Freshness SLA: The maximum acceptable delay before data is refreshed (e.g., "no more than 2 hours stale")
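For example, a minimal data contract for a CRM source might look like the sketch below; every field name, type, and SLA value is illustrative, not a standard.
contract_for: crm.leads_raw
schema:
  - lead_id: string, not null, unique
  - status: string, one of [New, Qualified, Converted, Lost]
  - created_at: timestamp
freshness_sla: 2h
breaking_change_policy: 30-day notice to downstream owners before renames or type changes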
Minimal lineage spec you can use today
Use this compact spec to document any node (table/model/metric/dashboard). Keep it consistent and visible to your team.
node_name: <unique name>
node_type: source|staging|model|mart|metric|bi_dataset|dashboard
system: <where it lives, e.g., warehouse or BI>
owner: <team or person>
freshness_sla: <e.g., 2h>
pii: none|low|high
business_definition: <clear, human-readable meaning>
dependencies:
  - <upstream node name>
outputs:
  - <downstream node name>
columns:  # optional but recommended for key columns
  - name: <column>
    source_of: [<downstream column(s)/metric(s)>]
tests:
  - <freshness/schema/unique/not_null/business tests>
change_log:
  - <date>: <what changed and why>
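Here is the spec filled in for a hypothetical staging model (the same stg_leads used in Example 1 below); every name, owner, date, and SLA is illustrative:
node_name: stg_leads
node_type: staging
system: warehouse
owner: analytics-engineering
freshness_sla: 2h
pii: low
business_definition: One row per lead from crm.leads_raw, with types cast and status values standardized.
dependencies:
  - crm.leads_raw
outputs:
  - fct_lead_events
columns:
  - name: lead_id
    source_of: [total_leads]
tests:
  - unique(lead_id)
  - not_null(lead_id)
change_log:
  - 2024-05-01: Initial version.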
Worked examples
Example 1 — CRM Leads to BI Conversion Rate
Goal: Document lineage for a BI Conversion Rate tile.
- Source: crm.leads_raw
- Staging: stg_leads (clean types, standardize status)
- Model: fct_lead_events (one row per lead per status transition)
- Mart: mart_marketing (agg_lead_conversion)
- Metric: conversion_rate = converted_leads / total_leads
- BI: Dashboard tile "Lead Conversion Rate"
Key column-level lineage:
- converted_leads: count of fct_lead_events rows where status = 'Converted'
- total_leads: count of distinct stg_leads.lead_id
Common pitfall: A missing upstream filter on active campaigns can inflate the conversion rate. Capture that filter in business_definition and cover it with a test.
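Captured in the minimal spec, the metric node might look like the sketch below; the owner, SLA, and change_log entry are illustrative, and the active-campaign filter is the one called out in the pitfall above:
node_name: conversion_rate
node_type: metric
system: BI
owner: marketing-analytics
freshness_sla: 24h
pii: none
business_definition: Share of leads on active campaigns that reach 'Converted' status; converted_leads / total_leads.
dependencies:
  - mart_marketing.agg_lead_conversion
outputs:
  - dashboard tile "Lead Conversion Rate"
tests:
  - business: conversion_rate between 0 and 1
change_log:
  - 2024-06-10: Restricted both counts to active campaigns.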
Example 2 — App Events to Active Users Metric
- Source: events.raw_app_events
- Staging: stg_events (deduplicate, enforce required fields)
- Model: fct_sessions (sessionization)
- Mart: mart_product (daily_active_users)
- Metric: DAU = count(distinct user_id) where event_date = today
- BI: DAU chart
Column-level lineage:
- DAU.user_id ← stg_events.user_id (after dedupe rules)
- DAU.event_date ← stg_events.event_ts::date via fct_sessions rules
Note: Document the session timeout assumption (e.g., 30 minutes) in business_definition.
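A spec entry for the sessionization model makes that assumption explicit; the owner, SLA, and 30-minute timeout below are assumptions carried over from this example:
node_name: fct_sessions
node_type: model
system: warehouse
owner: product-analytics
freshness_sla: 6h
pii: low
business_definition: One row per session; a session ends after 30 minutes of inactivity (timeout assumption).
dependencies:
  - stg_events
outputs:
  - mart_product.daily_active_users
columns:
  - name: user_id
    source_of: [DAU]
  - name: event_date
    source_of: [DAU.event_date]
tests:
  - not_null(user_id)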
Example 3 — Payments to Revenue Dashboard
- Source: erp.payments_raw
- Staging: stg_payments
- Model: fct_payments (currency normalization)
- Mart: finance.revenue_daily
- Metric: revenue = sum(amount_usd) where status = 'captured'
- BI: Revenue dashboard (Total Revenue, Revenue by Region)
Column-level lineage:
- amount_usd ← stg_payments.amount_local × fx_rates.rate
- region ← customers.region via a join (document the join keys)
Risk: Changing FX source breaks comparability. Add a change_log entry and alert downstream owners before switching.
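The columns and change_log sections of the spec guard against exactly this risk; a sketch, with illustrative join keys and dates:
node_name: fct_payments
node_type: model
system: warehouse
owner: finance-analytics
business_definition: One row per captured payment, normalized to USD via daily FX rates.
dependencies:
  - stg_payments
  - fx_rates
  - customers
# join keys: stg_payments.customer_id = customers.customer_id;
#            stg_payments.currency + payment date match fx_rates
columns:
  - name: amount_usd
    source_of: [revenue]
  - name: region
    source_of: [Revenue by Region]
tests:
  - not_null(amount_usd)
change_log:
  - 2024-07-01: Switched FX rate source; notified revenue dashboard owners in advance.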
How to capture lineage in practice
- Inventory your BI assets: List dashboards and tiles that matter to stakeholders.
- Backtrack dependencies: For each tile/metric, trace the BI dataset → mart → model → staging → source.
- Record a minimal spec: For each node, fill in the spec above. Start with dataset-level lineage; add column-level detail for key metrics (see the sketch after this list).
- Add ownership and SLAs: Document owner and freshness expectations to reduce firefighting.
- Automate where possible: Use your transformation tool’s compiled DAG and your BI’s dataset-to-dashboard mapping to keep docs current.
- Publish and review: Store the spec with your project and review on schema changes or sprint reviews.
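Applied to one dashboard tile, the backtracking and recording steps might yield a pair of entries like the sketch below (all names are assumed; node granularity is up to you):
node_name: lead_conversion_rate_tile
node_type: dashboard
system: BI
owner: marketing-analytics
dependencies:
  - marketing_funnel

node_name: marketing_funnel
node_type: bi_dataset
system: BI
owner: marketing-analytics
dependencies:
  - mart_marketing.agg_lead_conversion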
Common mistakes and self-checks
- Mistake: Stopping at table-level lineage for critical KPIs. Self-check: For your top 5 KPIs, do you know the exact column derivations and filters?
- Mistake: No owner or SLA. Self-check: Can you name a person/team for each mart and BI dataset, and a freshness target?
- Mistake: Hidden filters. Self-check: Are all KPI filters and exclusions documented in business_definition?
- Mistake: Missing change_log. Self-check: Can you tell when and why the metric changed last?
- Mistake: Not documenting join keys. Self-check: Are primary keys and join conditions recorded for major joins?
Practical projects
- Project 1: Choose one executive dashboard. Document end-to-end lineage for 3 tiles using the minimal spec.
- Project 2: Add column-level lineage for one KPI with currency conversion or time-window logic.
- Project 3: Implement a freshness test and display the SLA status in your BI dashboard description.
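If your transformation tool is dbt, Project 3's freshness test can reuse dbt's built-in source freshness check; the source and column names below are assumptions, while the freshness syntax is dbt's own:
version: 2
sources:
  - name: crm
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 2, period: hour}
      error_after: {count: 6, period: hour}
    tables:
      - name: leads_raw
Run it with dbt source freshness and surface the result in the dashboard description.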
Exercises
These exercises mirror the tasks in the page's Exercises section. Work through them here, then check your answers.
Exercise 1 — Map lineage for a revenue KPI
Given nodes: orders_raw, customers_raw → stg_orders, stg_customers → fct_orders → fct_revenue → BI tile "Total Revenue".
Task: List nodes and edges from source to BI, and map the column-level lineage for the KPI value "total_revenue_usd" visible in BI.
Template:
Nodes: [...]
Edges: upstream -> downstream [...]
Column mapping:
  BI.total_revenue_usd <- ...
  joins & filters: ...
SLA/Owner notes: ...
Exercise 2 — Write a one-page lineage spec
For metric "Active Subscribers (28d)", write a YAML-like spec: dependencies, business definition, key columns, tests, and ownership.
- [ ] Document upstream and downstream for each BI tile.
- [ ] Capture owners and freshness SLAs.
- [ ] Note business definitions and filters for each KPI.
- [ ] Include join keys for any joins.
- [ ] Add at least one test per critical node.
Mini challenge
Pick one KPI from your environment and produce a single, screenshot-worthy lineage diagram: 6–10 nodes, arrows from source to BI, and a 3-line business definition. Timebox it to 15 minutes, then share it with your team for feedback.
Learning path
- Start: Document one BI tile end-to-end (dataset-level).
- Next: Add column-level lineage for its key metric.
- Then: Add owners, SLAs, and a freshness test.
- Finally: Repeat for the top 5 KPIs and schedule periodic reviews.
Next steps
- Choose your first dashboard and apply the minimal spec.
- Automate partial lineage using your transformation DAG and BI datasets.
- Add change_log entries when KPIs or upstream sources change.