Why this matters
As a Data Platform Engineer, you enable people to trust and find data. Metadata and lineage show what a dataset is, where it came from, who owns it, and how it changes. Real tasks you will do:
- Auto-scan schemas and column types when new datasets appear.
- Track which jobs and queries produced a table and which dashboards use it.
- Answer impact questions fast: “If we change column X, what breaks?”
- Prove compliance: who accessed PII, how it flowed, and where it’s stored.
Concept explained simply
Metadata is “data about data.” Lineage is the map of how data moves and transforms across systems.
- Technical metadata: schemas, data types, sizes, partitions.
- Operational metadata: job runs, durations, failures, data freshness.
- Business metadata: descriptions, owners, tags, sensitivity (e.g., PII).
- Lineage: nodes (datasets, jobs) and edges (reads/writes). Can be table-level or column-level.
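The node-and-edge view above can be sketched as a tiny in-memory graph. A minimal sketch (dataset and job names are illustrative):

```python
from collections import defaultdict

# Edges point from source to target: a job reads its inputs and writes its outputs.
edges = [
    ("stg.orders", "build_daily_sales"),       # job reads a table
    ("stg.customers", "build_daily_sales"),
    ("build_daily_sales", "rpt.daily_sales"),  # job writes a table
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)
    upstream[dst].add(src)

def walk(start, direction):
    """Collect every node reachable from start, following one adjacency map."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in direction[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Impact analysis: everything that depends on stg.orders.
print(walk("stg.orders", downstream))
```

Keeping two adjacency maps makes both questions cheap: `downstream` answers "what breaks if this changes?" and `upstream` answers "where did this come from?".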
Mental model
Think of your platform as a transit system:
- Stations = datasets.
- Trains = jobs/queries.
- Timetables = schedules and run history.
- Map = lineage graph. You use it to plan routes (impact analysis) and avoid delays (data incidents).
Collection methods
- Pull-based scanners: periodically read catalogs, information_schema, file manifests.
- Push-based emitters: jobs emit events about reads/writes at runtime.
- SQL parsing: extract lineage by parsing CREATE/INSERT/SELECT statements.
- Log and DAG parsing: read orchestration logs or DAG definitions to infer edges.
- Classifiers: detect PII via names/patterns and tag columns.
- Change capture: store versions over time (what changed and when).
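A minimal pull-based scanner might look like this. SQLite's `PRAGMA table_info` stands in for a warehouse's `information_schema.columns`; the table and connection are illustrative:

```python
import sqlite3
import datetime

# Stand-in for a warehouse connection; a real scanner would query
# information_schema.columns instead of SQLite's PRAGMA.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, order_ts TIMESTAMP, amount DECIMAL)")

def scan_table(conn, table):
    """Return technical metadata for one table: column names, types, scan time.
    Note: interpolating the table name is fine for a sketch; real code should
    validate identifiers first."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {
        "name": table,
        # PRAGMA rows are (cid, name, type, notnull, default, pk)
        "columns": [{"name": c[1], "type": c[2]} for c in cols],
        "last_seen": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

meta = scan_table(conn, "orders")
print(meta["columns"])
```

Run nightly, this produces exactly the kind of inventory shown in Example 1 below.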
Glossary
- Entity: a dataset, column, job, or dashboard.
- Edge: a dependency between entities (read/write).
- Run: a single execution of a job or query with start/end times.
- Upstream/Downstream: sources vs consumers of a dataset.
Worked examples
Example 1: Basic technical metadata scan
You run a nightly scanner over two tables. It records schemas and owners.
{
"datasets": [
{
"name": "stg.orders",
"owner": "data.ops",
"columns": [
{"name": "order_id", "type": "INT"},
{"name": "order_ts", "type": "TIMESTAMP"},
{"name": "customer_id", "type": "INT"},
{"name": "amount", "type": "DECIMAL(10,2)"}
],
"last_seen": "2026-01-10T23:50:00Z"
},
{
"name": "stg.customers",
"owner": "data.ops",
"columns": [
{"name": "customer_id", "type": "INT"},
{"name": "email", "type": "STRING", "tags": ["pii"]},
{"name": "country", "type": "STRING"}
],
"last_seen": "2026-01-10T23:50:00Z"
}
]
}
Outcome: A searchable inventory with schemas, timestamps, and sensitivity tags.
Example 2: Column-level lineage from SQL
SQL:
CREATE OR REPLACE TABLE rpt.daily_sales AS
SELECT
  o.order_id,
  o.order_ts::DATE AS order_date,
  o.amount,
  c.country
FROM stg.orders o
JOIN stg.customers c ON o.customer_id = c.customer_id;
Derived lineage edges:
- rpt.daily_sales.order_id <- stg.orders.order_id
- rpt.daily_sales.order_date <- stg.orders.order_ts
- rpt.daily_sales.amount <- stg.orders.amount
- rpt.daily_sales.country <- stg.customers.country
Value: Impact analysis for schema changes and sensitive data propagation.
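A hedged sketch of deriving these edges mechanically: regex-based, handling only simple `alias.column` expressions with optional `::` casts and `AS` renames. This is nowhere near a full SQL parser, but it covers the common pattern:

```python
import re

SQL = """CREATE OR REPLACE TABLE rpt.daily_sales AS
SELECT o.order_id, o.order_ts::DATE AS order_date, o.amount, c.country
FROM stg.orders o JOIN stg.customers c ON o.customer_id = c.customer_id"""

def column_lineage(sql):
    """Map each target column to its source column for simple alias.column
    expressions. Does not handle subqueries, functions, or commas inside
    expressions."""
    target = re.search(r"TABLE\s+([\w.]+)", sql, re.I).group(1)
    # Table aliases from FROM/JOIN clauses, e.g. {"o": "stg.orders"}.
    aliases = dict(
        (alias, table)
        for table, alias in re.findall(r"(?:FROM|JOIN)\s+([\w.]+)\s+(\w+)", sql, re.I)
    )
    select_list = re.search(r"SELECT\s+(.*?)\s+FROM", sql, re.I | re.S).group(1)
    edges = {}
    for item in select_list.split(","):
        m = re.match(r"\s*(\w+)\.(\w+)(?:::\w+)?(?:\s+AS\s+(\w+))?\s*$", item, re.I)
        if m:
            alias, src_col, out = m.groups()
            edges[f"{target}.{out or src_col}"] = f"{aliases[alias]}.{src_col}"
    return edges

print(column_lineage(SQL))
```

Starting with this 80% pattern and iterating, as recommended below, beats building a complete parser up front.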
Example 3: Runtime lineage from a job run
{
"run_id": "2026-01-10-ETL-001",
"job": "build_daily_sales",
"reads": ["stg.orders", "stg.customers"],
"writes": ["rpt.daily_sales"],
"started": "2026-01-10T01:00:00Z",
"ended": "2026-01-10T01:07:32Z",
"status": "success"
}
Value: Freshness and reliability signals. You can overlay runtime runs on static SQL-derived lineage.
Design choices that matter
- Granularity: start with table-level lineage; extend to column-level for critical tables.
- Event vs batch: emit read/write events at runtime for accuracy; batch-scan for coverage.
- Time-awareness: store versions (valid_from, valid_to) to answer “what was true last week?”
- Identity: standardize dataset IDs (platform.schema.table) and job IDs (org.project.task).
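The time-awareness point can be sketched with validity intervals; the records below are illustrative:

```python
# Versioned column records: valid_to=None means "current".
versions = [
    {"column": "stg.orders.amount", "type": "DECIMAL(10,2)",
     "valid_from": "2025-11-01", "valid_to": "2026-01-05"},
    {"column": "stg.orders.amount", "type": "DECIMAL(12,2)",
     "valid_from": "2026-01-05", "valid_to": None},
]

def as_of(records, day):
    """Return the record valid on the given ISO date
    (valid_from inclusive, valid_to exclusive)."""
    for r in records:
        if r["valid_from"] <= day and (r["valid_to"] is None or day < r["valid_to"]):
            return r
    return None

print(as_of(versions, "2025-12-15")["type"])  # DECIMAL(10,2)
```

ISO-8601 date strings compare correctly as plain strings, which keeps the sketch dependency-free; a real store would use proper date columns.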
Minimal metadata schema (suggestion)
{
"dataset": {"id": "string", "platform": "string", "schema": "string", "name": "string"},
"column": {"dataset_id": "string", "name": "string", "type": "string", "tags": ["string"]},
"job": {"id": "string", "name": "string"},
"run": {"id": "string", "job_id": "string", "started": "datetime", "ended": "datetime", "status": "string"},
"edge": {"from_id": "string", "to_id": "string", "level": "table|column", "run_id": "string|null"}
}
Step-by-step starter plan
- Inventory: scan datasets and columns, capture owners and tags.
- Table-level lineage: parse CREATE/INSERT and logs to connect reads to writes.
- Column-level for key tables: add a basic SQL mapper for SELECT aliases and casts.
- Runs and freshness: capture job/query runs to add time and status context.
- Quality signals: attach row counts, null ratios, and test results as metadata.
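The "persist edges with timestamps" part of this plan might look like the following minimal SQLite sketch; the schema and IDs are illustrative:

```python
import sqlite3

# A minimal edge store: table-level edges with a recorded_at timestamp
# so lineage can be audited over time.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE edge (
    from_id TEXT, to_id TEXT, level TEXT, recorded_at TEXT)""")

def record_edge(from_id, to_id, recorded_at, level="table"):
    db.execute("INSERT INTO edge VALUES (?, ?, ?, ?)",
               (from_id, to_id, level, recorded_at))

record_edge("stg.orders", "rpt.daily_sales", "2026-01-10T01:07:32Z")
record_edge("stg.customers", "rpt.daily_sales", "2026-01-10T01:07:32Z")

def upstream_of(dataset_id):
    """Answer 'what feeds this?' at table level."""
    rows = db.execute("SELECT DISTINCT from_id FROM edge WHERE to_id = ?",
                      (dataset_id,)).fetchall()
    return sorted(r[0] for r in rows)

print(upstream_of("rpt.daily_sales"))  # ['stg.customers', 'stg.orders']
```

A symmetric query on `from_id` answers "what does this feed?", which together cover the two questions in the implementation checklist below.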
Exercises
Do these to build confidence.
Exercise 1: Build a tiny data inventory
Create a compact YAML inventory for two tables using the sample schemas below.
Sample schemas
stg.orders(order_id INT, order_ts TIMESTAMP, customer_id INT, amount DECIMAL(10,2))
stg.customers(customer_id INT, email STRING, country STRING)
Include: dataset name, owner, columns (name, type), and tags (mark email as pii).
- Deliverable: inventory.yaml with two datasets.
- Self-check: both tables present, columns correct, pii tag set on email.
Exercise 2: Derive column lineage from SQL
Given the SQL in Example 2, produce a list of column-level lineage mappings for rpt.daily_sales.
- Deliverable: a 4-line mapping as shown in Example 2.
- Self-check: each target column maps to exactly one source column, even when cast or renamed.
Implementation checklist
- Unique IDs for datasets, columns, jobs, and runs.
- Daily scanner job scheduled and logged.
- Lineage edges persisted with timestamps.
- Owners and sensitivity tags populated.
- Ability to answer: “What feeds this?” and “What does this feed?”
Common mistakes
- Only table-level lineage forever: fine to start, but add column-level for key marts.
- No time dimension: without versions, you cannot audit historical states.
- Unreliable IDs: changing names breaks lineage. Normalize IDs and add aliases.
- Ignoring owners: metadata without ownership leads to stale catalogs.
- Over-parsing complex SQL too early: start with 80% patterns (SELECT, JOIN, CAST), then iterate.
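To illustrate the unreliable-IDs point: a normalizer plus an alias map keeps renames from orphaning lineage. This sketch assumes case-insensitive identifiers, which varies by warehouse:

```python
def dataset_id(platform, schema, table):
    """Build a stable dataset ID of the form platform.schema.table.
    Assumes case-insensitive identifiers; adjust for quoted-identifier
    warehouses."""
    parts = (platform, schema, table)
    return ".".join(p.strip().strip('"').lower() for p in parts)

aliases = {}  # old id -> current id, so renames do not orphan lineage

def resolve(dataset):
    """Follow the alias chain to the current canonical ID."""
    seen = set()
    while dataset in aliases and dataset not in seen:
        seen.add(dataset)
        dataset = aliases[dataset]
    return dataset

# A rename is recorded as an alias, not a new disconnected entity.
aliases[dataset_id("wh", "stg", "Orders_v1")] = dataset_id("wh", "stg", "orders")
print(resolve("wh.stg.orders_v1"))  # wh.stg.orders
```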
How to self-check
- Pick one important table. Can you list all upstream tables and their last run times?
- Delete a column in a sandbox and simulate impact. Do dashboards and jobs show up?
- Can you trace a PII field from source to report?
Practical projects
- Project A: Build a nightly metadata scanner for schemas and owners. Store results in a small relational store. Add a command to diff today vs yesterday.
- Project B: Implement a simple SQL lineage parser that handles SELECT, aliases, casts, and JOINs. Output edges as JSON.
- Project C: Add sensitivity classification rules (e.g., column name contains email or ssn) and auto-tag columns. Attach tags to lineage.
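A sketch of Project C's classification rules. The patterns and tag names are illustrative, and real classifiers usually also sample column values rather than trusting names alone:

```python
import re

# Name-based rules: (pattern, tag to attach).
PII_RULES = [
    (re.compile(r"e[-_]?mail", re.I), "pii.email"),
    (re.compile(r"\bssn\b|social[-_]?security", re.I), "pii.ssn"),
    (re.compile(r"phone|mobile", re.I), "pii.phone"),
]

def classify(column_name):
    """Return sensitivity tags whose pattern matches the column name."""
    return [tag for pattern, tag in PII_RULES if pattern.search(column_name)]

print(classify("customer_email"))  # ['pii.email']
print(classify("country"))         # []
```

Once tagged, columns like `customer_email` can be traced along lineage edges to see everywhere the PII propagates.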
Who this is for
- Data Platform Engineers building catalogs and governance features.
- Data Stewards and Governance leads who manage ownership and sensitivity.
Prerequisites
- Comfort with SQL (DDL and DML).
- Basic understanding of batch jobs and scheduling.
- Familiarity with JSON/YAML and REST-style event concepts (optional but helpful).
Learning path
- Inventory metadata (schemas, owners, tags).
- Table-level lineage from SQL and logs.
- Column-level lineage for critical datasets.
- Add runtime runs and freshness signals.
- Automate classification and business glossary tags.
Next steps
- Automate: schedule scans and lineage extraction with notifications on changes.
- Harden: add versioned storage and backfills for historical lineage.
- Adopt: socialize the catalog in your org and teach teams how to request tags, owners, and descriptions.
Mini challenge
You have two new tables and one report:
raw.clicks(click_id, user_id, url, ts)
raw.users(user_id, email, country)

CREATE TABLE rpt.daily_clicks AS
SELECT c.ts::DATE AS day, u.country, COUNT(*) AS clicks
FROM raw.clicks c
JOIN raw.users u ON c.user_id = u.user_id
GROUP BY 1, 2;
Tasks:
- List table-level lineage edges.
- List column-level mappings for day and country.
- Mark any PII and show where it could propagate.
Tip: keep the answers short and explicit. Try to solve without peeking, then validate using the patterns above.