Metadata Collection And Lineage

Learn Metadata Collection And Lineage for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As a Data Platform Engineer, you enable people to trust and find data. Metadata and lineage show what a dataset is, where it came from, who owns it, and how it changes. Real tasks you will do:

  • Auto-scan schemas and column types when new datasets appear.
  • Track which jobs and queries produced a table and which dashboards use it.
  • Answer impact questions fast: “If we change column X, what breaks?”
  • Prove compliance: who accessed PII, how it flowed, and where it’s stored.

Concept explained simply

Metadata is “data about data.” Lineage is the map of how data moves and transforms across systems.

  • Technical metadata: schemas, data types, sizes, partitions.
  • Operational metadata: job runs, durations, failures, data freshness.
  • Business metadata: descriptions, owners, tags, sensitivity (e.g., PII).
  • Lineage: nodes (datasets, jobs) and edges (reads/writes). Can be table-level or column-level.

Mental model

Think of your platform as a transit system:

  • Stations = datasets.
  • Trains = jobs/queries.
  • Timetables = schedules and run history.
  • Map = lineage graph. You use it to plan routes (impact analysis) and avoid delays (data incidents).

Collection methods

  • Pull-based scanners: periodically read catalogs, information_schema, or file manifests (see the sketch after this list).
  • Push-based emitters: jobs emit events about reads/writes at runtime.
  • SQL parsing: extract lineage by parsing CREATE/INSERT/SELECT statements.
  • Log and DAG parsing: read orchestration logs or DAG definitions to infer edges.
  • Classifiers: detect PII via names/patterns and tag columns.
  • Change capture: store versions over time (what changed and when).
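
A minimal pull-based scanner can be a few lines over information_schema. The sketch below is an illustration, not a product: it assumes a DB-API connection to a warehouse that exposes information_schema.columns and uses Postgres-style %s placeholders; the connection object is a placeholder for your driver.

from datetime import datetime, timezone

def scan_columns(conn, schema: str) -> list[dict]:
    """Pull one metadata record per column in the given schema."""
    cur = conn.cursor()
    cur.execute(
        "SELECT table_schema, table_name, column_name, data_type "
        "FROM information_schema.columns "
        "WHERE table_schema = %s "
        "ORDER BY table_name, ordinal_position",
        (schema,),
    )
    seen_at = datetime.now(timezone.utc).isoformat()
    return [
        {"dataset": f"{s}.{t}", "column": c, "type": d, "last_seen": seen_at}
        for (s, t, c, d) in cur.fetchall()
    ]

# records = scan_columns(my_warehouse_connection, "stg")  # hypothetical connection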

Glossary

  • Entity: a dataset, column, job, or dashboard.
  • Edge: a dependency between entities (read/write).
  • Run: a single execution of a job or query with start/end times.
  • Upstream/Downstream: sources vs consumers of a dataset.

Worked examples

Example 1: Basic technical metadata scan

You run a nightly scanner over two tables. It records schemas and owners.

{
  "datasets": [
    {
      "name": "stg.orders",
      "owner": "data.ops",
      "columns": [
        {"name": "order_id", "type": "INT"},
        {"name": "order_ts", "type": "TIMESTAMP"},
        {"name": "customer_id", "type": "INT"},
        {"name": "amount", "type": "DECIMAL(10,2)"}
      ],
      "last_seen": "2026-01-10T23:50:00Z"
    },
    {
      "name": "stg.customers",
      "owner": "data.ops",
      "columns": [
        {"name": "customer_id", "type": "INT"},
        {"name": "email", "type": "STRING", "tags": ["pii"]},
        {"name": "country", "type": "STRING"}
      ],
      "last_seen": "2026-01-10T23:50:00Z"
    }
  ]
}

Outcome: A searchable inventory with schemas, timestamps, and sensitivity tags.
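
Because each scan is a snapshot, change capture falls out of comparing two of them (the "Change capture" method above). A minimal diff sketch, assuming snapshots shaped like the JSON above; only column names are compared here, but types and tags work the same way.

def column_names(dataset: dict) -> set[str]:
    """Collect the column names recorded in one dataset snapshot."""
    return {c["name"] for c in dataset["columns"]}

def diff(old: dict, new: dict) -> dict:
    """Report columns added to or removed from a dataset between scans."""
    before, after = column_names(old), column_names(new)
    return {"added": sorted(after - before), "removed": sorted(before - after)}

old = {"name": "stg.orders", "columns": [{"name": "order_id"}, {"name": "amount"}]}
new = {"name": "stg.orders", "columns": [{"name": "order_id"}, {"name": "amount"}, {"name": "discount"}]}
print(diff(old, new))  # {'added': ['discount'], 'removed': []}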

Example 2: Column-level lineage from SQL

SQL:

CREATE OR REPLACE TABLE rpt.daily_sales AS
SELECT 
  o.order_id,
  o.order_ts::DATE AS order_date,
  o.amount,
  c.country
FROM stg.orders o
JOIN stg.customers c ON o.customer_id = c.customer_id;

Derived lineage edges:

  • rpt.daily_sales.order_id <- stg.orders.order_id
  • rpt.daily_sales.order_date <- stg.orders.order_ts
  • rpt.daily_sales.amount <- stg.orders.amount
  • rpt.daily_sales.country <- stg.customers.country

Value: Impact analysis for schema changes and sensitive data propagation.
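
One way to derive these edges automatically is with an off-the-shelf SQL parser. Below is a minimal sketch using the open-source sqlglot library (assumed installed via pip install sqlglot). It handles plain columns, aliases, and casts, as in Example 2, but not CTEs, subqueries, or expressions that combine several source columns.

import sqlglot
from sqlglot import exp

sql = """
CREATE OR REPLACE TABLE rpt.daily_sales AS
SELECT
  o.order_id,
  o.order_ts::DATE AS order_date,
  o.amount,
  c.country
FROM stg.orders o
JOIN stg.customers c ON o.customer_id = c.customer_id
"""

parsed = sqlglot.parse_one(sql)
target = parsed.this.sql()   # "rpt.daily_sales", the CREATE target
select = parsed.expression   # the inner SELECT

# Resolve table aliases (o, c) back to fully qualified names.
aliases = {t.alias or t.name: f"{t.db}.{t.name}" for t in select.find_all(exp.Table)}

# Emit one edge per source column feeding each projected column.
for projection in select.expressions:
    out_col = projection.alias_or_name
    for col in projection.find_all(exp.Column):
        src_table = aliases.get(col.table, col.table)
        print(f"{target}.{out_col} <- {src_table}.{col.name}")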

Example 3: Runtime lineage from a job run

{
  "run_id": "2026-01-10-ETL-001",
  "job": "build_daily_sales",
  "reads": ["stg.orders", "stg.customers"],
  "writes": ["rpt.daily_sales"],
  "started": "2026-01-10T01:00:00Z",
  "ended": "2026-01-10T01:07:32Z",
  "status": "success"
}

Value: Freshness and reliability signals. You can overlay runtime runs on static SQL-derived lineage.
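
Push-based emission is the mirror image: the job sends this event itself when it finishes instead of waiting for a scanner. A minimal sketch follows; the collector URL is a made-up placeholder, and in practice you would target your catalog's ingestion API (the OpenLineage spec standardizes events of this shape).

import requests

event = {
    "run_id": "2026-01-10-ETL-001",
    "job": "build_daily_sales",
    "reads": ["stg.orders", "stg.customers"],
    "writes": ["rpt.daily_sales"],
    "started": "2026-01-10T01:00:00Z",
    "ended": "2026-01-10T01:07:32Z",
    "status": "success",
}

# Hypothetical endpoint; keep the timeout short so metadata emission
# never blocks or fails the job that produced the data.
requests.post("https://metadata.example.com/api/runs", json=event, timeout=5)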

Design choices that matter

  • Granularity: start with table-level lineage; extend to column-level for critical tables.
  • Event vs batch: emit read/write events at runtime for accuracy; batch-scan for coverage.
  • Time-awareness: store versions (valid_from, valid_to) to answer “what was true last week?” (sketched after this list).
  • Identity: standardize dataset IDs (platform.schema.table) and job IDs (org.project.task).
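
For the time-awareness point, the core discipline is append-only versioning: never overwrite a record; close its validity window and open a new one. A toy in-memory sketch using the valid_from/valid_to convention above (values are illustrative):

from datetime import datetime, timezone

versions = [
    {"dataset": "stg.orders", "owner": "data.ops",
     "valid_from": "2026-01-01T00:00:00Z", "valid_to": None},
]

def record_owner_change(dataset: str, new_owner: str) -> None:
    """Close the open version for this dataset, then append the new one."""
    now = datetime.now(timezone.utc).isoformat()
    for v in versions:
        if v["dataset"] == dataset and v["valid_to"] is None:
            v["valid_to"] = now  # close the current version
    versions.append({"dataset": dataset, "owner": new_owner,
                     "valid_from": now, "valid_to": None})

record_owner_change("stg.orders", "sales.eng")
# "What was true last week?" = filter valid_from <= t AND (valid_to IS NULL OR t < valid_to)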

Minimal metadata schema (suggestion)

{
  "dataset": {"id": "string", "platform": "string", "schema": "string", "name": "string"},
  "column": {"dataset_id": "string", "name": "string", "type": "string", "tags": ["string"]},
  "job": {"id": "string", "name": "string"},
  "run": {"id": "string", "job_id": "string", "started": "datetime", "ended": "datetime", "status": "string"},
  "edge": {"from_id": "string", "to_id": "string", "level": "table|column", "run_id": "string|null"}
}
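
To make the schema concrete, here is a sketch that persists datasets and edges in SQLite and answers “What feeds this?” with a recursive query. The IDs follow the platform.schema.table convention above; “wh” is a made-up platform name, and schema is stored as schema_name to avoid keyword clashes in some databases.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dataset (id TEXT PRIMARY KEY, platform TEXT, schema_name TEXT, name TEXT);
CREATE TABLE edge (from_id TEXT, to_id TEXT, level TEXT, run_id TEXT);
""")
con.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", [
    ("wh.stg.orders", "wh", "stg", "orders"),
    ("wh.stg.customers", "wh", "stg", "customers"),
    ("wh.rpt.daily_sales", "wh", "rpt", "daily_sales"),
])
con.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)", [
    ("wh.stg.orders", "wh.rpt.daily_sales", "table", None),
    ("wh.stg.customers", "wh.rpt.daily_sales", "table", None),
])

# Walk upstream from a dataset: everything that transitively feeds it.
upstream = con.execute("""
WITH RECURSIVE up(id) AS (
  SELECT from_id FROM edge WHERE to_id = ?
  UNION
  SELECT e.from_id FROM edge e JOIN up ON e.to_id = up.id
)
SELECT id FROM up
""", ("wh.rpt.daily_sales",)).fetchall()
print(upstream)  # e.g. [('wh.stg.orders',), ('wh.stg.customers',)]

Swapping from_id and to_id in both SELECTs answers the downstream question, “What does this feed?”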

Step-by-step starter plan

  1. Inventory: scan datasets and columns, capture owners and tags (a toy classifier sketch follows this list).
  2. Table-level lineage: parse CREATE/INSERT and logs to connect reads to writes.
  3. Column-level for key tables: add a basic SQL mapper for SELECT aliases and casts.
  4. Runs and freshness: capture job/query runs to add time and status context.
  5. Quality signals: attach row counts, null ratios, and test results as metadata.
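
For the tags in step 1 (and the Classifiers method earlier), name-based rules get you surprisingly far. A toy sketch; the pattern list is illustrative, and real classifiers also sample column values.

import re

# Illustrative patterns only; extend per your data and compliance rules.
PII_PATTERNS = [r"email", r"ssn", r"phone", r"address"]

def classify(column_name: str) -> list[str]:
    """Return sensitivity tags for a column based on its name."""
    if any(re.search(p, column_name, re.IGNORECASE) for p in PII_PATTERNS):
        return ["pii"]
    return []

print(classify("email"))    # ['pii']
print(classify("country"))  # []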

Exercises

Do these to build confidence.

Exercise 1: Build a tiny data inventory

Create a compact YAML inventory for two tables using the sample schemas below.

Sample schemas
stg.orders(
  order_id INT,
  order_ts TIMESTAMP,
  customer_id INT,
  amount DECIMAL(10,2)
)

stg.customers(
  customer_id INT,
  email STRING,
  country STRING
)

Include: dataset name, owner (use data.ops), columns (name, type), and tags (mark email as pii).

  • Deliverable: inventory.yaml with two datasets.
  • Self-check: both tables present, columns correct, pii tag set on email.

Exercise 2: Derive column lineage from SQL

Given the SQL in Example 2, produce a list of column-level lineage mappings for rpt.daily_sales.

  • Deliverable: a 4-line mapping as shown in Example 2.
  • Self-check: each target column maps to exactly one source column, even when cast or renamed.

Implementation checklist

  • Unique IDs for datasets, columns, jobs, and runs.
  • Daily scanner job scheduled and logged.
  • Lineage edges persisted with timestamps.
  • Owners and sensitivity tags populated.
  • Ability to answer: “What feeds this?” and “What does this feed?”

Common mistakes

  • Only table-level lineage forever: fine to start, but add column-level for key marts.
  • No time dimension: without versions, you cannot audit historical states.
  • Unreliable IDs: changing names breaks lineage. Normalize IDs and add aliases.
  • Ignoring owners: metadata without ownership leads to stale catalogs.
  • Over-parsing complex SQL too early: start with 80% patterns (SELECT, JOIN, CAST), then iterate.

How to self-check

  • Pick one important table. Can you list all upstream tables and their last run times?
  • Delete a column in a sandbox and simulate impact. Do dashboards and jobs show up?
  • Can you trace a PII field from source to report?

Practical projects

  • Project A: Build a nightly metadata scanner for schemas and owners. Store results in a small relational store. Add a command to diff today vs yesterday.
  • Project B: Implement a simple SQL lineage parser that handles SELECT, aliases, casts, and JOINs. Output edges as JSON.
  • Project C: Add sensitivity classification rules (e.g., column name contains email or ssn) and auto-tag columns. Attach tags to lineage.

Who this is for

  • Data Platform Engineers building catalogs and governance features.
  • Data Stewards and Governance leads who manage ownership and sensitivity.

Prerequisites

  • Comfort with SQL (DDL and DML).
  • Basic understanding of batch jobs and scheduling.
  • Familiarity with JSON/YAML and REST-style event concepts (optional but helpful).

Learning path

  1. Inventory metadata (schemas, owners, tags).
  2. Table-level lineage from SQL and logs.
  3. Column-level lineage for critical datasets.
  4. Add runtime runs and freshness signals.
  5. Automate classification and business glossary tags.

Next steps

  • Automate: schedule scans and lineage extraction with notifications on changes.
  • Harden: add versioned storage and backfills for historical lineage.
  • Adopt: socialize it in your org; teach teams how to request tags, owners, and descriptions.

Mini challenge

You have two new tables and one report:

raw.clicks(click_id, user_id, url, ts)
raw.users(user_id, email, country)

CREATE TABLE rpt.daily_clicks AS
SELECT c.ts::DATE AS day, u.country, COUNT(*) AS clicks
FROM raw.clicks c JOIN raw.users u ON c.user_id = u.user_id
GROUP BY 1,2;

Tasks:

  • List table-level lineage edges.
  • List column-level mappings for day and country.
  • Mark any PII and show where it could propagate.

Tip: keep the answers short and explicit. Try to solve without peeking, then validate using the patterns above.

Metadata Collection And Lineage — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
