Why this matters
As a Data Platform Engineer, you enable people to trust and find data. Metadata and lineage show what a dataset is, where it came from, who owns it, and how it changes. Real tasks you will do:
- Auto-scan schemas and column types when new datasets appear.
- Track which jobs and queries produced a table and which dashboards use it.
- Answer impact questions fast: “If we change column X, what breaks?”
- Prove compliance: who accessed PII, how it flowed, and where it’s stored.
Concept explained simply
Metadata is “data about data.” Lineage is the map of how data moves and transforms across systems.
- Technical metadata: schemas, data types, sizes, partitions.
- Operational metadata: job runs, durations, failures, data freshness.
- Business metadata: descriptions, owners, tags, sensitivity (e.g., PII).
- Lineage: nodes (datasets, jobs) and edges (reads/writes). Can be table-level or column-level.
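The node-and-edge view above can be sketched as a tiny in-memory graph. A minimal sketch (dataset and job names are illustrative):

```python
from collections import defaultdict

# Edges point from source to target: a job reads its inputs and writes its outputs.
edges = [
    ("stg.orders", "build_daily_sales"),       # job reads a table
    ("stg.customers", "build_daily_sales"),
    ("build_daily_sales", "rpt.daily_sales"),  # job writes a table
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)
    upstream[dst].add(src)

def walk(start, direction):
    """Collect every node reachable from start, following one adjacency map."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in direction[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Impact analysis: everything that depends on stg.orders.
print(walk("stg.orders", downstream))
```

Keeping two adjacency maps makes both questions cheap: `downstream` answers "what breaks if this changes?" and `upstream` answers "where did this come from?".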
Mental model
Think of your platform as a transit system:
- Stations = datasets.
- Trains = jobs/queries.
- Timetables = schedules and run history.
- Map = lineage graph. You use it to plan routes (impact analysis) and avoid delays (data incidents).
Collection methods
- Pull-based scanners: periodically read catalogs, information_schema, file manifests.
- Push-based emitters: jobs emit events about reads/writes at runtime.
- SQL parsing: extract lineage by parsing CREATE/INSERT/SELECT statements.
- Log and DAG parsing: read orchestration logs or DAG definitions to infer edges.
- Classifiers: detect PII via names/patterns and tag columns.
- Change capture: store versions over time (what changed and when).
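A minimal pull-based scanner might look like this. SQLite's `PRAGMA table_info` stands in for a warehouse's `information_schema.columns`; the table and connection are illustrative:

```python
import sqlite3
import datetime

# Stand-in for a warehouse connection; a real scanner would query
# information_schema.columns instead of SQLite's PRAGMA.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, order_ts TIMESTAMP, amount DECIMAL)")

def scan_table(conn, table):
    """Return technical metadata for one table: column names, types, scan time.
    Note: interpolating the table name is fine for a sketch; real code should
    validate identifiers first."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {
        "name": table,
        # PRAGMA rows are (cid, name, type, notnull, default, pk)
        "columns": [{"name": c[1], "type": c[2]} for c in cols],
        "last_seen": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

meta = scan_table(conn, "orders")
print(meta["columns"])
```

Run nightly, this produces exactly the kind of inventory shown in Example 1 below.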
Glossary
- Entity: a dataset, column, job, or dashboard.
- Edge: a dependency between entities (read/write).
- Run: a single execution of a job or query with start/end times.
- Upstream/Downstream: sources vs consumers of a dataset.
Worked examples
Example 1: Basic technical metadata scan
You run a nightly scanner over two tables. It records schemas and owners.
{
"datasets": [
{
"name": "stg.orders",
"owner": "data.ops",
"columns": [
{"name": "order_id", "type": "INT"},
{"name": "order_ts", "type": "TIMESTAMP"},
{"name": "customer_id", "type": "INT"},
{"name": "amount", "type": "DECIMAL(10,2)"}
],
"last_seen": "2026-01-10T23:50:00Z"
},
{
"name": "stg.customers",
"owner": "data.ops",
"columns": [
{"name": "customer_id", "type": "INT"},
{"name": "email", "type": "STRING", "tags": ["pii"]},
{"name": "country", "type": "STRING"}
],
"last_seen": "2026-01-10T23:50:00Z"
}
]
}
Outcome: A searchable inventory with schemas, timestamps, and sensitivity tags.
Example 2: Column-level lineage from SQL
SQL:
CREATE OR REPLACE TABLE rpt.daily_sales AS
SELECT
  o.order_id,
  o.order_ts::DATE AS order_date,
  o.amount,
  c.country
FROM stg.orders o
JOIN stg.customers c ON o.customer_id = c.customer_id;
Derived lineage edges:
- rpt.daily_sales.order_id <- stg.orders.order_id
- rpt.daily_sales.order_date <- stg.orders.order_ts
- rpt.daily_sales.amount <- stg.orders.amount
- rpt.daily_sales.country <- stg.customers.country
Value: Impact analysis for schema changes and sensitive data propagation.
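A hedged sketch of deriving these edges mechanically: regex-based, handling only simple `alias.column` expressions with optional `::` casts and `AS` renames. This is nowhere near a full SQL parser, but it covers the common pattern:

```python
import re

SQL = """CREATE OR REPLACE TABLE rpt.daily_sales AS
SELECT o.order_id, o.order_ts::DATE AS order_date, o.amount, c.country
FROM stg.orders o JOIN stg.customers c ON o.customer_id = c.customer_id"""

def column_lineage(sql):
    """Map each target column to its source column for simple alias.column
    expressions. Does not handle subqueries, functions, or commas inside
    expressions."""
    target = re.search(r"TABLE\s+([\w.]+)", sql, re.I).group(1)
    # Table aliases from FROM/JOIN clauses, e.g. {"o": "stg.orders"}.
    aliases = dict(
        (alias, table)
        for table, alias in re.findall(r"(?:FROM|JOIN)\s+([\w.]+)\s+(\w+)", sql, re.I)
    )
    select_list = re.search(r"SELECT\s+(.*?)\s+FROM", sql, re.I | re.S).group(1)
    edges = {}
    for item in select_list.split(","):
        m = re.match(r"\s*(\w+)\.(\w+)(?:::\w+)?(?:\s+AS\s+(\w+))?\s*$", item, re.I)
        if m:
            alias, src_col, out = m.groups()
            edges[f"{target}.{out or src_col}"] = f"{aliases[alias]}.{src_col}"
    return edges

print(column_lineage(SQL))
```

Starting with this 80% pattern and iterating, as recommended below, beats building a complete parser up front.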
Example 3: Runtime lineage from a job run
{
"run_id": "2026-01-10-ETL-001",
"job": "build_daily_sales",
"reads": ["stg.orders", "stg.customers"],
"writes": ["rpt.daily_sales"],
"started": "2026-01-10T01:00:00Z",
"ended": "2026-01-10T01:07:32Z",
"status": "success"
}
Value: Freshness and reliability signals. You can overlay runtime runs on static SQL-derived lineage.
Design choices that matter
- Granularity: start with table-level lineage; extend to column-level for critical tables.
- Event vs batch: emit read/write events at runtime for accuracy; batch-scan for coverage.
- Time-awareness: store versions (valid_from, valid_to) to answer “what was true last week?”
- Identity: standardize dataset IDs (platform.schema.table) and job IDs (org.project.task).
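The time-awareness point can be sketched with validity intervals; the records below are illustrative:

```python
# Versioned column records: valid_to=None means "current".
versions = [
    {"column": "stg.orders.amount", "type": "DECIMAL(10,2)",
     "valid_from": "2025-11-01", "valid_to": "2026-01-05"},
    {"column": "stg.orders.amount", "type": "DECIMAL(12,2)",
     "valid_from": "2026-01-05", "valid_to": None},
]

def as_of(records, day):
    """Return the record valid on the given ISO date
    (valid_from inclusive, valid_to exclusive)."""
    for r in records:
        if r["valid_from"] <= day and (r["valid_to"] is None or day < r["valid_to"]):
            return r
    return None

print(as_of(versions, "2025-12-15")["type"])  # DECIMAL(10,2)
```

ISO-8601 date strings compare correctly as plain strings, which keeps the sketch dependency-free; a real store would use proper date columns.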
Minimal metadata schema (suggestion)
{
"dataset": {"id": "string", "platform": "string", "schema": "string", "name": "string"},
"column": {"dataset_id": "string", "name": "string", "type": "string", "tags": ["string"]},
"job": {"id": "string", "name": "string"},
"run": {"id": "string", "job_id": "string", "started": "datetime", "ended": "datetime", "status": "string"},
"edge": {"from_id": "string", "to_id": "string", "level": "table|column", "run_id": "string|null"}
}
Step-by-step starter plan
- Inventory: scan datasets and columns, capture owners and tags.
- Table-level lineage: parse CREATE/INSERT and logs to connect reads to writes.
- Column-level for key tables: add a basic SQL mapper for SELECT aliases and casts.
- Runs and freshness: capture job/query runs to add time and status context.
- Quality signals: attach row counts, null ratios, and test results as metadata.
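The "persist edges with timestamps" part of this plan might look like the following minimal SQLite sketch; the schema and IDs are illustrative:

```python
import sqlite3

# A minimal edge store: table-level edges with a recorded_at timestamp
# so lineage can be audited over time.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE edge (
    from_id TEXT, to_id TEXT, level TEXT, recorded_at TEXT)""")

def record_edge(from_id, to_id, recorded_at, level="table"):
    db.execute("INSERT INTO edge VALUES (?, ?, ?, ?)",
               (from_id, to_id, level, recorded_at))

record_edge("stg.orders", "rpt.daily_sales", "2026-01-10T01:07:32Z")
record_edge("stg.customers", "rpt.daily_sales", "2026-01-10T01:07:32Z")

def upstream_of(dataset_id):
    """Answer 'what feeds this?' at table level."""
    rows = db.execute("SELECT DISTINCT from_id FROM edge WHERE to_id = ?",
                      (dataset_id,)).fetchall()
    return sorted(r[0] for r in rows)

print(upstream_of("rpt.daily_sales"))  # ['stg.customers', 'stg.orders']
```

A symmetric query on `from_id` answers "what does this feed?", which together cover the two questions in the implementation checklist below.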
Exercises
Do these to build confidence.
Exercise 1: Build a tiny data inventory
Create a compact YAML inventory for two tables using the sample schemas below.
Sample schemas
stg.orders(order_id INT, order_ts TIMESTAMP, customer_id INT, amount DECIMAL(10,2))
stg.customers(customer_id INT, email STRING, country STRING)
Include: dataset name, owner, columns (name, type), and tags (mark email as pii).
- Deliverable: inventory.yaml with two datasets.
- Self-check: both tables present, columns correct, pii tag set on email.
Exercise 2: Derive column lineage from SQL
Given the SQL in Example 2, produce a list of column-level lineage mappings for rpt.daily_sales.
- Deliverable: a 4-line mapping as shown in Example 2.
- Self-check: each target column maps to exactly one source column, even when cast or renamed.
Implementation checklist
- Unique IDs for datasets, columns, jobs, and runs.
- Daily scanner job scheduled and logged.
- Lineage edges persisted with timestamps.
- Owners and sensitivity tags populated.
- Ability to answer: “What feeds this?” and “What does this feed?”
Common mistakes
- Only table-level lineage forever: fine to start, but add column-level for key marts.
- No time dimension: without versions, you cannot audit historical states.
- Unreliable IDs: changing names breaks lineage. Normalize IDs and add aliases.
- Ignoring owners: metadata without ownership leads to stale catalogs.
- Over-parsing complex SQL too early: start with 80% patterns (SELECT, JOIN, CAST), then iterate.
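To illustrate the unreliable-IDs point: a normalizer plus an alias map keeps renames from orphaning lineage. This sketch assumes case-insensitive identifiers, which varies by warehouse:

```python
def dataset_id(platform, schema, table):
    """Build a stable dataset ID of the form platform.schema.table.
    Assumes case-insensitive identifiers; adjust for quoted-identifier
    warehouses."""
    parts = (platform, schema, table)
    return ".".join(p.strip().strip('"').lower() for p in parts)

aliases = {}  # old id -> current id, so renames do not orphan lineage

def resolve(dataset):
    """Follow the alias chain to the current canonical ID."""
    seen = set()
    while dataset in aliases and dataset not in seen:
        seen.add(dataset)
        dataset = aliases[dataset]
    return dataset

# A rename is recorded as an alias, not a new disconnected entity.
aliases[dataset_id("wh", "stg", "Orders_v1")] = dataset_id("wh", "stg", "orders")
print(resolve("wh.stg.orders_v1"))  # wh.stg.orders
```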
How to self-check
- Pick one important table. Can you list all upstream tables and their last run times?
- Delete a column in a sandbox and simulate impact. Do dashboards and jobs show up?
- Can you trace a PII field from source to report?
Practical projects
- Project A: Build a nightly metadata scanner for schemas and owners. Store results in a small relational store. Add a command to diff today vs yesterday.
- Project B: Implement a simple SQL lineage parser that handles SELECT, aliases, casts, and JOINs. Output edges as JSON.
- Project C: Add sensitivity classification rules (e.g., column name contains email or ssn) and auto-tag columns. Attach tags to lineage.
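A sketch of Project C's classification rules. The patterns and tag names are illustrative, and real classifiers usually also sample column values rather than trusting names alone:

```python
import re

# Name-based rules: (pattern, tag to attach).
PII_RULES = [
    (re.compile(r"e[-_]?mail", re.I), "pii.email"),
    (re.compile(r"\bssn\b|social[-_]?security", re.I), "pii.ssn"),
    (re.compile(r"phone|mobile", re.I), "pii.phone"),
]

def classify(column_name):
    """Return sensitivity tags whose pattern matches the column name."""
    return [tag for pattern, tag in PII_RULES if pattern.search(column_name)]

print(classify("customer_email"))  # ['pii.email']
print(classify("country"))         # []
```

Once tagged, columns like `customer_email` can be traced along lineage edges to see everywhere the PII propagates.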
Who this is for
- Data Platform Engineers building catalogs and governance features.
- Data Stewards and Governance leads who manage ownership and sensitivity.
Prerequisites
- Comfort with SQL (DDL and DML).
- Basic understanding of batch jobs and scheduling.
- Familiarity with JSON/YAML and REST-style event concepts (optional but helpful).
Learning path
- Inventory metadata (schemas, owners, tags).
- Table-level lineage from SQL and logs.
- Column-level lineage for critical datasets.
- Add runtime runs and freshness signals.
- Automate classification and business glossary tags.
Next steps
- Automate: schedule scans and lineage extraction with notifications on changes.
- Harden: add versioned storage and backfills for historical lineage.
- Adopt: socialize the catalog in your org and teach teams how to request tags, owners, and descriptions.
Mini challenge
You have two new tables and one report:
raw.clicks(click_id, user_id, url, ts)
raw.users(user_id, email, country)

CREATE TABLE rpt.daily_clicks AS
SELECT c.ts::DATE AS day, u.country, COUNT(*) AS clicks
FROM raw.clicks c
JOIN raw.users u ON c.user_id = u.user_id
GROUP BY 1, 2;
Tasks:
- List table-level lineage edges.
- List column-level mappings for day and country.
- Mark any PII and show where it could propagate.
Tip: keep the answers short and explicit. Try to solve without peeking, then validate using the patterns above.