Technical Metadata Collection

Learn Technical Metadata Collection for free with explanations, exercises, and a quick test (for Data Architects).

Published: January 18, 2026 | Updated: January 18, 2026

Who this is for

  • Data Architects defining enterprise data platforms and governance.
  • Data Engineers implementing catalogs, lineage, and observability pipelines.

Prerequisites

  • Comfort with SQL and basic DB concepts (schemas, tables, columns, constraints).
  • Familiarity with at least one data processing stack (e.g., Spark, Airflow, dbt, cloud warehouses).
  • Basic understanding of data lineage concepts (datasets, jobs, inputs, outputs).

Why this matters

Technical metadata tells you what exists, where it lives, how it was produced, and how it is changing. As a Data Architect, you will:

  • Enable impact analysis by tracking which dashboards break if a column changes.
  • Reduce audit risk by proving data origins and transformations.
  • Improve developer speed with a trustworthy catalog and auto-discovery.
  • Support cost and performance tuning using runtime stats and job parameters.

Concept explained simply

Technical metadata is the facts about your data assets and pipelines: schemas, locations, partitions, job configs, runtime stats, and dependencies. It is not the business meaning of data—that’s business metadata—but the technical truth that tools can collect automatically.

Mental model

Think of your platform as a city:

  • Datasets are buildings (tables, files, views).
  • Fields are rooms (columns, data types, constraints).
  • Processes are roads and utilities (ETL/ELT jobs, queries).
  • Runs are scheduled traffic (each execution with timings and status).

Technical metadata is the city map plus traffic telemetry collected automatically.

Key types of technical metadata

  • Storage and location: system, database, schema, table, file paths, partitions.
  • Schema: columns, data types, nullability, constraints, primary/foreign keys.
  • Execution: job definitions, parameters, code version, owners.
  • Runtime: start/end, status, row counts, bytes read/written, cluster size.
  • Lineage: inputs, outputs, and column-level mappings where possible.

Collection strategies

  • Pull collection: Scheduled scans of information_schema, system catalogs, cloud APIs, or orchestration logs (see the crawler sketch after this list).
  • Push collection: Emit events at job runtime from pipelines (e.g., on job start/end).
  • Active vs passive: Active collection instruments jobs to emit events; passive collection parses logs and query history after the fact.
  • Frequency: Real-time for runs; hourly/daily for schemas; weekly for slow-changing systems.
  • Scope and filtering: Only collect selected databases, schemas, or prefixes to avoid noise.
  • Incremental updates: Use change timestamps, version columns, or run IDs to avoid full rescans.
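
To make the pull strategy concrete, here is a minimal sketch of a scoped, pull-based schema crawler. It assumes a DB-API 2.0 connection to a warehouse that exposes information_schema and uses the %s parameter style (as in psycopg2); the INCLUDE_SCHEMAS list and the crawl_schemas() function are illustrative names, not a real library API.
# Scoped, pull-based schema crawler (illustrative sketch, not a library API).
# Assumes a DB-API 2.0 connection with the %s parameter style (e.g., psycopg2).
INCLUDE_SCHEMAS = ("analytics", "core")  # scope filter: include-only

def crawl_schemas(conn):
    """Pull tables and columns for the scoped schemas and normalize them."""
    placeholders = ", ".join(["%s"] * len(INCLUDE_SCHEMAS))
    cur = conn.cursor()
    cur.execute(
        "SELECT table_schema, table_name "
        "FROM information_schema.tables "
        f"WHERE table_type = 'BASE TABLE' AND table_schema IN ({placeholders})",
        INCLUDE_SCHEMAS,
    )
    datasets = []
    for schema, table in cur.fetchall():
        # Incremental variant: add a predicate on a last_altered/updated_at
        # column where the warehouse exposes one, instead of full rescans.
        cur.execute(
            "SELECT column_name, data_type, is_nullable "
            "FROM information_schema.columns "
            "WHERE table_schema = %s AND table_name = %s "
            "ORDER BY ordinal_position",
            (schema, table),
        )
        fields = [
            {"name": name, "type": dtype, "nullable": nullable == "YES"}
            for name, dtype, nullable in cur.fetchall()
        ]
        datasets.append({
            "dataset": {"platform": "warehouse", "namespace": schema, "name": table},
            "schema_facet": {"fields": fields},
        })
    return datasets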

Data model and standardization

Unify collected metadata into a common model so different tools can interoperate. A practical neutral model includes the following entities (a code sketch of the model follows the facet list below):

  • DataSet: name, platform, namespace, physical location, facets (e.g., schema, storage).
  • Field: dataset, field name, type, nullable, constraints, tags.
  • Process: logical job (e.g., daily_sales_aggregation) with code location and owner.
  • Run: one execution with start, end, status, metrics.
  • Lineage: inputs, outputs, optional column mappings.
Example neutral schema facets
  • schema_facet: list of fields {name, type, nullable, description}
  • storage_facet: {location, format, partitioning}
  • ownership_facet: {team, owner}
  • runtime_facet: {engine, version, cluster}
  • stats_facet: {rowCount, bytes, executionMs}
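
As one way to make this neutral model concrete, the sketch below expresses the entities as Python dataclasses. The class and attribute names simply mirror the entities above and are not a standard; in practice you would map them onto your catalog's native model (for example, OpenLineage events or DataHub entities).
# Neutral metadata model as dataclasses (illustrative; names are not a standard).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Field:
    name: str
    type: str
    nullable: bool = True
    description: Optional[str] = None

@dataclass
class DataSet:
    platform: str                                 # e.g., "warehouse", "s3"
    namespace: str                                # database/schema or bucket prefix
    name: str                                     # table or file set name
    fields: list[Field] = field(default_factory=list)
    facets: dict = field(default_factory=dict)    # storage, ownership, stats, ...

@dataclass
class Process:
    name: str                                     # logical job, e.g. "daily_sales_aggregation"
    code_location: Optional[str] = None
    owner: Optional[str] = None

@dataclass
class Run:
    process: str
    run_id: str
    status: str                                   # RUNNING / COMPLETED / FAILED
    started_at: Optional[str] = None
    ended_at: Optional[str] = None
    metrics: dict = field(default_factory=dict)   # rowCount, bytes, executionMs

@dataclass
class Lineage:
    run_id: str
    inputs: list[str]                             # dataset identifiers, e.g. "warehouse:core.orders"
    outputs: list[str]
    column_mappings: dict = field(default_factory=dict)  # output column -> [input columns]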

Worked examples

Example 1: Pull schema from a relational warehouse

Goal: Capture tables and columns from a warehouse via information_schema.

Steps
  1. List tables from information_schema.tables filtered by target schemas.
  2. Fetch columns from information_schema.columns for those tables.
  3. Normalize into DataSet and Field records.
Illustrative queries
-- Tables
SELECT table_catalog, table_schema, table_name
FROM information_schema.tables
WHERE table_type='BASE TABLE' AND table_schema IN ('analytics','core');

-- Columns
SELECT table_schema, table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema IN ('analytics','core');
Resulting normalized records (excerpt)
{
  "dataset": {"platform":"warehouse","namespace":"analytics","name":"orders"},
  "schema_facet": {
    "fields": [
      {"name":"order_id","type":"bigint","nullable":false},
      {"name":"order_ts","type":"timestamp","nullable":false},
      {"name":"amount","type":"numeric(12,2)","nullable":false}
    ]
  }
}
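
Step 3 (normalization) can be as simple as grouping the column rows by table. Below is a minimal sketch assuming the columns query above has already been executed and its rows are available as tuples; the normalize() function is a hypothetical helper, not part of any tool.
# Normalize information_schema rows into DataSet/Field records (sketch).
from collections import defaultdict

def normalize(column_rows, platform="warehouse"):
    """column_rows: (table_schema, table_name, column_name, data_type, is_nullable) tuples."""
    grouped = defaultdict(list)
    for schema, table, column, data_type, is_nullable in column_rows:
        grouped[(schema, table)].append(
            {"name": column, "type": data_type, "nullable": is_nullable == "YES"}
        )
    return [
        {
            "dataset": {"platform": platform, "namespace": schema, "name": table},
            "schema_facet": {"fields": fields},
        }
        for (schema, table), fields in grouped.items()
    ]

# Example: rows as returned by the columns query above
rows = [
    ("core", "orders", "order_id", "bigint", "NO"),
    ("core", "orders", "order_ts", "timestamp", "NO"),
    ("core", "orders", "amount", "numeric(12,2)", "NO"),
]
print(normalize(rows))  # matches the normalized record excerpt above, as Python dicts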

Example 2: Capture lineage from SQL query logs

Goal: Parse a materialization query to map inputs to outputs.

Input
CREATE OR REPLACE TABLE analytics.daily_sales AS
SELECT d.date, s.store_id, SUM(o.amount) AS total
FROM core.orders o
JOIN core.stores s ON o.store_id = s.id
JOIN core.dates d ON d.id = o.date_id
WHERE o.status = 'COMPLETED'
GROUP BY d.date, s.store_id;
Approach
  • Detect output dataset: analytics.daily_sales
  • Detect inputs: core.orders, core.stores, core.dates
  • Infer column lineage: daily_sales.total depends on orders.amount, with a filter on orders.status
Normalized lineage (simplified)
{
  "process": {"name":"materialize_daily_sales"},
  "run": {"id":"run-2025-05-01T01:00Z","status":"COMPLETED"},
  "inputs":["core.orders","core.stores","core.dates"],
  "outputs":["analytics.daily_sales"],
  "columnLineage": {
    "analytics.daily_sales.total": ["core.orders.amount"],
    "analytics.daily_sales.date": ["core.dates.date"],
    "analytics.daily_sales.store_id": ["core.stores.store_id"]
  }
}
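
The approach above can be prototyped with a deliberately naive, regex-based parser that extracts dataset-level lineage only. It handles simple CREATE ... AS SELECT statements with plain FROM/JOIN references; real systems use a proper SQL parser (such as sqlglot) or the warehouse's query history and lineage views, and the parse_ctas_lineage() name is illustrative.
# Naive dataset-level lineage from a CTAS statement (illustrative only).
import re

def parse_ctas_lineage(sql: str) -> dict:
    """Extract output and input tables from a simple CREATE ... AS SELECT."""
    output = re.search(
        r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+([\w.]+)", sql, re.IGNORECASE
    )
    inputs = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return {
        "inputs": sorted(set(inputs)),
        "outputs": [output.group(1)] if output else [],
    }

sql = """
CREATE OR REPLACE TABLE analytics.daily_sales AS
SELECT d.date, s.store_id, SUM(o.amount) AS total
FROM core.orders o
JOIN core.stores s ON o.store_id = s.id
JOIN core.dates d ON d.id = o.date_id
WHERE o.status = 'COMPLETED'
GROUP BY d.date, s.store_id
"""
print(parse_ctas_lineage(sql))
# {'inputs': ['core.dates', 'core.orders', 'core.stores'],
#  'outputs': ['analytics.daily_sales']}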

Example 3: Collect file dataset metadata from object storage + catalog

Goal: Enumerate S3 data lake tables using a catalog (e.g., Glue, Unity Catalog, or Hive Metastore) and capture partitions.

Steps
  1. List tables via catalog API for database "raw".
  2. For each table, fetch storage descriptor: location, format, partition keys.
  3. Optionally sample a partition to infer column types if not provided.
Normalized record
{
  "dataset": {"platform":"s3","namespace":"raw","name":"events"},
  "storage_facet": {"location":"s3://lake/raw/events/","format":"parquet","partitioning":["ingest_date"]},
  "schema_facet": {"fields":[{"name":"event_id","type":"string","nullable":false},{"name":"ingest_date","type":"date","nullable":false}]}
}
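
As a sketch of steps 1 and 2, the snippet below assumes AWS Glue as the catalog and uses boto3 to list tables and their storage descriptors; the crawl_glue_database() function is illustrative. Note that Glue reports formats as Hadoop input-format class names rather than "parquet", and it does not track nullability, so both would need mapping or defaults in practice.
# Pull table metadata from an AWS Glue catalog database (illustrative sketch).
import boto3

def crawl_glue_database(database: str = "raw"):
    glue = boto3.client("glue")
    records = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        for table in page["TableList"]:
            sd = table.get("StorageDescriptor", {})
            records.append({
                "dataset": {"platform": "s3", "namespace": database, "name": table["Name"]},
                "storage_facet": {
                    "location": sd.get("Location"),
                    # Hadoop input-format class name; map to "parquet", "csv", ... in practice
                    "format": sd.get("InputFormat"),
                    "partitioning": [k["Name"] for k in table.get("PartitionKeys", [])],
                },
                "schema_facet": {
                    "fields": [
                        # Glue does not track nullability; default to True here
                        {"name": c["Name"], "type": c["Type"], "nullable": True}
                        for c in sd.get("Columns", [])
                    ]
                },
            })
    return records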

Implementation checklist

  • [ ] Define authoritative sources: which warehouses, catalogs, orchestrators, repos.
  • [ ] Choose pull vs push per source; prefer push for run events, pull for schemas.
  • [ ] Model: datasets, fields, processes, runs, lineage; agree on identifiers.
  • [ ] Scope filters: include-only schemas, prefixes, tags.
  • [ ] Scheduling: set frequencies and SLAs; align with platform change cadence.
  • [ ] Incrementals: use updated_at, run IDs; avoid full scans.
  • [ ] Quality gates: row counts, schema drift detection, missing lineage alerts (a drift-check sketch follows this checklist).
  • [ ] Governance: owners, privacy tags, retention for metadata itself.
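
As one example of a quality gate, the sketch below compares two snapshots of a schema_facet field list and reports added, removed, and retyped columns; the schema_drift() function and its return shape are illustrative.
# Detect schema drift between two snapshots of a schema_facet (sketch).
def schema_drift(old_fields: list[dict], new_fields: list[dict]) -> dict:
    old = {f["name"]: f for f in old_fields}
    new = {f["name"]: f for f in new_fields}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(
            name for name in old.keys() & new.keys()
            if old[name]["type"] != new[name]["type"]
        ),
    }

old = [{"name": "order_id", "type": "bigint"}, {"name": "amount", "type": "numeric(12,2)"}]
new = [{"name": "order_id", "type": "bigint"}, {"name": "amount", "type": "float"},
       {"name": "currency", "type": "text"}]
print(schema_drift(old, new))
# {'added': ['currency'], 'removed': [], 'retyped': ['amount']}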

Exercises

Practice here, then check solutions. The Quick Test at the end is available to everyone; only logged-in users will have progress saved.

Exercise 1 — Normalize warehouse schemas

Given tables core.customers(id bigint not null, email text not null, created_at timestamp not null) and analytics.orders_by_day(order_date date not null, orders int not null), produce normalized records for datasets and fields using the neutral model described above.

Expected structure
{
  "dataset": {"platform":"warehouse","namespace":"SCHEMA","name":"TABLE"},
  "schema_facet": {"fields":[{"name":"...","type":"...","nullable":false}]}
}
Hints
  • Use namespace = schema name (core, analytics).
  • Data types can be kept as-is; ensure nullable flags are correct.
Solution
{
  "records": [
    {
      "dataset": {"platform":"warehouse","namespace":"core","name":"customers"},
      "schema_facet": {
        "fields": [
          {"name":"id","type":"bigint","nullable":false},
          {"name":"email","type":"text","nullable":false},
          {"name":"created_at","type":"timestamp","nullable":false}
        ]
      }
    },
    {
      "dataset": {"platform":"warehouse","namespace":"analytics","name":"orders_by_day"},
      "schema_facet": {
        "fields": [
          {"name":"order_date","type":"date","nullable":false},
          {"name":"orders","type":"int","nullable":false}
        ]
      }
    }
  ]
}

Exercise 2 — Build a simple lineage event

A job daily_orders creates analytics.orders_by_day from core.orders and core.dates using a GROUP BY on order_date. Create a lineage event with process, run, inputs, outputs, and columnLineage for orders_by_day.orders derived from core.orders.id.

Expected structure
{
  "process": {"name":"..."},
  "run": {"id":"...","status":"..."},
  "inputs":["..."],
  "outputs":["..."],
  "columnLineage": {"analytics.orders_by_day.orders":["..."]}
}
Hints
  • Use a plausible run ID like run-YYYYMMDD.
  • Include both inputs (orders and dates), even if not all map to the aggregated column.
Solution
{
  "process": {"name":"daily_orders"},
  "run": {"id":"run-2025-06-01","status":"COMPLETED"},
  "inputs": ["core.orders","core.dates"],
  "outputs": ["analytics.orders_by_day"],
  "columnLineage": {
    "analytics.orders_by_day.orders": ["core.orders.id"]
  }
}

Common mistakes and self-check

  • Collecting everything: Leads to noise and cost. Self-check: Do we have include-lists for schemas and databases?
  • No stable identifiers: Names change; use platform/namespace/name triplets or GUIDs. Self-check: Can we uniquely identify the same table across environments?
  • Ignoring incrementals: Full scans are slow. Self-check: Do our sources expose updated_at or versioning?
  • Run events without context: A run without process or datasets is not actionable. Self-check: Does each run link to a process and its inputs/outputs?
  • Column lineage overreach: If you can’t reliably parse, keep dataset-level lineage first. Self-check: Are we marking lineage confidence?

Practical projects

  • Warehouse crawler: Pull schemas nightly from your warehouse and materialize a searchable catalog page.
  • Run event emitter: Add a lightweight emitter to one ETL job to push start/end, row counts, and inputs/outputs (see the sketch after this list).
  • Lineage from logs: Parse one orchestrator’s logs to build dataset-level lineage for three pipelines.
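
A minimal sketch of the run event emitter project: a context manager that wraps a job, posts a start event, lets the job record metrics, and posts a completion or failure event. The METADATA_API_URL endpoint and the event shape are assumptions for illustration; a real emitter would typically send OpenLineage events or use your catalog's SDK.
# Lightweight run-event emitter for a single job (illustrative sketch).
import json
import urllib.request
from contextlib import contextmanager
from datetime import datetime, timezone

METADATA_API_URL = "https://metadata.example.com/runs"  # hypothetical endpoint

def emit(event: dict) -> None:
    req = urllib.request.Request(
        METADATA_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

@contextmanager
def run_event(process: str, inputs: list[str], outputs: list[str]):
    event = {
        "process": {"name": process},
        "run": {"id": f"run-{datetime.now(timezone.utc).isoformat()}", "status": "RUNNING"},
        "inputs": inputs,
        "outputs": outputs,
        "metrics": {},
    }
    emit(event)  # start event
    try:
        yield event["metrics"]                    # job code fills in row counts, bytes, ...
        event["run"]["status"] = "COMPLETED"
    except Exception:
        event["run"]["status"] = "FAILED"
        raise
    finally:
        event["run"]["endedAt"] = datetime.now(timezone.utc).isoformat()
        emit(event)  # end event

# Usage inside an ETL job (load_daily_orders() is a hypothetical job function):
# with run_event("daily_orders", ["core.orders", "core.dates"],
#                ["analytics.orders_by_day"]) as metrics:
#     metrics["rowCount"] = load_daily_orders()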

Learning path

  • First: Technical metadata collection (this page) to build your base catalog and run telemetry.
  • Next: Enrich with business metadata and ownership for discoverability.
  • Then: Column-level lineage and data quality signals for trust and impact analysis.
  • Finally: Access policies and lifecycle management based on metadata.

Mini challenge

Instrument one pipeline end-to-end. Acceptance criteria:

  • Datasets (inputs/outputs) registered with schema.
  • Run event recorded with start/end, status, row counts.
  • Dataset-level lineage present; column-level for at least one column.
  • Incremental schema refresh set to daily for touched schemas.
Tip

Start narrow: one schema, one job. Validate quality and performance before scaling.

Next steps

  • Pick one system to crawl (warehouse or data lake) and implement a minimal but reliable pull.
  • Add push-based run events to one job for fresher lineage.
  • Automate checks for schema drift and missing lineage.

Quick Test

Test your knowledge with 8 questions; a score of 70% or higher passes. Anyone can take the test; only logged-in users will have results saved.
