Why this matters for a Data Architect
Metadata and lineage architecture defines how your organization understands, governs, and trusts its data. As a Data Architect, you design the standards, services, and processes that make data discoverable, well-documented, and traceable from source to report. This reduces risk, speeds analytics delivery, and enables compliance.
- Trust: Lineage shows where metrics come from and what transformations occurred.
- Governance: Metadata enables access control, classification (e.g., PII), and auditability.
- Productivity: Catalogs and search help teams find and reuse datasets faster.
- Reliability: Schema registries and versioning prevent breaking changes.
What you will learn
- Design a metadata model covering business, technical, and operational metadata.
- Implement a business glossary and link terms to technical assets.
- Collect technical metadata from databases, pipelines, and BI tools.
- Capture end-to-end lineage from source to report.
- Use schema registries and dataset documentation standards.
- Enable discoverability, automated documentation, and governance workflows.
Who this is for
- Data Architects defining platform standards and governance.
- Senior Data Engineers building pipelines and tooling.
- Analytics Engineers and BI Leads needing traceable metrics.
- Data Stewards and Governance practitioners.
Prerequisites
- Solid SQL and data warehouse concepts (tables, views, partitions, materializations).
- Basic data modeling (star schema, data vault or entity modeling).
- Familiarity with ETL/ELT tools and scheduling/orchestration.
- Awareness of data governance basics (ownership, stewardship, classification).
Learning path
- Frame the vision: define goals (trust, compliance, discoverability, productivity) and choose a scope (e.g., the top 50 datasets).
- Model the metadata: entities such as Dataset, Table, Column, Field, Term, Owner, Policy, Job, Run, LineageEdge, Tag (a minimal sketch of these entities follows this list).
- Establish standards: Dataset documentation template, glossary rules, schema versioning policy, naming conventions.
- Collect technical metadata: From databases (information_schema), pipelines (job run metadata), BI (reports/dashboards).
- Implement lineage: Capture at job boundaries; normalize to a standard graph (nodes and edges); include BI lineage.
- Enable discoverability: Index metadata, add search facets, ownership, and quality signals.
- Automate documentation: Generate docs on commit or deployment; enforce checks in CI for schema and glossary links.
- Governance workflows: Ownership assignment, term approvals, change management, deprecation and impact analysis.
- Operationalize: SLAs for freshness and completeness; alerts when lineage breaks or schemas change.
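The entities from the "Model the metadata" step can be drafted as plain record types before you commit to a catalog tool. Below is a minimal sketch in Python dataclasses; the field names and the urn convention are illustrative assumptions, not a prescribed standard.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Column:
    name: str
    data_type: str
    description: Optional[str] = None
    classification: Optional[str] = None          # e.g., "PII", "internal"
    glossary_terms: list[str] = field(default_factory=list)

@dataclass
class Dataset:
    urn: str                                      # canonical id, e.g., "db://prod.mart.daily_orders"
    owner: str
    description: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    columns: list[Column] = field(default_factory=list)

@dataclass
class LineageEdge:
    source_urn: str                               # upstream dataset
    target_urn: str                               # downstream dataset or dashboard
    via_job: str                                  # job or process that produced the edge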
Worked examples
1) Collect table and column metadata from a SQL warehouse
Use the platform's information schema to extract technical metadata and load it into your metadata store.
-- Example: enumerate tables and columns with owners and timestamps.
-- Timestamp column names vary by platform (some warehouses expose created/last_altered,
-- others creation_time); adjust to your information schema.
SELECT
  t.table_schema,
  t.table_name,
  c.column_name,
  c.data_type,
  t.table_type,
  t.creation_time,
  t.last_altered,
  COALESCE(p.parameter_value, 'unknown') AS owner
FROM information_schema.tables t
JOIN information_schema.columns c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
-- metadata.parameters is an illustrative side table holding owner assignments.
LEFT JOIN metadata.parameters p
  ON p.object_type = 'table'
 AND p.object_schema = t.table_schema
 AND p.object_name = t.table_name
 AND p.parameter_key = 'owner';
Tip: Normalize names to a canonical form like system://schema.table and include source_system for cross-platform uniqueness.
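A minimal normalization helper, assuming the system://schema.table convention above (the function name and cleanup rules are illustrative):

def canonical_name(source_system: str, schema: str, table: str) -> str:
    """Build a cross-platform unique identifier like 'snowflake://sales.orders'."""
    # Lowercase and strip quoting so the same asset matches across extractors.
    parts = [p.strip().strip('"').lower() for p in (schema, table)]
    return f"{source_system.lower()}://{parts[0]}.{parts[1]}"

# canonical_name("snowflake", "SALES", '"Orders"') -> "snowflake://sales.orders"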
2) Define and evolve a schema in a registry
Register schemas for streaming or batch exchange and enforce compatibility.
// JSON Schema example for a CustomerCreated event (v1)
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "CustomerCreated",
  "type": "object",
  "properties": {
    "customer_id": {"type": "string"},
    "email": {"type": "string", "format": "email"},
    "created_at": {"type": "string", "format": "date-time"}
  },
  "required": ["customer_id", "email", "created_at"]
}

// v2 adds an optional, nullable phone field with a default (backward-compatible);
// only the changed fragment is shown
{
  "properties": {
    "phone": {"type": ["string", "null"], "default": null}
  }
}
Policy: Only additive changes without altering required fields are allowed for backward compatibility. Record versions and change notes.
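This policy can be checked mechanically before a new version is registered, for example in CI. The sketch below compares two JSON Schema documents and flags removed properties, newly required fields, and type changes; it is a simplification of what a real registry's compatibility checker does.

def is_backward_compatible(old_schema: dict, new_schema: dict) -> tuple[bool, list[str]]:
    """Return (ok, violations) under an additive-changes-only policy."""
    violations = []
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    old_required = set(old_schema.get("required", []))
    new_required = set(new_schema.get("required", []))

    # Removing a property breaks existing readers.
    for name in set(old_props) - set(new_props):
        violations.append(f"property removed: {name}")
    # Requiring a new field breaks producers and previously written data.
    for name in new_required - old_required:
        violations.append(f"newly required field: {name}")
    # Changing a type is treated as breaking under this simple policy.
    for name in set(old_props) & set(new_props):
        if old_props[name].get("type") != new_props[name].get("type"):
            violations.append(f"type changed: {name}")
    return (not violations, violations)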
3) Emit lineage from an ETL job
Capture inputs, outputs, and run context as lineage events.
# A lineage event as a Python dict
lineage_event = {
    "job": {"name": "daily_orders_aggregate", "namespace": "warehouse"},
    "run": {"runId": "2026-01-18T01:23:00Z"},
    "inputs": [
        {"name": "raw.orders", "namespace": "db://prod"},
        {"name": "raw.customers", "namespace": "db://prod"},
    ],
    "outputs": [
        {"name": "mart.daily_orders", "namespace": "db://prod"},
    ],
    "facets": {
        "sql": {"query": "INSERT INTO mart.daily_orders ..."},
        "dataQuality": {"rowCount": 134523, "nullChecks": {"customer_id": 0}},
    },
}
# Send to your metadata service; failures should not block the pipeline but should alert.
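One way to deliver the event under that rule, assuming a hypothetical HTTP endpoint on your metadata service (the URL and the alerting hook are placeholders):

import logging
import requests  # assumes the requests library is available

METADATA_ENDPOINT = "https://metadata.internal/api/lineage"  # placeholder URL

def emit_lineage(event: dict) -> None:
    """Best-effort delivery: log and alert on failure, never raise into the ETL job."""
    try:
        resp = requests.post(METADATA_ENDPOINT, json=event, timeout=5)
        resp.raise_for_status()
    except Exception:
        logging.exception("lineage emit failed for job %s", event["job"]["name"])
        # Hook your alerting here (e.g., notify the metadata on-call), but do not re-raise.

emit_lineage(lineage_event)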
4) Map source-to-report lineage
Describe lineage in a simple YAML that your service parses into a graph.
nodes:
  - id: db://prod.raw.orders
    type: table
  - id: db://prod.stg.orders_clean
    type: table
  - id: db://prod.mart.daily_orders
    type: table
  - id: bi://sales.dashboard.revenue
    type: dashboard
edges:
  - from: db://prod.raw.orders
    to: db://prod.stg.orders_clean
    via_job: etl.orders_clean
  - from: db://prod.stg.orders_clean
    to: db://prod.mart.daily_orders
    via_job: etl.daily_orders_aggregate
  - from: db://prod.mart.daily_orders
    to: bi://sales.dashboard.revenue
    via_job: bi.publish
Ensure BI lineage is included, not just data warehouse nodes, so users can trace metrics to raw sources.
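A sketch of how a service could parse this file into a graph and answer "which raw sources feed this dashboard?", using only PyYAML and a breadth-first walk over reversed edges:

from collections import defaultdict, deque
import yaml  # PyYAML, assumed available

def upstream_sources(lineage_file: str, asset_id: str) -> set[str]:
    """Return every node reachable upstream of asset_id (e.g., a dashboard)."""
    with open(lineage_file) as f:
        doc = yaml.safe_load(f)
    parents = defaultdict(list)
    for edge in doc["edges"]:
        parents[edge["to"]].append(edge["from"])

    seen, queue = set(), deque([asset_id])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# upstream_sources("lineage.yaml", "bi://sales.dashboard.revenue")
# -> {"db://prod.mart.daily_orders", "db://prod.stg.orders_clean", "db://prod.raw.orders"}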
5) Link business glossary terms to technical fields
Bind terms like Customer, Active Customer, or Net Revenue to columns and metrics.
term: Active Customer
definition: A customer with at least one order in the last 90 days.
owners: data_governance@sales
linked_assets:
  - column: db://prod.mart.customers.active_flag
  - metric: bi://sales.metrics.active_customers_90d
review_cycle_days: 180
Rule: a dataset may not be marked "certified" unless all of its key metrics link to approved glossary terms.
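The rule is straightforward to enforce in a nightly check or CI job. A sketch, assuming datasets and glossary terms are available as dictionaries from your metadata store (the badges, key_metrics, and status fields are illustrative):

def certification_violations(datasets: list[dict], terms: list[dict]) -> list[str]:
    """List certified datasets whose key metrics lack an approved glossary term link."""
    approved_links = {
        asset["metric"]
        for term in terms if term.get("status") == "approved"
        for asset in term.get("linked_assets", []) if "metric" in asset
    }
    problems = []
    for ds in datasets:
        if "certified" not in ds.get("badges", []):
            continue
        for metric in ds.get("key_metrics", []):
            if metric not in approved_links:
                problems.append(f"{ds['urn']}: metric {metric} has no approved glossary term")
    return problems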
Drills and exercises
- List the minimal metadata you need for a new dataset (owners, SLA, classification, tags, description, last_updated, upstreams).
- Propose a naming convention for assets that encodes environment and domain (e.g., env.domain.schema.table).
- Evaluate whether renaming a required JSON field is backward-compatible in your registry policy.
- Trace lineage for the "Revenue" metric and identify its immediate upstream datasets.
- Draft a deprecation plan for a dataset used by two dashboards with minimal impact.
- Add PII classifications to three columns and define access rules.
Common mistakes and debugging tips
- Only technical metadata, no business context: Add glossary terms, owners, and use-cases. Encourage stewards to maintain definitions.
- No schema versioning: Register versions and enforce compatibility. Add CI checks to block breaking changes.
- Missing BI lineage: Include report- and metric-level edges; otherwise, analysts cannot trace metrics to sources.
- Manual-only documentation: Automate extraction from code and systems; keep humans focused on definitions and decisions.
- Unsecured metadata store: Treat metadata as sensitive. Apply least-privilege, audit logs, and backups.
- Unbounded scope: Start with top-value datasets and expand incrementally.
Debugging lineage gaps
- Check if a job emitted lineage for the run time in question.
- Verify namespace and naming conventions match across systems.
- Confirm BI extraction ran after the dashboard refresh.
- Inspect parsing failures in your lineage ingestion logs.
Mini project: Data Catalog MVP with End-to-End Lineage
- Scope: Choose one domain (e.g., Sales) and 10 critical datasets.
- Model: Define entities and JSON schemas for Dataset, Column, Owner, Term, Job, Run, LineageEdge.
- Ingest: Write a small script to read information_schema and produce JSON records for datasets and columns.
- Glossary: Define 5–10 terms; link each to at least one dataset column and one metric.
- Lineage: Produce a YAML or JSON file with edges from raw to mart to BI for two dashboards.
- Docs: Generate dataset READMEs from a template using your metadata (owners, SLA, freshness, sample queries).
- Discoverability: Build a simple search over name, tags, and owners; add badges like certified/experimental.
- Review: Run an impact analysis report for deprecating one upstream table; share with stakeholders.
Dataset README template
# Dataset: mart.daily_orders
Owner: sales-data-team
SLA: refreshed daily by 06:00 UTC
Classification: internal; contains PII: false
Description: Daily aggregated orders by customer and region.
Freshness Signal: max(loaded_at) from stg.orders_clean
Upstreams: stg.orders_clean, stg.customers_clean
Downstreams: sales.dashboard.revenue
Quality Checks: row_count > 0, null_rate(customer_id) = 0
Glossary Links: Order, Net Revenue
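Generating this README from catalog metadata keeps it from drifting out of date. A minimal sketch using Python's string.Template, assuming the metadata record carries the fields shown in the template:

from string import Template

README_TEMPLATE = Template("""# Dataset: $name
Owner: $owner
SLA: $sla
Classification: $classification
Description: $description
Upstreams: $upstreams
Downstreams: $downstreams
""")

def render_readme(meta: dict) -> str:
    """Fill the template from a metadata record; list fields are joined for readability."""
    return README_TEMPLATE.substitute(
        name=meta["name"],
        owner=meta["owner"],
        sla=meta["sla"],
        classification=meta["classification"],
        description=meta["description"],
        upstreams=", ".join(meta["upstreams"]),
        downstreams=", ".join(meta["downstreams"]),
    )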
Practical projects
- Impact Analysis Tool: Given a table, list all downstream assets and owners to notify for changes.
- Schema Drift Monitor: Detect unexpected column additions and flag datasets for review (a starter sketch follows this list).
- Metrics Catalog: Catalog core metrics with SQL definitions and tie them to glossary terms and lineage.
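As a starting point for the Schema Drift Monitor above, a sketch that diffs two column snapshots (for example, produced by the information_schema query in worked example 1):

def detect_drift(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare {column_name: data_type} snapshots and report additions, removals, and type changes."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    retyped = sorted(
        col for col in set(previous) & set(current) if previous[col] != current[col]
    )
    return {"added": added, "removed": removed, "type_changed": retyped}

# detect_drift({"customer_id": "string"}, {"customer_id": "string", "phone": "string"})
# -> {"added": ["phone"], "removed": [], "type_changed": []}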
Subskills
- Data Catalog Strategy — Plan scope, ownership, and adoption.
- Business Glossary Implementation — Define terms and approvals.
- Technical Metadata Collection — Extract from DBs, pipelines, BI.
- Lineage Source To Report — Connect raw to marts to dashboards.
- Schema Registry Concepts — Versioning and compatibility rules.
- Dataset Documentation Standards — Templates and required fields.
- Discoverability And Search — Indexing, facets, and badges.
- Automated Documentation Practices — CI checks and generation.
Next steps
- Pick one domain and build a thin slice: metadata model, glossary, lineage, and docs.
- Automate extraction for one system at a time; review data quality signals with owners.
- Roll out usage guidelines and office hours; collect feedback and iterate.