What this skill covers
Data Catalog and Governance is how a Data Platform Engineer makes data discoverable, trustworthy, compliant, and reusable. You’ll define metadata, lineage, ownership, contracts, approvals, and documentation so teams can self-serve with confidence. Mastering this unlocks safer releases, faster debugging, and better collaboration with analysts, ML engineers, and compliance teams.
What success looks like
- Every key dataset has a clear owner, steward, and SLA.
- Datasets are searchable by domain, tags, quality, and certifications.
- Lineage shows how metrics are built and where data flows.
- Schema changes follow an approval workflow and don’t break consumers.
- Documentation stays close to the data and is kept current.
Who this is for
- Data Platform Engineers implementing catalogs, contracts, and governance workflows.
- Data Engineers who own pipelines and need reliable lineage and metadata.
- Analytics Engineers and BI Developers who want certified, well-documented datasets.
Prerequisites
- Comfort with SQL (CTEs, DDL, schema evolution).
- Basic understanding of data warehousing concepts (tables, views, schemas).
- Familiarity with version control and code reviews.
What you will be able to do
- Model metadata for datasets, columns, owners, SLAs, tags, and usage.
- Capture lineage from pipelines and queries.
- Define data contracts and manage schema changes safely.
- Run governance workflows for approvals, deprecations, and certifications.
- Publish a discoverable, documented, self-serve catalog.
Practical roadmap
1) Inventory core datasets
List high-impact datasets (facts, dimensions, curated marts). Capture name, domain, owner, and consumers; a starter inventory query appears after this roadmap.
2) Define metadata model
Minimum viable metadata: title, description, owner, steward, SLA, PII flags, tags, freshness, quality status, certification status.
3) Establish lineage capture
Pick a consistent way to record upstream/downstream relationships from jobs, SQL models, or orchestration logs.
4) Introduce contracts and registry
Create machine-readable schema contracts for key datasets; enforce compatibility checks in CI.
5) Governance workflows
Define approval paths for PII changes, breaking schema changes, and dataset certifications.
6) Documentation standards
Adopt a template and a “docs PR” rule for new or changed datasets. Tie docs to code.
7) Enable self-serve discovery
Tagging, search facets, example queries, and usage hints so teams can onboard fast.
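For step 1, the warehouse’s own metadata views are a quick starting point. The query below is a minimal sketch assuming an ANSI-style information_schema (as in Postgres or Snowflake); the schema names are illustrative, and owners and consumers still have to come from people or orchestration metadata.
-- Minimal inventory pull over an ANSI-style information_schema.
-- Schema names are illustrative; owners and consumers are not recorded
-- here and must be captured separately.
SELECT table_schema, table_name, table_type
FROM information_schema.tables
WHERE table_schema IN ('analytics', 'staging')
ORDER BY table_schema, table_name;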
Worked examples
1) Minimal dataset catalog entry (YAML)
dataset: analytics.sales_orders
domain: sales
summary: Curated orders with one row per order_id.
owner: team-sales-platform
steward: data-governance
slas:
  freshness_minutes: 60
  availability: 99.9
classifications:
  pii: false
  contains_financial: true
quality:
  certification: bronze
  last_validation: 2026-01-05
  checks:
    - name: non_null_order_id
      status: pass
    - name: valid_status_enum
      status: pass
tags: [curated, revenue, core]
columns:
  - name: order_id
    type: STRING
    description: Primary key for an order.
  - name: order_total
    type: DECIMAL(12,2)
    description: Order amount in shop currency.
    quality_checks:
      - min: 0
  - name: created_at
    type: TIMESTAMP
    description: Order creation time (UTC).
Store this alongside code so it’s versioned and reviewable.
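To make the reviewed entry show up in search, a sync job can upsert it into the catalog tables used later in this section. The statement below is a minimal sketch, assuming a catalog.datasets table with these illustrative columns and a warehouse that supports MERGE; exact syntax varies by platform.
-- Hypothetical sync step: upsert the reviewed entry into the catalog so it
-- becomes searchable. catalog.datasets(dataset, domain, summary, owner,
-- certification) mirrors the search example later in this section.
MERGE INTO catalog.datasets AS t
USING (
  SELECT
    'analytics.sales_orders' AS dataset,
    'sales' AS domain,
    'Curated orders with one row per order_id.' AS summary,
    'team-sales-platform' AS owner,
    'bronze' AS certification
) AS s
ON t.dataset = s.dataset
WHEN MATCHED THEN UPDATE SET
  domain = s.domain,
  summary = s.summary,
  owner = s.owner,
  certification = s.certification
WHEN NOT MATCHED THEN INSERT (dataset, domain, summary, owner, certification)
VALUES (s.dataset, s.domain, s.summary, s.owner, s.certification);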
2) Recording lineage from a transformation
Suppose a job builds analytics.sales_orders from a staging table:
-- Build statement
CREATE OR REPLACE TABLE analytics.sales_orders AS
WITH s AS (
SELECT * FROM staging.orders
)
SELECT order_id, total AS order_total, created_at
FROM s;
-- Lineage capture (pseudo-SQL meta)
INSERT INTO governance.lineage (upstream, downstream, relation_type, job_id)
VALUES
('staging.orders', 'analytics.sales_orders', 'select', 'job_123');
Automate lineage by emitting records during job runs.
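With lineage rows in place, impact analysis becomes a graph walk: before changing staging.orders, list everything downstream of it. A minimal sketch with a recursive CTE, assuming the governance.lineage table above and an acyclic lineage graph:
-- Everything downstream of staging.orders, however many hops away.
-- Assumes governance.lineage as populated above and an acyclic graph
-- (add a depth cap if cycles are possible).
WITH RECURSIVE impacted AS (
  SELECT downstream AS dataset, 1 AS hops
  FROM governance.lineage
  WHERE upstream = 'staging.orders'
  UNION ALL
  SELECT l.downstream, i.hops + 1
  FROM governance.lineage AS l
  JOIN impacted AS i ON l.upstream = i.dataset
)
SELECT dataset, MIN(hops) AS min_hops
FROM impacted
GROUP BY dataset
ORDER BY min_hops;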
3) Business glossary terms linked to data dictionary
{
  "term": "Gross Revenue",
  "definition": "Sum of order_total before discounts, returns, and taxes.",
  "owners": ["finance-analytics"],
  "related_datasets": [
    {
      "dataset": "analytics.sales_orders",
      "column": "order_total",
      "note": "Used in gross revenue pre-adjustments"
    }
  ],
  "approved": true
}
Link business terms to concrete columns to reduce ambiguity.
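If glossary links are also loaded into a table, resolving a term to its authoritative columns is a simple join. A minimal sketch, assuming a hypothetical governance.glossary_links table populated from the JSON above and the catalog.datasets table used later in this section:
-- Resolve a business term to its concrete columns and show each source's badge.
-- governance.glossary_links(term, dataset, column_name, note) is hypothetical.
SELECT g.term, g.dataset, g.column_name, g.note, d.certification
FROM governance.glossary_links AS g
JOIN catalog.datasets AS d
  ON d.dataset = g.dataset
WHERE g.term = 'Gross Revenue';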
4) Schema contract with compatibility rules
{
  "$schema": "https://example.org/data-contract.schema.json",
  "name": "analytics.sales_orders",
  "version": 3,
  "mode": "append_compatible",
  "fields": [
    {"name": "order_id", "type": "string", "required": true},
    {"name": "order_total", "type": "decimal", "required": true},
    {"name": "created_at", "type": "timestamp", "required": true},
    {"name": "status", "type": "string", "required": false, "enum": ["placed","paid","shipped","canceled"]}
  ],
  "constraints": {
    "primary_key": ["order_id"],
    "non_negative": ["order_total"]
  }
}
Breaking change examples: dropping a required field or changing a type incompatibly. Enforce checks in CI before deployment.
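One way to implement those checks is to compare the deployed schema against the contract and fail CI if anything drifts. The query below is a minimal sketch, assuming the contract has been loaded into a hypothetical governance.contract_fields table and that the warehouse exposes information_schema.columns; any row returned is a candidate breaking change.
-- Flag contract fields that are missing from the deployed table or whose
-- type no longer matches. governance.contract_fields(contract_name,
-- field_name, field_type, required) is hypothetical, loaded from the
-- contract JSON; real checks usually map type names per warehouse
-- (e.g. "decimal" vs NUMERIC) before comparing.
SELECT
  cf.field_name,
  cf.field_type AS contract_type,
  col.data_type AS deployed_type,
  CASE WHEN col.column_name IS NULL THEN 'missing_field'
       ELSE 'type_mismatch' END AS violation
FROM governance.contract_fields AS cf
LEFT JOIN information_schema.columns AS col
  ON col.table_schema = 'analytics'
 AND col.table_name = 'sales_orders'
 AND col.column_name = cf.field_name
WHERE cf.contract_name = 'analytics.sales_orders'
  AND (
    (col.column_name IS NULL AND cf.required = TRUE)
    OR (col.column_name IS NOT NULL
        AND LOWER(col.data_type) <> LOWER(cf.field_type))
  );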
5) Governance workflow: certification and approvals
request:
  action: certify
  dataset: analytics.sales_orders
  target_level: silver
  justification: "Data quality checks stable for 30 days; lineage complete."
  evidence:
    - check_suite: sales_orders_qc_v2
      pass_rate_30d: 99.8
    - lineage_coverage: 100
approvals:
  required: ["data-steward", "security"]
  granted:
    - role: data-steward
      by: alice
      at: 2026-01-03T12:01:00Z
    - role: security
      by: bob
      at: 2026-01-03T14:22:00Z
result:
  certification: silver
  expires_at: 2026-07-01T00:00:00Z
Automate badge changes only after required approvals are recorded.
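The gate itself can be a guarded update: set the badge only when every required approval has a matching grant. A minimal sketch, assuming hypothetical governance.approvals_required and governance.approvals_granted tables keyed by a request id, plus the catalog.datasets table used elsewhere in this section:
-- Promote the badge only if no required approval is still missing.
-- 'cert_req_42' is a hypothetical certification request id.
UPDATE catalog.datasets
SET certification = 'silver'
WHERE dataset = 'analytics.sales_orders'
  AND NOT EXISTS (
    SELECT 1
    FROM governance.approvals_required AS req
    LEFT JOIN governance.approvals_granted AS g
      ON g.request_id = req.request_id
     AND g.role = req.role
    WHERE req.request_id = 'cert_req_42'
      AND g.role IS NULL          -- a required role has not yet approved
  );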
6) Discovery enablement: tags and usage hints
-- Example: Tagging and usage note
INSERT INTO catalog.tags(dataset, tag) VALUES
('analytics.sales_orders','curated'),
('analytics.sales_orders','revenue');
INSERT INTO catalog.usage_notes(dataset, note)
VALUES ('analytics.sales_orders','Use for revenue trend analysis; see also analytics.order_items for item-level details.');
-- Example: Search
SELECT dataset, summary
FROM catalog.datasets
WHERE domain = 'sales'
AND 'curated' = ANY(tags)
AND certification IN ('silver','gold');
Surface a small set of high-signal tags and a clear usage note to speed discovery.
Drills and exercises
- Create a YAML metadata file for one dataset with owner, steward, SLA, tags, and three column descriptions.
- Record lineage for a pipeline that reads two sources and writes one mart.
- Write a data contract that adds a new optional column without breaking compatibility.
- Draft a certification request with evidence and identify who must approve.
- Tag three datasets consistently and write a search query to find them by domain and certification.
Common mistakes and debugging tips
- Unowned datasets: Always assign an accountable owner and an operational steward.
- Docs drift: Keep docs in the same repo as data code; require doc updates in PRs.
- Hidden breaking changes: Use contracts to catch type changes, nullability changes, or semantic renames.
- Incomplete lineage: Automate lineage from orchestration/job logs; backfill key historical jobs once.
- Tag sprawl: Standardize a short, approved tag vocabulary; avoid synonyms that fragment search.
- Over-restrictive approvals: Use risk-based workflows; trivial non-breaking changes should be auto-approved.
Mini project: Launch a governed revenue mart
Goal: Publish analytics.revenue_daily with full metadata, lineage, contract, and a silver certification.
- Design the schema and write a contract (include PK, types, and required fields).
- Create the dataset with a transformation job. Emit lineage from sources to mart.
- Write a metadata YAML with owner, steward, SLA (freshness 60 min), tags (curated, revenue), and column docs.
- Add two quality checks (non-null PK; non-negative revenue).
- Submit a certification request with 7-day quality evidence; route to steward and security.
- Add a usage note and two example queries.
Acceptance criteria
- Contract passes CI checks; no breaking changes.
- Lineage shows all upstream sources.
- Metadata appears in catalog with searchable tags.
- Quality checks green for 7 consecutive days.
- Certification badge set to silver.
Practical projects to reinforce learning
- Dataset deprecation workflow: Safely retire an unused table with sunset notices and redirects.
- PII governance: Flag and mask PII columns; require security approval on changes.
- Domain catalog rollout: Apply a standard template to three domains (sales, marketing, ops) and measure search success.
Learning path
- Start with Metadata Collection and Lineage.
- Add a Data Dictionary and Business Glossary.
- Define Ownership and Stewardship roles.
- Introduce Schema Registry and Contracts.
- Set up Governance Workflows and Approvals.
- Publish Certification and Quality Badges.
- Enable Discovery and Self-Serve.
- Enforce Documentation Standards.
Subskills
- Metadata Collection And Lineage — Capture dataset metadata and end-to-end data flow.
- Data Dictionary And Business Glossary — Standardize term definitions and link to columns.
- Dataset Ownership And Stewardship — Assign accountable owners and operational stewards.
- Certification And Quality Badges — Signal dataset trust levels with evidence-backed badges.
- Schema Registry And Contracts — Define machine-readable schemas and compatibility rules.
- Governance Workflows And Approvals — Route risky changes to the right approvers.
- Discovery And Self Serve Enablement — Make data easy to find and use with tags and examples.
- Documentation Standards — Templates, review rules, and freshness for durable docs.
Next steps
- Practice with one high-impact dataset and ship full metadata plus lineage.
- Add contracts and CI checks to your top three curated datasets.
- Pilot a certification workflow with a steward and iterate on criteria.