Why this matters
As a Data Engineer, you build and move data. Without a clear data catalog and consistent metadata, people can't find, trust, or safely use what you build. Good metadata reduces support requests, prevents broken pipelines, and speeds up onboarding.
- Real task: publish a new dataset (name, description, owner, schema, freshness, sensitivity).
- Real task: map lineage so analysts know what upstream tables feed a dashboard.
- Real task: mark PII columns and define access levels for compliance.
- Real task: record data quality checks and their last results.
- Real task: deprecate a dataset with a sunset date and replacement pointer.
Concept explained simply
A data catalog is like a library catalog for your data assets. Metadata is the information card about each asset: what it is, who owns it, how to use it, where it came from, and how fresh and reliable it is.
Mental model
Think "passport + map + care label":
- Passport: identity (name, description, owner, tags).
- Map: lineage and dependencies (upstream/downstream).
- Care label: quality checks, freshness, retention, sensitivity, SLA.
Core types of metadata
- Technical: schemas, data types, storage location, partitions, indexes.
- Business: friendly names, definitions, glossary terms, KPIs, units.
- Operational: freshness, load schedule, run duration, last success, SLA.
- Lineage: upstream inputs, downstream consumers, transform jobs.
- Governance: owners, stewards, sensitivity/PII tags, retention, access level.
- Quality: tests/expectations, last run results, anomaly notes.
Tip: minimum viable metadata (MVM)
Start with Name, Description, Owner, Status, Schema, Freshness, Sensitivity. Expand later.
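The MVM tip above can be turned into a simple completeness check. This is a minimal sketch in Python; it assumes catalog entries arrive as plain dictionaries, and the field names mirror the MVM list (they are illustrative, not a fixed standard).

```python
# Minimum viable metadata (MVM) fields from the tip above (illustrative names).
MVM_FIELDS = ["name", "description", "owner", "status", "schema", "freshness", "sensitivity"]

def missing_mvm_fields(entry: dict) -> list:
    """Return the MVM fields an entry is missing or has left blank."""
    return [f for f in MVM_FIELDS if not entry.get(f)]

entry = {
    "name": "Fact Orders",
    "description": "One row per finalized order.",
    "owner": "data-team@sales",
    "status": "active",
    "schema": [{"column": "order_id", "type": "STRING"}],
    "freshness": "<= 2h from source",
    # sensitivity intentionally omitted to show the check firing
}
print(missing_mvm_fields(entry))  # -> ['sensitivity']
```

A check like this can run in CI so a dataset cannot be published without its minimum metadata.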
Worked examples
Example 1: Dataset passport (minimal YAML)
dataset: mart.sales.fact_orders
name: Fact Orders
description: One row per finalized order with revenue and cost aggregates.
owner: data-team@sales
status: active
freshness: "<= 2h from source"
sensitivity:
  contains_pii: false
schema:
  - column: order_id
    type: STRING
    description: Unique order identifier
    constraints: primary_key
  - column: order_date
    type: DATE
    description: Order booking date (company calendar)
  - column: revenue
    type: DECIMAL(18,2)
    description: Net revenue in USD
  - column: channel
    type: STRING
    description: Sales channel (web, retail, partner)
Why this works
It covers identity, ownership, usability, and key schema details. It's enough for discovery and first use.
Example 2: Lineage snippet (JSON)
{
  "asset": "mart.sales.fact_orders",
  "upstream": [
    "raw.sales.orders_raw",
    "dim.calendar",
    "dim.channel"
  ],
  "downstream": [
    "dashboards/revenue_overview",
    "mart.sales.customer_ltv"
  ],
  "transform_jobs": ["jobs/build_fact_orders_daily"]
}
Why lineage matters
When an upstream breaks, you instantly know who to notify. When changing a column, you know which downstream assets to update.
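The notify-downstream step above is a graph traversal. Here is a minimal sketch in Python, assuming lineage is stored as an asset-to-consumers map; the map mirrors the JSON snippet, with one extra hypothetical hop (`dashboards/ltv_report`) added for illustration.

```python
from collections import deque

# Hypothetical lineage map: asset -> direct downstream consumers.
DOWNSTREAM = {
    "raw.sales.orders_raw": ["mart.sales.fact_orders"],
    "mart.sales.fact_orders": ["dashboards/revenue_overview", "mart.sales.customer_ltv"],
    "mart.sales.customer_ltv": ["dashboards/ltv_report"],
}

def impacted_assets(asset: str) -> set:
    """All transitive downstream assets to notify when `asset` breaks."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(impacted_assets("raw.sales.orders_raw")))
```

Running the same function on an upstream raw table versus a leaf dashboard shows the blast radius of a change at each layer.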
Example 3: Quality expectations and results
quality:
  expectations:
    - name: not_null_order_id
      check: order_id is not null
      severity: high
    - name: revenue_non_negative
      check: revenue >= 0
      severity: medium
    - name: freshness_within_2h
      check: loaded_at within 2 hours of now
      severity: high
  last_run:
    timestamp: 2026-01-07T08:15:00Z
    failures: []
    notes: "All checks passed"
How to use
Store expectations alongside the dataset entry and surface the last_run status in the catalog UI.
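A runner for these expectations can be sketched in a few lines. The check names below match the YAML example, but the implementations and the `evaluate` function are illustrative assumptions, not a real quality framework's API.

```python
from datetime import datetime, timedelta, timezone

def evaluate(rows, loaded_at, now):
    """Run the three example expectations and report a last_run-style result."""
    checks = {
        "not_null_order_id": all(r.get("order_id") is not None for r in rows),
        "revenue_non_negative": all(r["revenue"] >= 0 for r in rows),
        "freshness_within_2h": now - loaded_at <= timedelta(hours=2),
    }
    failures = [name for name, passed in checks.items() if not passed]
    return {"timestamp": now.isoformat(), "failures": failures}

now = datetime(2026, 1, 7, 8, 15, tzinfo=timezone.utc)
rows = [{"order_id": "A1", "revenue": 19.99}, {"order_id": "A2", "revenue": 0}]
result = evaluate(rows, loaded_at=now - timedelta(hours=1), now=now)
print(result["failures"])  # -> []
```

The returned dictionary has the same shape as the `last_run` block above, so it can be written straight back into the catalog entry.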
How to document a dataset quickly
- Identify the asset: stable name, layer (raw/stage/mart), domain (e.g., sales), and purpose.
- Capture ownership: team email, on-call channel, and a business contact.
- Summarize: 1–2 sentence plain-English description with scope and grain.
- Schema essentials: column names, types, key constraints, and business-friendly meanings.
- Operational bits: schedule, freshness SLA, last successful run.
- Governance tags: PII flag, access level, retention.
- Lineage: upstream inputs and known downstreams.
- Quality: a few high-value checks and last results.
Time saver: templates
Use a standard template for all datasets so people know where to find each detail.
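One lightweight way to enforce a standard template is to generate a fill-in-the-blanks stub for every new dataset. This is a sketch; the field order follows the MVM tip earlier and the `new_entry_stub` helper is hypothetical.

```python
# Uniform catalog-entry template (field order is an assumption, not a standard).
TEMPLATE = """dataset: {dataset}
name: {name}
description: {description}
owner: {owner}
status: {status}
freshness: {freshness}
sensitivity: {sensitivity}
"""

def new_entry_stub(dataset: str) -> str:
    """Emit a fill-in-the-blanks catalog entry for a new dataset."""
    return TEMPLATE.format(
        dataset=dataset, name="TODO", description="TODO", owner="TODO",
        status="draft", freshness="TODO", sensitivity="TODO",
    )

print(new_entry_stub("mart.product.app_dau"))
```

Because every stub has the same shape, reviewers always know where to look for the owner, freshness SLA, or sensitivity flag.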
Exercises
Complete both exercises below. Aim for clarity and consistency over perfection.
Exercise ex1: Minimal metadata spec
Create a minimal metadata entry for a dataset that tracks daily active users (DAU) by app version.
- Asset name: choose a clear layer/path (e.g., mart.product.app_dau).
- Provide: name, description, owner, status, freshness, sensitivity.
- List 4–6 columns with types and short descriptions.
Exercise ex2: Trace lineage
You ingest events into raw.events. You aggregate to stage.events_daily and then to mart.product.app_dau. Document lineage and one quality expectation.
- List upstream and downstream for each layer.
- Add one freshness SLA for mart.product.app_dau.
Self-check checklist
- ✓ Every asset has an owner and a working contact.
- ✓ Descriptions use plain language and define the grain.
- ✓ Schema includes key columns and meanings.
- ✓ Freshness SLA is stated and realistic.
- ✓ Sensitivity/PII tags are present where needed.
- ✓ Upstream/downstream are listed for at least one hop.
Common mistakes and self-check
- Too technical, no business meaning: If a non-engineer can't tell what it is, add a plain-English description.
- No owner: Unowned data becomes stale. Always set a team and contact.
- Stale freshness claims: If SLA is missed, update it or fix the pipeline.
- Missing PII flags: Classify sensitive columns early to avoid access issues.
- Hidden lineage: Changes break downstream users. Record upstream/downstream.
- Huge docs, no template: Keep it consistent and scannable.
Self-audit in 5 minutes
- Open three popular datasets and confirm owner, description, freshness, and lineage exist.
- If any are missing, create quick stubs and schedule a follow-up to refine.
Practical projects
- Catalog Sprint: Pick 10 top-used datasets. Apply the minimal template to each and add 1–2 quality checks. Present before/after impact.
- Lineage Map: For one domain (e.g., sales), draw asset lineage from raw to dashboards. Capture change points and owners.
- Glossary + Tags: Define 10 business terms (e.g., Active User, Revenue) and tag relevant datasets with these terms.
Who this is for
- Data Engineers who publish datasets and maintain pipelines.
- Analytics Engineers standardizing marts and models.
- Data Stewards and Platform Engineers improving discoverability.
Prerequisites
- Basic SQL and understanding of tables, views, and schemas.
- Familiarity with your data platformās storage and scheduling.
- Ability to read pipeline DAGs or job dependencies.
Learning path
- Learn dataset layers (raw/stage/mart) and naming conventions.
- Define a minimal metadata template for your organization.
- Document top datasets and add lineage hops.
- Introduce 2–3 quality checks per critical dataset.
- Expand with governance tags and glossary terms.
Next steps
- Automate harvesting of technical metadata from your warehouse.
- Standardize owners and on-call contacts for each domain.
- Add status tags (active/deprecated) and retention policies.
- Review datasets quarterly for freshness and usage.
Ready to check your understanding? Take the quick test below. The test is available to everyone; only logged-in users get saved progress.
Mini challenge
You joined a team with scattered docs. In 45 minutes, produce a one-page catalog summary for the top 5 datasets in one domain. Include owner, description, schema highlights, freshness, lineage (1 hop), and sensitivity. Timebox each dataset to 8 minutes and use a uniform template.
Quick Test
Answer a few questions to cement the concepts.