Why this matters
As a Data Architect, you design systems that make data findable, trustworthy, and reusable. Metadata standards align teams on what to capture (owner, description, schema, lineage), how to format it, and who stewards it. They reduce ambiguity, power data catalogs, enable discovery/search, accelerate compliance reviews, and make lineage traceable across platforms.
Real tasks you will face:
- Define a minimum metadata template for all datasets.
- Enforce naming and definitions for data elements.
- Integrate pipeline lineage into your catalog.
- Map business glossary terms to physical schemas.
- Standardize classifications and access labels.
Concept explained simply
Think of metadata as the library card for your data asset. It tells you what the item is, who owns it, where it lives, how it’s structured, and how it got there.
Mental model: 3 layers
- Vocabulary: what fields we record (title, owner, sensitivity, etc.).
- Structure: how we format it (JSON, Avro, RDF/DCAT).
- Process: who updates it and when (stewards, SLAs, workflows).
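To make the three layers concrete, here is a minimal Python sketch that keeps all three visible in one place; the field names are illustrative assumptions, not taken from any specific standard:
# A minimal sketch of the three layers in one place.
# Field names are illustrative, not taken from a specific standard.
import json

# Vocabulary: which fields we record.
VOCABULARY = ["title", "owner", "sensitivity", "update_frequency"]

# Structure: how we format it (a plain dict here, serialized as JSON).
record = {
    "title": "Customer Master",
    "owner": "Data Platform Team",
    "sensitivity": "Confidential",
    "update_frequency": "daily",
}

# Process: who updates it and when (a steward completeness check, run on a schedule).
missing = [field for field in VOCABULARY if field not in record]
if missing:
    print(f"Steward action needed: missing fields {missing}")
else:
    print(json.dumps(record, indent=2))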
Core standards you should know
- ISO/IEC 11179 (Metadata Registries): precise data element naming and definitions, permissible values, stewardship.
- Dublin Core (DC): simple descriptive fields (title, description, subject, creator, date).
- DCAT (Data Catalog Vocabulary): cataloging datasets, distributions, publishers, and relationships.
- W3C PROV: provenance model for who/what/when created or transformed data.
- OpenLineage (community spec): lineage events for jobs, inputs/outputs, facets.
- JSON Schema / Avro: structural schema definitions, validation, and evolution (see the validation sketch after this list).
- ISO 19115 (Geospatial): domain-specific metadata for spatial datasets.
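To give the JSON Schema bullet some substance, here is a minimal validation sketch using the third-party jsonschema package (pip install jsonschema); the schema itself is an illustrative assumption, not a published standard:
# Minimal JSON Schema validation sketch (pip install jsonschema).
# The schema below is illustrative, not a published company standard.
from jsonschema import validate, ValidationError

email_schema = {
    "type": "object",
    "properties": {
        "customer_email_address": {
            "type": "string",
            "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
        },
    },
    "required": ["customer_email_address"],
}

for record in [{"customer_email_address": "ada@example.com"},
               {"customer_email_address": "not-an-email"}]:
    try:
        validate(instance=record, schema=email_schema)
        print("valid:", record)
    except ValidationError as err:
        print("invalid:", record, "->", err.message)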
Practical categories:
- Descriptive: title, description, tags, domain, business owner.
- Structural: schema, data types, formats, keys.
- Administrative: steward, SLA, update frequency, retention, classification.
- Provenance/Lineage: source systems, jobs, timestamps, code version.
Worked examples
Example 1 — DCAT-like dataset record
dataset:
  title: Customer Master
  description: Golden record of customers merged from CRM and e-commerce.
  keywords: [customer, mdm, golden-record]
  publisher: Data Platform Team
  contactPoint: data-stewards@company
  temporal: updated daily at 02:00 UTC
  distribution:
    - format: parquet
      accessURL: s3://prod/mdm/customer_master/
      byteSize: ~200GB
  theme: master-data
  identifier: ds_customer_master
  accrualPeriodicity: P1D
  landingZone: s3://raw/crm/, s3://raw/ecom/
Example 2 — ISO 11179-style data element
data_element:
  name: Customer Email Address
  objectClassTerm: Customer
  propertyTerm: Email Address
  representationTerm: Text
  definition: Primary email address used for customer contact.
  dataType: string
  pattern: ^[^@\s]+@[^@\s]+\.[^@\s]+$
  permissibleValues: free-text (validated by pattern)
  steward: Marketing Data Steward
  securityClassification: Confidential
  businessRules:
    - Must be unique per active customer
    - Cannot be null for loyalty members
  lineageNote: Sourced from CRM, validated by EmailVerificationJob v3.2
Example 3 — Simplified lineage event
lineage_event:
  eventType: COMPLETE
  eventTime: 2026-01-15T02:05:14Z
  job: { name: mdm_customer_merge, namespace: prod.spark }
  run: { runId: 7e9f-20260115-0200, facets: { codeVersion: git:abc123 } }
  inputs:
    - { namespace: s3, name: s3://raw/crm/customers_*.parquet }
    - { namespace: s3, name: s3://raw/ecom/users_*.parquet }
  outputs:
    - { namespace: s3, name: s3://prod/mdm/customer_master/part_*.parquet }
  producer: metadata-ingester-1.4
Why these examples matter
- You get a consistent dataset card for discovery (DCAT-like).
- You define data elements precisely to avoid ambiguity (ISO 11179-like).
- You can trace how outputs were created (lineage/provenance); a minimal emitter sketch follows this list.
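To show how an event like Example 3 could be produced in practice, here is a minimal Python sketch that builds and serializes one; it uses only the standard library, and a real setup would likely use an OpenLineage client and POST the event to your lineage backend:
# Minimal sketch: build and serialize a lineage event like Example 3.
# Standard library only; in production you would likely use an
# OpenLineage client and send the event to your lineage backend.
import json
from datetime import datetime, timezone

def lineage_event(job_name, namespace, run_id, inputs, outputs, code_version):
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "job": {"name": job_name, "namespace": namespace},
        "run": {"runId": run_id, "facets": {"codeVersion": code_version}},
        "inputs": [{"namespace": "s3", "name": path} for path in inputs],
        "outputs": [{"namespace": "s3", "name": path} for path in outputs],
        "producer": "metadata-ingester-1.4",
    }

event = lineage_event(
    job_name="mdm_customer_merge",
    namespace="prod.spark",
    run_id="7e9f-20260115-0200",
    inputs=["s3://raw/crm/customers_*.parquet", "s3://raw/ecom/users_*.parquet"],
    outputs=["s3://prod/mdm/customer_master/part_*.parquet"],
    code_version="git:abc123",
)
print(json.dumps(event, indent=2))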
Step-by-step: implement metadata standards
- Define a minimum viable metadata (MVM) template: owner, steward, description, schema, sensitivity, update frequency, lineage pointer.
- Set naming rules for data elements (object class + property + representation term), each with a clear definition.
- Capture lineage from one pilot pipeline: job name, run id, inputs, outputs, event time.
- Assign stewards and review workflows so records stay current.
- Expand domain by domain and review the standard quarterly.
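As a starting point for the first step, here is a minimal sketch that checks a dataset card (like Example 1) against required MVM fields; the snake_case field names and the PyYAML dependency are assumptions for illustration:
# Minimal MVM completeness check for a YAML dataset card (pip install pyyaml).
# Required field names mirror the checklist below and are an assumed convention.
import yaml

REQUIRED_MVM_FIELDS = [
    "owner", "steward", "update_frequency",
    "schema", "sensitivity", "lineage_pointer",
]

card_yaml = """
dataset:
  title: Customer Master
  owner: Data Platform Team
  steward: Marketing Data Steward
  update_frequency: P1D
  sensitivity: Confidential
"""

card = yaml.safe_load(card_yaml)["dataset"]
missing = [field for field in REQUIRED_MVM_FIELDS if field not in card]
print("MVM complete" if not missing else f"Missing MVM fields: {missing}")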
Exercises
Do these now. They mirror the graded tasks in the Quick Test.
- Exercise 1 (ex1): Create an MVM template and fill it for a sample dataset.
- Exercise 2 (ex2): Normalize inconsistent column names using ISO 11179 naming style.
Checklist for completion
- Your template includes owner, steward, update frequency, schema, sensitivity, and lineage pointer.
- Each data element has a clear definition and representation term.
- Naming follows object class + property + representation term (a checker sketch follows this list).
- Lineage captures job name, run id, inputs, outputs, and time.
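One way to enforce naming in pull requests is a small automated checker. This sketch assumes snake_case physical names built from object class, property, and an approved representation term; the approved-term list is an illustrative assumption:
# Minimal naming checker sketch for ISO 11179-style names
# (object class + property + representation term). The approved term list
# and the snake_case convention are assumptions for illustration.
import re

REPRESENTATION_TERMS = {"text", "code", "date", "amount", "count", "identifier"}

def check_name(name: str) -> bool:
    """A valid name has at least three parts and ends in an approved term."""
    parts = name.split("_")
    return (
        len(parts) >= 3
        and all(re.fullmatch(r"[a-z][a-z0-9]*", part) for part in parts)
        and parts[-1] in REPRESENTATION_TERMS
    )

for name in ["customer_email_address_text", "custEmail", "order_total_amount"]:
    print(name, "->", "ok" if check_name(name) else "rename")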
Common mistakes and self-check
- Mistake: Treating the schema as the whole of the metadata. Fix: Schema covers structure only; also capture descriptive, administrative, and lineage fields.
- Mistake: Skipping ownership/steward fields. Fix: Make them mandatory in your MVM.
- Mistake: Over-collecting fields. Fix: Start minimal; expand only when it drives decisions.
- Mistake: Inconsistent naming. Fix: Enforce ISO 11179 naming patterns and review in pull requests.
- Mistake: No lineage timestamps/run ids. Fix: Include runId and eventTime in all lineage events.
Self-check prompt
Pick one production dataset. Can a newcomer answer: What is it? Who owns it? How often is it updated? How can it be accessed safely? Where did it come from? If any answer is unclear, add or fix metadata.
Practical projects
- Project 1: Build a lightweight DCAT-style YAML template and populate it for 10 top datasets.
- Project 2: Create an ISO 11179-style data element registry for your core customer fields.
- Project 3: Emit simplified lineage events from one ETL job and render a run-by-run lineage table.
Who this is for
- Data Architects who define catalog and governance standards.
- Data Engineers integrating pipelines with catalogs and lineage.
- Data Stewards maintaining glossaries and classifications.
Prerequisites
- Basic data modeling and schema evolution concepts.
- Familiarity with data catalogs and pipeline orchestration.
- Comfort reading/writing JSON or YAML.
Learning path
- Start with the MVM template and dataset cards.
- Define naming rules and data element definitions.
- Automate lineage and schema capture from one pilot pipeline.
- Expand to domains and review quarterly.
Next steps
- Apply the template to one domain and run a steward review.
- Automate lineage in your main ETL orchestrations.
- Standardize classifications and retention tags.
Mini challenge
Two teams use different dataset cards. In one hour, propose a unified MVM with 10 fields max. Map both teams’ fields to your MVM and note what is dropped and why.
Quick test