Why this matters
Clear, consistent dataset documentation lets teams discover, trust, and safely use data. As a Data Architect, you set the standards that keep metadata consistent across teams and tooling. This reduces rework, prevents misuse (especially with PII), and enables reliable lineage across pipelines.
Real tasks you will face:
- Define a documentation template required for every curated dataset.
- Review dataset docs for completeness before promoting to production.
- Align schema fields with business definitions to avoid conflicting metrics.
- Capture lineage and data quality checks for audits and incident response.
Concept explained simply
Dataset documentation standards are your agreed rules for what every dataset must describe: what it is, who owns it, where it comes from, how fresh it is, how to use it, and what can go wrong.
Mental model: The README for data
Treat every dataset like a product with a README. A good README answers: Who owns it? What does a row represent? What fields exist and what do they mean? How recent is it? What are known pitfalls? How does it connect to other datasets?
The standard: what to document
- Identity
  - Name and logical domain (e.g., marketing, finance).
  - Short description (1–2 sentences).
  - Owner and support contact.
  - Classification: Public/Internal/Restricted; note PII/PHI if present.
- Purpose and usage
  - Business questions it answers.
  - Primary consumers (teams, dashboards, models).
  - Example queries and do/don’t usage notes.
- Structure
  - Granularity: what a row represents.
  - Keys: primary key, natural keys, foreign keys.
  - Partitions/clustering if applicable.
  - Schema table: field, type, description, nullable, example, sensitivity.
- Operational metadata
  - Freshness/SLA (e.g., daily by 06:00 UTC), typical delay, retention (a freshness-check sketch follows this list).
  - Data quality checks (null thresholds, uniqueness, referential integrity).
  - Known issues and caveats.
  - Cost/size notes if relevant (e.g., large-table caution for ad-hoc scans).
- Lineage
  - Upstream sources (systems, datasets) and key transforms.
  - Downstream dependencies (dashboards, models, data products).
  - Change log with dates; mark breaking changes.
- Runbook
  - How to backfill safely.
  - Who to page if SLAs are missed.
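Operational metadata is easiest to trust when it can be checked mechanically. Below is a minimal Python sketch of a freshness check, assuming the documented SLA has been translated into a staleness budget in hours and that you can fetch the table's last load timestamp; the budget field and the timestamps used here are illustrative assumptions, not part of the standard.

from datetime import datetime, timedelta, timezone

def is_stale(last_loaded_at: datetime, max_staleness_hours: float,
             now: datetime | None = None) -> bool:
    """Return True if the dataset has exceeded its documented staleness budget."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > timedelta(hours=max_staleness_hours)

# Illustration: a table documented as "daily by 06:00 UTC" checked against a
# simplified 24-hour budget (real SLAs may need calendar/cron-aware logic).
last_load = datetime(2026, 1, 15, 5, 40, tzinfo=timezone.utc)
print(is_stale(last_load, max_staleness_hours=24,
               now=datetime(2026, 1, 16, 7, 0, tzinfo=timezone.utc)))  # True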
Templates you can reuse
Copy this template when creating a new dataset.
{
"identity": {
"name": "",
"domain": "",
"description": "",
"owner": "team@company.com",
"classification": "Public | Internal | Restricted",
"contains_pii": false
},
"purpose": {
"business_questions": [""],
"primary_consumers": [""],
"usage_notes": ["do:", "dont:"]
},
"structure": {
"granularity": "",
"primary_key": [""],
"foreign_keys": [{"field": "", "references": "dataset.field"}],
"partitions": "",
"schema": [
{"field": "", "type": "", "description": "", "nullable": true, "example": "", "sensitivity": "none|pii|restricted"}
]
},
"operations": {
"freshness_sla": "",
"typical_delay": "",
"retention": "",
"dq_checks": [""],
"known_issues": [""],
"size_notes": ""
},
"lineage": {
"upstream": ["system.dataset"],
"transform_summary": "",
"downstream": ["dashboard/model"],
"change_log": [
{"date": "YYYY-MM-DD", "change": "", "breaking": false}
]
},
"runbook": {
"backfill": [""],
"escalation": "team@company.com"
}
}
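If you store docs in this JSON shape, the schema array can be rendered into the human-readable schema table the standard calls for, so the catalog page and the machine-readable doc never drift apart. A minimal sketch; the file path below is a hypothetical example, and Markdown is just one possible rendering target.

import json

SCHEMA_COLUMNS = ["field", "type", "description", "nullable", "example", "sensitivity"]

def schema_table(doc: dict) -> str:
    """Render a dataset doc's structure.schema array as a Markdown table."""
    rows = ["| " + " | ".join(SCHEMA_COLUMNS) + " |",
            "| " + " | ".join("---" for _ in SCHEMA_COLUMNS) + " |"]
    for col in doc.get("structure", {}).get("schema", []):
        rows.append("| " + " | ".join(str(col.get(c, "")) for c in SCHEMA_COLUMNS) + " |")
    return "\n".join(rows)

# Hypothetical usage: the doc lives alongside the pipeline code.
with open("docs/datasets/marketing_events.json") as f:
    print(schema_table(json.load(f)))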
Worked examples
Example 1 — marketing_events
{
"identity": {
"name": "marketing_events",
"domain": "marketing",
"description": "Events from paid campaigns after standardization and bot filtering.",
"owner": "mkt-data@company.com",
"classification": "Internal",
"contains_pii": false
},
"purpose": {
"business_questions": [
"How many qualified clicks and signups per campaign/week?",
"Which channels drive the lowest CPA?"
],
"primary_consumers": ["Marketing Analytics", "Growth Ops"],
"usage_notes": [
"do: aggregate at day or week for reporting",
"dont: treat row-level counts as human users; some events are de-duplicated"
]
},
"structure": {
"granularity": "One standardized marketing event (click, view, signup)",
"primary_key": ["event_id"],
"foreign_keys": [{"field": "campaign_id", "references": "dim_campaigns.campaign_id"}],
"partitions": "by event_date",
"schema": [
{"field": "event_id", "type": "STRING", "description": "Unique event identifier", "nullable": false, "example": "ev_7f3", "sensitivity": "none"},
{"field": "event_date", "type": "DATE", "description": "Event UTC date", "nullable": false, "example": "2026-01-15", "sensitivity": "none"},
{"field": "campaign_id", "type": "STRING", "description": "Marketing campaign id", "nullable": false, "example": "cmp_123", "sensitivity": "none"},
{"field": "channel", "type": "STRING", "description": "Source channel (paid_search, social, display)", "nullable": false, "example": "paid_search", "sensitivity": "none"},
{"field": "event_type", "type": "STRING", "description": "Type: click|view|signup", "nullable": false, "example": "click", "sensitivity": "none"}
]
},
"operations": {
"freshness_sla": "Hourly by :20",
"typical_delay": "5–15 minutes",
"retention": "18 months",
"dq_checks": [
"event_id unique per partition",
"campaign_id must exist in dim_campaigns",
"event_date not in future"
],
"known_issues": ["Occasional delays from ad network API during weekends"],
"size_notes": "~500M rows/month"
},
"lineage": {
"upstream": ["ad_apis.raw_clicks", "ad_apis.raw_impressions", "app.signup_events"],
"transform_summary": "Standardize fields, dedupe by device+timestamp window, classify channel.",
"downstream": ["dash_campaign_performance", "mta_model_v2"],
"change_log": [
{"date": "2026-01-05", "change": "Added event_type=signup from app data", "breaking": false}
]
},
"runbook": {
"backfill": ["Rebuild partition range with dedupe flag", "Validate dq checks before publish"],
"escalation": "mkt-data@company.com"
}
}
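The dq_checks above are written for humans; each one should also map to an executable assertion. Here is a minimal pandas sketch of the three checks, assuming the events and campaign dimension can be pulled into DataFrames; the toy rows are made up for illustration.

from datetime import date
import pandas as pd

def run_dq_checks(events: pd.DataFrame, dim_campaigns: pd.DataFrame) -> dict:
    """Evaluate the documented marketing_events dq_checks on in-memory frames."""
    return {
        "event_id unique per partition":
            not events.duplicated(subset=["event_date", "event_id"]).any(),
        "campaign_id must exist in dim_campaigns":
            bool(events["campaign_id"].isin(dim_campaigns["campaign_id"]).all()),
        "event_date not in future":
            bool((pd.to_datetime(events["event_date"]).dt.date <= date.today()).all()),
    }

# Toy rows (hypothetical), expected to pass all three checks.
events = pd.DataFrame({
    "event_id": ["ev_7f3", "ev_7f4"],
    "event_date": ["2024-01-15", "2024-01-15"],
    "campaign_id": ["cmp_123", "cmp_123"],
})
dim_campaigns = pd.DataFrame({"campaign_id": ["cmp_123"]})
print(run_dq_checks(events, dim_campaigns))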
Example 2 — customers_dim
{
"identity": {"name": "customers_dim", "domain": "core", "description": "Unified customer profile for analytics.", "owner": "core-data@company.com", "classification": "Restricted", "contains_pii": true},
"structure": {
"granularity": "One row per customer_id",
"primary_key": ["customer_id"],
"schema": [
{"field": "customer_id", "type": "STRING", "description": "Stable surrogate id", "nullable": false, "example": "c_1002", "sensitivity": "none"},
{"field": "email", "type": "STRING", "description": "Primary email (hashed)", "nullable": true, "example": "sha256:...", "sensitivity": "pii"},
{"field": "country", "type": "STRING", "description": "ISO country code", "nullable": true, "example": "DE", "sensitivity": "none"},
{"field": "is_active", "type": "BOOLEAN", "description": "Active subscription flag", "nullable": false, "example": true, "sensitivity": "none"}
]
},
"operations": {"freshness_sla": "Daily by 06:00 UTC", "retention": "Forever (subject to deletion requests)", "dq_checks": ["customer_id unique", "email hashed when present"], "known_issues": []},
"lineage": {"upstream": ["crm.users", "billing.accounts"], "transform_summary": "Identity resolution using email+device graph; hash PII.", "downstream": ["customer_360_dashboard", "churn_model"], "change_log": []}
}
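customers_dim documents email as hashed, and the transform summary says PII is hashed during identity resolution. A minimal sketch of one possible hashing convention matching the sha256:... example value; the normalization rules (trim, lowercase) are assumptions and worth documenting explicitly if you adopt them.

import hashlib

def hash_email(email: str | None) -> str | None:
    """Hash an email into the 'sha256:<hex>' form shown in the customers_dim example."""
    if not email:
        return None
    normalized = email.strip().lower()  # normalization is an assumption, not specified by the doc
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()

print(hash_email("Jane.Doe@Example.com"))  # sha256:... (64 hex characters)

Note that an unsalted hash still lets datasets be joined on the same email; whether that is acceptable for a Restricted dataset is a decision worth recording in usage_notes.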
Example 3 — payments_fact
{
"identity": {"name": "payments_fact", "domain": "finance", "description": "Settled payments with fees and taxes.", "owner": "fin-data@company.com", "classification": "Internal", "contains_pii": false},
"structure": {
"granularity": "One row per settled payment transaction",
"primary_key": ["payment_id"],
"foreign_keys": [{"field": "customer_id", "references": "customers_dim.customer_id"}],
"schema": [
{"field": "payment_id", "type": "STRING", "description": "Payment transaction id", "nullable": false, "example": "p_9001", "sensitivity": "none"},
{"field": "amount", "type": "DECIMAL(18,2)", "description": "Gross amount in payment_currency", "nullable": false, "example": "49.99", "sensitivity": "none"},
{"field": "fee", "type": "DECIMAL(18,2)", "description": "Processing fee", "nullable": false, "example": "1.25", "sensitivity": "none"},
{"field": "payment_currency", "type": "STRING", "description": "ISO currency", "nullable": false, "example": "USD", "sensitivity": "none"},
{"field": "settled_at", "type": "TIMESTAMP", "description": "Settlement time (UTC)", "nullable": false, "example": "2026-01-10T13:45:00Z", "sensitivity": "none"}
]
},
"operations": {"freshness_sla": "Near-real-time (<10m)", "dq_checks": ["payment_id unique", "amount >= 0", "settled_at not null"], "known_issues": ["Occasional duplicate messages handled by idempotency key"]},
"lineage": {"upstream": ["payments_gateway.events"], "transform_summary": "Filter settled, compute fees, enrich with currency metadata.", "downstream": ["revenue_reporting", "finance_close"], "change_log": [{"date": "2025-12-01", "change": "Added fee column", "breaking": false}]}
}
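The known_issues entry notes that duplicate gateway messages are handled by an idempotency key. A minimal in-memory sketch of that pattern; the idempotency_key field name is an assumption for illustration (in this toy data it happens to equal payment_id).

def dedupe_by_idempotency_key(messages: list) -> list:
    """Keep the first message per idempotency key and drop replays."""
    seen = set()
    unique = []
    for msg in messages:
        key = msg["idempotency_key"]  # hypothetical field name
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique

# Toy gateway messages: the second one is a replay of the first.
messages = [
    {"idempotency_key": "p_9001", "amount": "49.99"},
    {"idempotency_key": "p_9001", "amount": "49.99"},
    {"idempotency_key": "p_9002", "amount": "19.99"},
]
print(len(dedupe_by_idempotency_key(messages)))  # 2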
How to write a good dataset doc in 20 minutes
- 5 min: Fill Identity and Purpose. Write a crisp one-line description.
- 5 min: Define Granularity, Keys, and top 10 fields with clear meanings.
- 5 min: Add Freshness/SLA, 3 DQ checks, and Known issues.
- 5 min: Summarize Upstream/Downstream, add one example query, and first Change log entry.
Exercises
Do these in a scratch file or notes app. When you are done, compare your work against the checklist below.
- Exercise 1: Draft a full doc for user_sessions (web/app sessions) using the template.
- Exercise 2: Audit a flawed doc and list missing mandatory items and corrections.
Checklist for both exercises
- Has one-line description and owner?
- Granularity and primary key clearly stated?
- Top fields documented with type and meaning?
- Freshness/SLA and at least 3 data quality checks?
- Lineage: upstream, downstream, and transform summary?
- Classification and PII flags?
- Change log with dates?
Common mistakes and self-check
- Vague granularity (e.g., “events”) instead of a precise statement.
- Missing ownership contact, causing escalation delays.
- Omitting PII classification for fields like email/phone.
- Not stating freshness/SLA, leading to outdated dashboards.
- Schema descriptions explain how a field is computed rather than what it means.
- No change log, making breaking changes invisible.
Self-check
Pick one of your existing dataset docs. In 3 minutes, highlight the sentence that explains granularity, the owner email, and when it updates. If you cannot find these quickly, your docs need revision.
Practical projects
- Create a shared dataset documentation template in your org’s preferred format (JSON or Markdown) and run a 3-dataset pilot.
- Backfill lineage: for one domain, document upstream/downstream and add at least one DQ check per dataset.
- Introduce a “Doc Gate” in CI: a dataset cannot be promoted unless required sections are present (policy and checklist only; implementation details vary by stack).
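A Doc Gate can stay deliberately simple: fail promotion when a doc is missing required sections. A minimal sketch, assuming docs are stored as JSON files under docs/datasets/ and that this script runs as a CI step; both the path and the exact required-field list are assumptions to adapt to your stack.

import json
import pathlib
import sys

# Required sections and identity fields taken from the standard on this page.
REQUIRED_SECTIONS = ["identity", "purpose", "structure", "operations", "lineage", "runbook"]
REQUIRED_IDENTITY_FIELDS = ["name", "description", "owner", "classification", "contains_pii"]

def missing_items(doc: dict) -> list:
    """List required sections and identity fields that are absent or left blank."""
    problems = [s for s in REQUIRED_SECTIONS if s not in doc]
    identity = doc.get("identity", {})
    problems += [f"identity.{f}" for f in REQUIRED_IDENTITY_FIELDS
                 if str(identity.get(f, "")).strip() == ""]
    return problems

if __name__ == "__main__":
    failed = False
    for path in sorted(pathlib.Path("docs/datasets").glob("*.json")):  # hypothetical location
        problems = missing_items(json.loads(path.read_text()))
        if problems:
            failed = True
            print(f"{path.name}: missing {', '.join(problems)}")
    sys.exit(1 if failed else 0)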
Mini challenge
In five sentences or fewer, document a new feature_store.user_churn_features dataset: purpose, granularity, primary key, freshness, and one risk/caveat.
Learning path
- Start here: Dataset Documentation Standards (this page).
- Then: Lineage capture practices (upstream/downstream mapping).
- Next: Data quality checks and SLAs in production.
- Finally: Governance and access classification for regulated data.
Who this is for
- Data Architects defining platform standards.
- Anyone publishing datasets to be reused by others.
Prerequisites
- Basic SQL and understanding of schemas.
- Familiarity with your org’s data domains and pipelines.
Next steps
- Pick one important dataset; bring its documentation up to this standard.
- Schedule a 15-minute review with an adjacent team for clarity feedback.
- Keep docs living: update the change log whenever you alter the dataset.
Quick Test
Take the quick test to check your understanding.