Dataset Documentation Standards

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

Clear, consistent dataset documentation lets teams discover, trust, and safely use data. As a Data Architect, you set the standards that keep metadata consistent across teams and tooling. This reduces rework, prevents misuse (especially with PII), and enables reliable lineage across pipelines.

  • Real tasks you will face:
    • Define a documentation template required for every curated dataset.
    • Review dataset docs for completeness before promoting to production.
    • Align schema fields with business definitions to avoid conflicting metrics.
    • Capture lineage and data quality checks for audits and incident response.

Concept explained simply

Dataset documentation standards are your agreed rules for what every dataset must describe: what it is, who owns it, where it comes from, how fresh it is, how to use it, and what can go wrong.

Mental model: The README for data

Treat every dataset like a product with a README. A good README answers: Who owns it? What does a row represent? What fields exist and what do they mean? How recent is it? What are known pitfalls? How does it connect to other datasets?

The standard: what to document

  • Identity
    • Name and logical domain (e.g., marketing, finance).
    • Short description (1–2 sentences).
    • Owner and support contact.
    • Classification: Public/Internal/Restricted; note PII/PHI if present.
  • Purpose and usage
    • Business questions it answers.
    • Primary consumers (teams, dashboards, models).
    • Example queries and do/don’t usage notes.
  • Structure
    • Granularity: what a row represents.
    • Keys: primary key, natural keys, foreign keys.
    • Partitions/clustering if applicable.
    • Schema table: field, type, description, nullable, example, sensitivity.
  • Operational metadata
    • Freshness/SLA (e.g., daily by 06:00 UTC), typical delay, retention.
    • Data quality checks (null thresholds, uniqueness, referential integrity); see the sketch after this list.
    • Known issues and caveats.
    • Cost/size notes if relevant (e.g., large table caution for ad-hoc scans).
  • Lineage
    • Upstream sources (systems, datasets) and key transforms.
    • Downstream dependencies (dashboards, models, data products).
    • Change log with dates; mark breaking changes.
  • Runbook
    • How to backfill safely.
    • Who to page if SLAs are missed.
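
To make the operational checks concrete, below is a minimal Python sketch of the three DQ check types named above, plus a freshness check against the documented SLA. The sample data, column names, and thresholds are illustrative assumptions, not a real pipeline API.

import pandas as pd

# Illustrative sample data; real checks would read the target tables.
events = pd.DataFrame({
    "event_id": ["ev_1", "ev_2", "ev_3"],
    "campaign_id": ["cmp_1", "cmp_2", "cmp_2"],
    "event_date": pd.to_datetime(["2026-01-15", "2026-01-15", "2026-01-16"]),
})
campaigns = pd.DataFrame({"campaign_id": ["cmp_1", "cmp_2"]})

# Uniqueness: the documented primary key must not repeat.
assert events["event_id"].is_unique, "duplicate event_id"

# Null threshold: at most 1% of campaign_id values may be null.
null_rate = events["campaign_id"].isna().mean()
assert null_rate <= 0.01, f"campaign_id null rate {null_rate:.1%} over threshold"

# Referential integrity: every campaign_id must exist upstream.
orphans = set(events["campaign_id"]) - set(campaigns["campaign_id"])
assert not orphans, f"orphan campaign_ids: {orphans}"

# Freshness: the newest data must be within the documented SLA window.
lag = pd.Timestamp.now() - events["event_date"].max()
assert lag <= pd.Timedelta(days=1), f"data lags the 1-day SLA by {lag}"

In production these assertions would typically live in your orchestration or data quality tooling rather than in ad-hoc scripts, but the semantics of each check carry over.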

Templates you can reuse

Copy this template when creating a new dataset.

{
  "identity": {
    "name": "",
    "domain": "",
    "description": "",
    "owner": "team@company.com",
    "classification": "Public | Internal | Restricted",
    "contains_pii": false
  },
  "purpose": {
    "business_questions": [""],
    "primary_consumers": [""],
    "usage_notes": ["do:", "dont:"]
  },
  "structure": {
    "granularity": "",
    "primary_key": [""],
    "foreign_keys": [{"field": "", "references": "dataset.field"}],
    "partitions": "",
    "schema": [
      {"field": "", "type": "", "description": "", "nullable": true, "example": "", "sensitivity": "none|pii|restricted"}
    ]
  },
  "operations": {
    "freshness_sla": "",
    "typical_delay": "",
    "retention": "",
    "dq_checks": [""],
    "known_issues": [""],
    "size_notes": ""
  },
  "lineage": {
    "upstream": ["system.dataset"],
    "transform_summary": "",
    "downstream": ["dashboard/model"],
    "change_log": [
      {"date": "YYYY-MM-DD", "change": "", "breaking": false}
    ]
  },
  "runbook": {
    "backfill": [""],
    "escalation": "team@company.com"
  }
}

Worked examples

Example 1 — marketing_events
{
  "identity": {
    "name": "marketing_events",
    "domain": "marketing",
    "description": "Events from paid campaigns after standardization and bot filtering.",
    "owner": "mkt-data@company.com",
    "classification": "Internal",
    "contains_pii": false
  },
  "purpose": {
    "business_questions": [
      "How many qualified clicks and signups per campaign/week?",
      "Which channels drive the lowest CPA?"
    ],
    "primary_consumers": ["Marketing Analytics", "Growth Ops"],
    "usage_notes": [
      "do: aggregate at day or week for reporting",
      "dont: treat row-level counts as human users; some events are de-duplicated"
    ]
  },
  "structure": {
    "granularity": "One standardized marketing event (click, view, signup)",
    "primary_key": ["event_id"],
    "foreign_keys": [{"field": "campaign_id", "references": "dim_campaigns.campaign_id"}],
    "partitions": "by event_date",
    "schema": [
      {"field": "event_id", "type": "STRING", "description": "Unique event identifier", "nullable": false, "example": "ev_7f3", "sensitivity": "none"},
      {"field": "event_date", "type": "DATE", "description": "Event UTC date", "nullable": false, "example": "2026-01-15", "sensitivity": "none"},
      {"field": "campaign_id", "type": "STRING", "description": "Marketing campaign id", "nullable": false, "example": "cmp_123", "sensitivity": "none"},
      {"field": "channel", "type": "STRING", "description": "Source channel (paid_search, social, display)", "nullable": false, "example": "paid_search", "sensitivity": "none"},
      {"field": "event_type", "type": "STRING", "description": "Type: click|view|signup", "nullable": false, "example": "click", "sensitivity": "none"}
    ]
  },
  "operations": {
    "freshness_sla": "Hourly by :20",
    "typical_delay": "5–15 minutes",
    "retention": "18 months",
    "dq_checks": [
      "event_id unique per partition",
      "campaign_id must exist in dim_campaigns",
      "event_date not in future"
    ],
    "known_issues": ["Occasional delays from ad network API during weekends"],
    "size_notes": "~500M rows/month"
  },
  "lineage": {
    "upstream": ["ad_apis.raw_clicks", "ad_apis.raw_impressions", "app.signup_events"],
    "transform_summary": "Standardize fields, dedupe by device+timestamp window, classify channel.",
    "downstream": ["dash_campaign_performance", "mta_model_v2"],
    "change_log": [
      {"date": "2026-01-05", "change": "Added event_type=signup from app data", "breaking": false}
    ]
  },
  "runbook": {
    "backfill": ["Rebuild partition range with dedupe flag", "Validate dq checks before publish"],
    "escalation": "mkt-data@company.com"
  }
}

Example 2 — customers_dim
{
  "identity": {"name": "customers_dim", "domain": "core", "description": "Unified customer profile for analytics.", "owner": "core-data@company.com", "classification": "Restricted", "contains_pii": true},
  "structure": {
    "granularity": "One row per customer_id",
    "primary_key": ["customer_id"],
    "schema": [
      {"field": "customer_id", "type": "STRING", "description": "Stable surrogate id", "nullable": false, "example": "c_1002", "sensitivity": "none"},
      {"field": "email", "type": "STRING", "description": "Primary email (hashed)", "nullable": true, "example": "sha256:...", "sensitivity": "pii"},
      {"field": "country", "type": "STRING", "description": "ISO country code", "nullable": true, "example": "DE", "sensitivity": "none"},
      {"field": "is_active", "type": "BOOLEAN", "description": "Active subscription flag", "nullable": false, "example": true, "sensitivity": "none"}
    ]
  },
  "operations": {"freshness_sla": "Daily by 06:00 UTC", "retention": "Forever (subject to deletion requests)", "dq_checks": ["customer_id unique", "email hashed when present"], "known_issues": []},
  "lineage": {"upstream": ["crm.users", "billing.accounts"], "transform_summary": "Identity resolution using email+device graph; hash PII.", "downstream": ["customer_360_dashboard", "churn_model"], "change_log": []}
}

Example 3 — payments_fact
{
  "identity": {"name": "payments_fact", "domain": "finance", "description": "Settled payments with fees and taxes.", "owner": "fin-data@company.com", "classification": "Internal", "contains_pii": false},
  "structure": {
    "granularity": "One row per settled payment transaction",
    "primary_key": ["payment_id"],
    "foreign_keys": [{"field": "customer_id", "references": "customers_dim.customer_id"}],
    "schema": [
      {"field": "payment_id", "type": "STRING", "description": "Payment transaction id", "nullable": false, "example": "p_9001", "sensitivity": "none"},
      {"field": "amount", "type": "DECIMAL(18,2)", "description": "Gross amount in payment_currency", "nullable": false, "example": "49.99", "sensitivity": "none"},
      {"field": "fee", "type": "DECIMAL(18,2)", "description": "Processing fee", "nullable": false, "example": "1.25", "sensitivity": "none"},
      {"field": "payment_currency", "type": "STRING", "description": "ISO currency", "nullable": false, "example": "USD", "sensitivity": "none"},
      {"field": "settled_at", "type": "TIMESTAMP", "description": "Settlement time (UTC)", "nullable": false, "example": "2026-01-10T13:45:00Z", "sensitivity": "none"}
    ]
  },
  "operations": {"freshness_sla": "Near-real-time (<10m)", "dq_checks": ["payment_id unique", "amount >= 0", "settled_at not null"], "known_issues": ["Occasional duplicate messages handled by idempotency key"]},
  "lineage": {"upstream": ["payments_gateway.events"], "transform_summary": "Filter settled, compute fees, enrich with currency metadata.", "downstream": ["revenue_reporting", "finance_close"], "change_log": [{"date": "2025-12-01", "change": "Added fee column", "breaking": false}]}
}

How to write a good dataset doc in 20 minutes

  1. 5 min: Fill Identity and Purpose. Write a crisp one-line description.
  2. 5 min: Define Granularity, Keys, and top 10 fields with clear meanings; a schema-rendering sketch follows this list.
  3. 5 min: Add Freshness/SLA, 3 DQ checks, and Known issues.
  4. 5 min: Summarize Upstream/Downstream, add one example query, and first Change log entry.
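
For step 2, a small generator can turn the template's schema array into the Markdown table readers actually see. This is a sketch under assumptions: render_schema_table and the sample row are hypothetical names, not part of any standard tooling.

# Render the "schema" array from the JSON template as a Markdown table.
def render_schema_table(schema: list[dict]) -> str:
    header = "| field | type | description | nullable | example | sensitivity |"
    divider = "| --- | --- | --- | --- | --- | --- |"
    rows = [
        "| {field} | {type} | {description} | {nullable} | {example} | {sensitivity} |".format(**col)
        for col in schema
    ]
    return "\n".join([header, divider, *rows])

schema = [{"field": "event_id", "type": "STRING",
           "description": "Unique event identifier", "nullable": False,
           "example": "ev_7f3", "sensitivity": "none"}]
print(render_schema_table(schema))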

Exercises

Do these in a scratch file or notes app. When you are done, compare your work against the checklist below.

  • Exercise 1: Draft a full doc for user_sessions (web/app sessions) using the template.
  • Exercise 2: Audit a flawed doc and list missing mandatory items and corrections.

Checklist for both exercises
  • Has one-line description and owner?
  • Granularity and primary key clearly stated?
  • Top fields documented with type and meaning?
  • Freshness/SLA and at least 3 data quality checks?
  • Lineage: upstream, downstream, and transform summary?
  • Classification and PII flags?
  • Change log with dates?

Common mistakes and self-check

  • Vague granularity (e.g., “events”) instead of a precise statement.
  • Missing ownership contact, causing escalation delays.
  • Omitting PII classification for fields like email/phone.
  • Not stating freshness/SLA, leading to outdated dashboards.
  • Field descriptions that restate how a value is computed instead of what it means.
  • No change log, making breaking changes invisible.

Self-check

Pick one of your existing dataset docs. In 3 minutes, highlight the sentence that explains granularity, the owner email, and when it updates. If you cannot find these quickly, your docs need revision.

Practical projects

  • Create a shared dataset documentation template in your org’s preferred format (JSON or Markdown) and run a 3-dataset pilot.
  • Backfill lineage: for one domain, document upstream/downstream and add at least one DQ check per dataset.
  • Introduce a “Doc Gate” in CI: a dataset cannot be promoted unless required sections are present (policy and checklist only; implementation details vary by stack).
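
For the Doc Gate, one possible shape is a Python sketch that fails CI when a doc is missing required sections. The section names below match the JSON template above; the file handling, required-field policy, and CI wiring are assumptions to adapt to your stack.

import json
import sys

# Sections and identity fields this sketch treats as mandatory (an
# assumed policy; tune to your own standard).
REQUIRED_SECTIONS = ["identity", "purpose", "structure", "operations", "lineage", "runbook"]
REQUIRED_IDENTITY_FIELDS = ["name", "owner", "classification", "contains_pii"]

def find_gaps(doc: dict) -> list[str]:
    gaps = [s for s in REQUIRED_SECTIONS if s not in doc]
    gaps += [f"identity.{k}" for k in REQUIRED_IDENTITY_FIELDS
             if k not in doc.get("identity", {})]
    if not doc.get("structure", {}).get("granularity"):
        gaps.append("structure.granularity is empty")
    return gaps

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        gaps = find_gaps(json.load(f))
    if gaps:
        print("Doc Gate failed, missing:", ", ".join(gaps))
        sys.exit(1)
    print("Doc Gate passed")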

Mini challenge

In five sentences or fewer, document a new feature_store.user_churn_features dataset: purpose, granularity, primary key, freshness, and one risk/caveat.

Learning path

  • Start here: Dataset Documentation Standards (this page).
  • Then: Lineage capture practices (upstream/downstream mapping).
  • Next: Data quality checks and SLAs in production.
  • Finally: Governance and access classification for regulated data.

Who this is for

  • Data Architects defining platform standards.
  • Anyone publishing datasets to be reused by others.

Prerequisites

  • Basic SQL and understanding of schemas.
  • Familiarity with your org’s data domains and pipelines.

Next steps

  • Pick one important dataset; bring its documentation up to this standard.
  • Schedule a 15-minute review with an adjacent team for clarity feedback.
  • Keep docs living: update the change log whenever you alter the dataset.

Quick Test

Take the quick test to check your understanding: 7 questions, pass with 70% or higher.

Practice Exercises

2 exercises to complete

Instructions

Using the template, create documentation for a user_sessions dataset that aggregates web/app sessions.

  • Assume: one row = one session; primary key = session_id; foreign key = customer_id to customers_dim.
  • Fields: session_id (STRING), customer_id (STRING, optional), started_at (TIMESTAMP), duration_sec (INT), device_type (STRING), source_channel (STRING), country (STRING).
  • Freshness: daily by 05:00 UTC; retention: 400 days.
  • Include at least 3 data quality checks, and a lineage summary referencing raw web logs and mobile analytics.

Expected Output
A complete JSON or Markdown-like doc covering identity, purpose, structure (granularity, keys, schema), operations (freshness, DQ checks, retention), lineage (upstream/downstream, transform summary), runbook, and a change log entry.
