luvv to helpDiscover the Best Free Online Tools
Topic 3 of 8

Data Dictionaries

Learn Data Dictionaries for free with explanations, exercises, and a quick test (for Data Engineer).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

As a Data Engineer, you ship tables and pipelines that others rely on. A clear, current data dictionary turns your datasets into a trustworthy product. It helps teammates find the right columns, reduces Slack questions, speeds onboarding, and prevents costly misinterpretations during analytics, ML feature creation, and compliance audits.

  • Real task: Document a new revenue table so analysts can build dashboards without guessing definitions.
  • Real task: Flag columns with personal data for governance and access control.
  • Real task: Explain a breaking schema change and its impact to downstream consumers.
  • Real task: Debug weird metrics by checking column definitions, valid values, and data freshness.

Concept explained simply

A data dictionary is a structured description of your datasets and fields: names, meanings, data types, allowed values, owners, refresh schedules, lineage, and quality expectations. Think of it as the single source of truth for β€œwhat this data means and how to use it.”

Mental model

  • Glossary: clear business-friendly definitions.
  • Contract: stable schema, rules, and expectations for consumers.
  • Map: where data comes from, how it’s transformed, and who to contact.

What to include in a good data dictionary

Dataset-level (table or view)
  • Name and short description
  • Owner/team and contact
  • Source system(s) and brief lineage
  • Refresh schedule and data freshness SLA
  • Row granularity (e.g., one row per order)
  • Access level/privacy class (e.g., contains PII: Yes/No)
  • Quality checks (e.g., row count, not-null, referential integrity)
  • Change log (notable schema/definition changes)
Field-level (columns)
  • Column name and business definition
  • Data type and format (e.g., TIMESTAMP in UTC)
  • Nullability and default value
  • Allowed values or range; units (if numeric)
  • Example value
  • Keys and uniqueness (PK/FK)
  • PII flag and sensitivity notes
  • Computation/transform notes (if derived)
Values and codes (enumerations)
  • List the codes with human-friendly meanings
  • State what happens to unknown/legacy values
  • Provide mapping guidance for downstream reports

Worked examples

Example 1 β€” sales.orders
  • Dataset: sales.orders β€” One row per order. Refreshed hourly. Owner: Sales Data Team.
  • Columns:
    • order_id (STRING) β€” Primary key. Not null. Example: "O-987654"
    • customer_id (STRING) β€” Foreign key to crm.customers.customer_id. Not null.
    • order_ts (TIMESTAMP, UTC) β€” When the order was placed. Nullable (some historical imports missing time).
    • order_status (STRING) β€” Allowed: {"pending","paid","shipped","cancelled"}. Example: "paid"
    • total_amount (DECIMAL(12,2), USD) β€” Sum of item_price * qty - discounts. Not null. Example: 125.50
    • payment_method (STRING) β€” Allowed: {"card","paypal","wire"}. Unknowns map to "other".
    • email (STRING) β€” PII: Yes. Customer contact email captured at checkout.
Example 2 β€” analytics.user_dim
  • Dataset: analytics.user_dim β€” One row per active user-id. Daily snapshot at 00:30 UTC. Owner: Analytics Eng.
  • Columns:
    • user_id (STRING) β€” Business key from identity service. Not null. Unique.
    • first_seen_date (DATE) β€” First event date. Not null.
    • country_code (STRING) β€” ISO-3166-1 alpha-2. Nullable. Example: "US"
    • is_marketing_opt_in (BOOLEAN) β€” True if user consented to marketing. Default: false.
    • lifetime_value_usd (DECIMAL(12,2)) β€” Sum of net revenue attributed to user. Derived from orders.
    • email (STRING) β€” PII: Yes. Masked in restricted views.
Example 3 β€” product.events_fact
  • Dataset: product.events_fact β€” One row per user event. Partitioned by event_date. Streaming updates; eventual consistency < 5 minutes.
  • Columns:
    • event_id (STRING) β€” Unique event identifier. Not null.
    • event_type (STRING) β€” Allowed: {"page_view","signup","purchase","logout"}. Unknowns map to "other".
    • event_ts (TIMESTAMP, UTC) β€” Event timestamp. Not null.
    • user_id (STRING) β€” Nullable (anonymous events). FK to analytics.user_dim.user_id when present.
    • device_type (STRING) β€” {"web","ios","android"}. Nullable.
    • metadata (VARIANT/JSON) β€” Semi-structured attributes; schema-on-read.

Creating a data dictionary in practice

  1. Inventory: list datasets to document (start with top 10 used tables).
  2. Auto-extract schema: pull column names/types/nullability from your warehouse (e.g., DESCRIBE/INFORMATION_SCHEMA).
  3. Add business definitions: meet with domain owners to write clear, non-ambiguous meanings.
  4. Define values/ranges: enumerate codes, units, and constraints. Add examples.
  5. Quality rules: note not-null, uniqueness, referential checks, and freshness SLA.
  6. Ownership and access: owner team, PII flags, and privacy classification.
  7. Version and change log: record added/renamed/deprecated fields with dates.
  8. Publish and maintain: store where teammates already look (catalog, repo docs). Update with each schema change.

Maintaining and versioning

  • Treat the dictionary like code: propose changes, review, and version.
  • Deprecate safely: mark column as deprecated with removal date and replacement guidance.
  • Automate drift detection: compare live schema to documented schema regularly.
  • Record breaking changes: what changed, why, and impact on consumers.
Change log template
  • Date:
  • Owner:
  • Change type: Added | Renamed | Deprecated | Semantic change | Privacy change
  • Scope: dataset/column
  • Details:
  • Impact and migration guidance:

Common mistakes (and how to self-check)

  • Too technical only: definitions that repeat the type but not the meaning. Fix: add business context and examples.
  • Stale docs: dictionary not updated with schema changes. Fix: update alongside PRs and releases.
  • Missing owners: no contact for questions. Fix: set an accountable team.
  • Ignoring enumerations: status fields without allowed values. Fix: list values and meanings.
  • Overstuffed fields: long essays. Fix: be concise; move details to notes.
Self-check checklist
  • Each table has owner, granularity, refresh, and SLA
  • Each column has definition, type, nullability, example
  • PII flagged and access level noted
  • Enumerations fully listed with meanings
  • Change log updated for last release

Exercises

Do these now. They mirror the exercises listed below and help you practice core patterns.

Exercise 1 β€” Draft a payments table dictionary (ID: ex1)

Write a concise data dictionary for a new dataset payments.payments. Include dataset info and at least five columns: payment_id, order_id, user_id, payment_status, amount_usd, paid_at.

  • Focus on business definitions, types, nullability, allowed values, examples, and PII flags if any.
Need a template?
  • Dataset: name, granularity, refresh, owner, PII, SLA
  • payment_id β€” type, meaning, nullability, example
  • order_id β€” type, FK, nullability
  • user_id β€” type, FK, nullability, PII?
  • payment_status β€” allowed values, default/unknown handling
  • amount_usd β€” type/units, range
  • paid_at β€” timestamp format/timezone
Exercise 2 β€” Fix a flawed dictionary (ID: ex2)

Given this snippet, identify and correct issues: vague definitions, missing nullability, inconsistent types, and absent allowed values.

{
  "dataset": "sales.refunds",
  "columns": [
    {"name": "refund_id", "type": "number"},
    {"name": "order_id", "type": "text"},
    {"name": "status", "type": "text", "definition": "state"},
    {"name": "amount", "type": "float"},
    {"name": "processed_at", "type": "date"}
  ]
}
  • Checklist after exercises:
    • Your dataset has owner, granularity, refresh, and PII notes
    • Every column has definition, type, nullability, example
    • Enumerations are explicit
    • FKs and PKs identified

Practical projects

  1. Top-10 Dictionary Sprint: Pick your 10 most-used tables. Auto-extract schema, add business definitions, publish, and announce. Timebox to 1–2 days.
  2. Schema-to-Docs Automation: Query INFORMATION_SCHEMA to export a column inventory (name, type, nullable) and render to a shared doc or sheet. Add manual fields for business definitions and PII.
  3. Governance Upgrade: Add PII classification and enumerations across all user-facing tables. Create a lightweight review checklist for future PRs.

Learning path

  1. Learn table design basics (naming, keys, data types).
  2. Write dataset-level docs: owner, granularity, refresh, SLA, privacy.
  3. Write field-level docs: definitions, examples, constraints.
  4. Add enumerations and value mappings.
  5. Automate schema extraction and drift checks.
  6. Adopt versioning and change logs for documentation.

Who this is for

  • Data Engineers building or maintaining pipelines and models
  • Analytics Engineers documenting semantic layers
  • DBAs and Platform Engineers improving data discoverability

Prerequisites

  • Basic SQL (SELECT, JOINs, data types)
  • Familiarity with your data warehouse structure
  • Understanding of PII and privacy basics

Mini challenge

For a column shipment_status in logistics.shipments, write allowed values and meanings. Add nullability, example, and guidance for unknowns. Keep it under 6 lines.

Assessment and progress

Take the Quick Test below to check your understanding. The test is available to everyone. Sign in to save your progress and resume later.

Next steps

  • Expand documentation to your most-consumed datasets.
  • Integrate dictionary updates into your development workflow.
  • Collaborate with domain owners to keep definitions aligned with business changes.

Practice Exercises

2 exercises to complete

Instructions

Create a concise data dictionary for payments.payments with at least these columns: payment_id, order_id, user_id, payment_status, amount_usd, paid_at.

  • Include dataset-level details: granularity, refresh schedule, owner, PII, and SLA.
  • For each column: type, business definition, nullability, allowed values (if applicable), example value, and PII flag if relevant.
Expected Output
A short, structured description covering dataset info and clear field-level entries with types, definitions, nullability, allowed values, and examples.

Data Dictionaries β€” Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

8 questions70% to pass

Have questions about Data Dictionaries?

AI Assistant

Ask questions about this tool