Why this matters
As a Data Engineer, you ship tables and pipelines that others rely on. A clear, current data dictionary turns your datasets into a trustworthy product. It helps teammates find the right columns, reduces Slack questions, speeds onboarding, and prevents costly misinterpretations during analytics, ML feature creation, and compliance audits.
- Real task: Document a new revenue table so analysts can build dashboards without guessing definitions.
- Real task: Flag columns with personal data for governance and access control.
- Real task: Explain a breaking schema change and its impact to downstream consumers.
- Real task: Debug weird metrics by checking column definitions, valid values, and data freshness.
Concept explained simply
A data dictionary is a structured description of your datasets and fields: names, meanings, data types, allowed values, owners, refresh schedules, lineage, and quality expectations. Think of it as the single source of truth for "what this data means and how to use it."
Mental model
- Glossary: clear business-friendly definitions.
- Contract: stable schema, rules, and expectations for consumers.
- Map: where data comes from, how it's transformed, and who to contact.
What to include in a good data dictionary
Dataset-level (table or view)
- Name and short description
- Owner/team and contact
- Source system(s) and brief lineage
- Refresh schedule and data freshness SLA
- Row granularity (e.g., one row per order)
- Access level/privacy class (e.g., contains PII: Yes/No)
- Quality checks (e.g., row count, not-null, referential integrity)
- Change log (notable schema/definition changes)
Field-level (columns)
- Column name and business definition
- Data type and format (e.g., TIMESTAMP in UTC)
- Nullability and default value
- Allowed values or range; units (if numeric)
- Example value
- Keys and uniqueness (PK/FK)
- PII flag and sensitivity notes
- Computation/transform notes (if derived)
Values and codes (enumerations)
- List the codes with human-friendly meanings
- State what happens to unknown/legacy values
- Provide mapping guidance for downstream reports
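The guidance above can be sketched as a simple lookup: document each code with a human-friendly meaning and route unknown or legacy codes to an explicit fallback. This is a minimal sketch in Python; the status codes and fallback text are illustrative, not a standard.

```python
# Hypothetical enumeration for an order_status column, with explicit
# handling for unknown/legacy codes (names and meanings are illustrative).
ORDER_STATUS = {
    "pending": "Order placed, payment not yet captured",
    "paid": "Payment captured successfully",
    "shipped": "Order handed to the carrier",
    "cancelled": "Order cancelled before shipment",
}

def describe_status(code: str) -> str:
    # Unknown/legacy values map to a documented fallback instead of failing,
    # so downstream reports never silently drop rows.
    return ORDER_STATUS.get(code, "other (undocumented legacy code)")
```

Publishing the mapping alongside the dictionary lets report authors copy it verbatim instead of reverse-engineering codes from the data.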
Worked examples
Example 1 - sales.orders
- Dataset: sales.orders - One row per order. Refreshed hourly. Owner: Sales Data Team.
- Columns:
  - order_id (STRING) - Primary key. Not null. Example: "O-987654"
  - customer_id (STRING) - Foreign key to crm.customers.customer_id. Not null.
  - order_ts (TIMESTAMP, UTC) - When the order was placed. Nullable (some historical imports are missing the time).
  - order_status (STRING) - Allowed: {"pending","paid","shipped","cancelled"}. Example: "paid"
  - total_amount (DECIMAL(12,2), USD) - Sum of item_price * qty - discounts. Not null. Example: 125.50
  - payment_method (STRING) - Allowed: {"card","paypal","wire"}. Unknowns map to "other".
  - email (STRING) - PII: Yes. Customer contact email captured at checkout.
Example 2 - analytics.user_dim
- Dataset: analytics.user_dim - One row per active user ID. Daily snapshot at 00:30 UTC. Owner: Analytics Eng.
- Columns:
  - user_id (STRING) - Business key from identity service. Not null. Unique.
  - first_seen_date (DATE) - First event date. Not null.
  - country_code (STRING) - ISO-3166-1 alpha-2. Nullable. Example: "US"
  - is_marketing_opt_in (BOOLEAN) - True if user consented to marketing. Default: false.
  - lifetime_value_usd (DECIMAL(12,2)) - Sum of net revenue attributed to user. Derived from orders.
  - email (STRING) - PII: Yes. Masked in restricted views.
Example 3 - product.events_fact
- Dataset: product.events_fact - One row per user event. Partitioned by event_date. Streaming updates; eventual consistency < 5 minutes.
- Columns:
  - event_id (STRING) - Unique event identifier. Not null.
  - event_type (STRING) - Allowed: {"page_view","signup","purchase","logout"}. Unknowns map to "other".
  - event_ts (TIMESTAMP, UTC) - Event timestamp. Not null.
  - user_id (STRING) - Nullable (anonymous events). FK to analytics.user_dim.user_id when present.
  - device_type (STRING) - {"web","ios","android"}. Nullable.
  - metadata (VARIANT/JSON) - Semi-structured attributes; schema-on-read.
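Entries like the ones above can also be kept as structured records rather than prose, which makes them easy to validate and render automatically. A minimal sketch, assuming the field-level attributes listed earlier; the class and attribute names are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

# Illustrative field-level dictionary entry. Attributes mirror the
# field-level checklist: definition, type, nullability, PII flag,
# allowed values, and an example value.
@dataclass
class ColumnDoc:
    name: str
    data_type: str
    definition: str
    nullable: bool = True
    pii: bool = False
    allowed_values: list = field(default_factory=list)
    example: str = ""

# One entry from the sales.orders example, expressed as a record.
order_status = ColumnDoc(
    name="order_status",
    data_type="STRING",
    definition="Current lifecycle state of the order.",
    nullable=False,
    allowed_values=["pending", "paid", "shipped", "cancelled"],
    example="paid",
)
```

Keeping entries as data means the same source can feed a catalog page, a validation script, and drift checks.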
Creating a data dictionary in practice
- Inventory: list datasets to document (start with top 10 used tables).
- Auto-extract schema: pull column names/types/nullability from your warehouse (e.g., DESCRIBE/INFORMATION_SCHEMA).
- Add business definitions: meet with domain owners to write clear, unambiguous meanings.
- Define values/ranges: enumerate codes, units, and constraints. Add examples.
- Quality rules: note not-null, uniqueness, referential checks, and freshness SLA.
- Ownership and access: owner team, PII flags, and privacy classification.
- Version and change log: record added/renamed/deprecated fields with dates.
- Publish and maintain: store where teammates already look (catalog, repo docs). Update with each schema change.
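The auto-extract step above can be sketched in a few lines. Most warehouses expose INFORMATION_SCHEMA.COLUMNS for this; the example below uses SQLite's PRAGMA table_info only as a runnable stand-in, and the table is a made-up fixture.

```python
import sqlite3

# Stand-in for a warehouse: create a small table to extract schema from.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id TEXT NOT NULL,
        order_ts TIMESTAMP,
        total_amount DECIMAL(12,2) NOT NULL
    )
""")

def extract_columns(conn, table):
    # PRAGMA table_info rows: (cid, name, type, notnull, default, pk).
    # On a real warehouse you would SELECT column_name, data_type,
    # is_nullable FROM INFORMATION_SCHEMA.COLUMNS instead.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {"name": r[1], "type": r[2], "nullable": not r[3]}
        for r in rows
    ]

inventory = extract_columns(conn, "orders")
```

The extracted inventory covers names, types, and nullability; business definitions, PII flags, and enumerations still need to be filled in by hand.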
Maintaining and versioning
- Treat the dictionary like code: propose changes, review, and version.
- Deprecate safely: mark column as deprecated with removal date and replacement guidance.
- Automate drift detection: compare live schema to documented schema regularly.
- Record breaking changes: what changed, why, and impact on consumers.
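Drift detection from the list above reduces to a set comparison between the documented schema and the live one. A minimal sketch, assuming both are dicts of column name to declared type (the inputs are illustrative, not tied to a specific tool):

```python
# Compare documented vs. live schema and report added, removed,
# and retyped columns. Inputs: {column_name: declared_type}.
def schema_drift(documented: dict, live: dict) -> dict:
    return {
        "added": sorted(set(live) - set(documented)),
        "removed": sorted(set(documented) - set(live)),
        "retyped": sorted(
            c for c in set(documented) & set(live)
            if documented[c] != live[c]
        ),
    }

# Hypothetical example: a new column appeared and a precision changed.
documented = {"order_id": "STRING", "total_amount": "DECIMAL(12,2)"}
live = {"order_id": "STRING", "total_amount": "DECIMAL(14,2)", "currency": "STRING"}
drift = schema_drift(documented, live)
```

Run a check like this on a schedule or in CI; any non-empty bucket means the dictionary and the warehouse have diverged.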
Change log template
- Date:
- Owner:
- Change type: Added | Renamed | Deprecated | Semantic change | Privacy change
- Scope: dataset/column
- Details:
- Impact and migration guidance:
Common mistakes (and how to self-check)
- Too technical only: definitions that repeat the type but not the meaning. Fix: add business context and examples.
- Stale docs: dictionary not updated with schema changes. Fix: update alongside PRs and releases.
- Missing owners: no contact for questions. Fix: set an accountable team.
- Ignoring enumerations: status fields without allowed values. Fix: list values and meanings.
- Overstuffed fields: long essays. Fix: be concise; move details to notes.
Self-check checklist
- Each table has owner, granularity, refresh, and SLA
- Each column has definition, type, nullability, example
- PII flagged and access level noted
- Enumerations fully listed with meanings
- Change log updated for last release
Exercises
Do these now. They mirror real documentation tasks and help you practice core patterns.
Exercise 1 - Draft a payments table dictionary (ID: ex1)
Write a concise data dictionary for a new dataset payments.payments. Include dataset-level info and at least these columns: payment_id, order_id, user_id, payment_status, amount_usd, paid_at.
- Focus on business definitions, types, nullability, allowed values, examples, and PII flags if any.
Need a template?
- Dataset: name, granularity, refresh, owner, PII, SLA
- payment_id - type, meaning, nullability, example
- order_id - type, FK, nullability
- user_id - type, FK, nullability, PII?
- payment_status - allowed values, default/unknown handling
- amount_usd - type/units, range
- paid_at - timestamp format/timezone
Exercise 2 - Fix a flawed dictionary (ID: ex2)
Given this snippet, identify and correct issues: vague definitions, missing nullability, inconsistent types, and absent allowed values.
{
  "dataset": "sales.refunds",
  "columns": [
    {"name": "refund_id", "type": "number"},
    {"name": "order_id", "type": "text"},
    {"name": "status", "type": "text", "definition": "state"},
    {"name": "amount", "type": "float"},
    {"name": "processed_at", "type": "date"}
  ]
}
Checklist after exercises:
- Your dataset has owner, granularity, refresh, and PII notes
- Every column has definition, type, nullability, example
- Enumerations are explicit
- FKs and PKs identified
Practical projects
- Top-10 Dictionary Sprint: Pick your 10 most-used tables. Auto-extract schema, add business definitions, publish, and announce. Timebox to 1β2 days.
- Schema-to-Docs Automation: Query INFORMATION_SCHEMA to export a column inventory (name, type, nullable) and render to a shared doc or sheet. Add manual fields for business definitions and PII.
- Governance Upgrade: Add PII classification and enumerations across all user-facing tables. Create a lightweight review checklist for future PRs.
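The Schema-to-Docs project above boils down to turning a column inventory into a shareable page. A minimal sketch that renders rows to a Markdown table; the input shape and field names are illustrative, matching the inventory format from the extraction step.

```python
# Render a column inventory to a Markdown table. Each column dict is
# expected to carry name, type, nullable, and optionally a definition;
# missing definitions are flagged as TODO for manual follow-up.
def render_markdown(table: str, columns: list) -> str:
    lines = [
        f"### {table}",
        "| Column | Type | Nullable | Definition |",
        "| --- | --- | --- | --- |",
    ]
    for c in columns:
        lines.append(
            f"| {c['name']} | {c['type']} | "
            f"{'yes' if c['nullable'] else 'no'} | {c.get('definition', 'TODO')} |"
        )
    return "\n".join(lines)

# Hypothetical inventory for sales.orders.
doc = render_markdown("sales.orders", [
    {"name": "order_id", "type": "STRING", "nullable": False,
     "definition": "Primary key; unique order identifier."},
    {"name": "order_ts", "type": "TIMESTAMP", "nullable": True},
])
```

The TODO markers make it obvious which manual fields (definitions, PII flags) still need a pass from domain owners.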
Learning path
- Learn table design basics (naming, keys, data types).
- Write dataset-level docs: owner, granularity, refresh, SLA, privacy.
- Write field-level docs: definitions, examples, constraints.
- Add enumerations and value mappings.
- Automate schema extraction and drift checks.
- Adopt versioning and change logs for documentation.
Who this is for
- Data Engineers building or maintaining pipelines and models
- Analytics Engineers documenting semantic layers
- DBAs and Platform Engineers improving data discoverability
Prerequisites
- Basic SQL (SELECT, JOINs, data types)
- Familiarity with your data warehouse structure
- Understanding of PII and privacy basics
Mini challenge
For a column shipment_status in logistics.shipments, write allowed values and meanings. Add nullability, example, and guidance for unknowns. Keep it under 6 lines.
Assessment and progress
Take the Quick Test below to check your understanding.
Next steps
- Expand documentation to your most-consumed datasets.
- Integrate dictionary updates into your development workflow.
- Collaborate with domain owners to keep definitions aligned with business changes.