Why this matters
As a Data Platform Engineer, you build trustworthy data. A data dictionary and business glossary are how teams agree on what data means and how it should be used. They reduce rework, prevent broken dashboards, and make compliance simpler.
- Real tasks you will do: define column-level metadata; capture metric formulas; tag PII; align teams on terms like "Active Customer"; document lineage for audits; standardize naming.
- Outcome: faster onboarding, fewer conflicting definitions, and a safer, compliant platform.
Who this is for
- Data Platform Engineers and Data Engineers who publish datasets.
- Analytics Engineers and BI Developers defining metrics and models.
- Data Stewards and Product Analysts who curate definitions.
Prerequisites
- Basic SQL (SELECT, JOIN, COUNT).
- Understanding of tables, columns, data types, and constraints.
- Familiarity with your organization’s main data domains (e.g., customers, orders, subscriptions).
Concept explained simply
Think of the data dictionary as the parts list for your data tables (column-level details). The business glossary is the rulebook of shared meanings (business terms and metrics). Together, they ensure that when someone says "Active Customer," everyone calculates it the same way and knows which columns and tables support it.
Mental model
Use a two-layer mental model:
- Layer 1: Physical layer (Data Dictionary) — what exists in storage: columns, types, nullability, allowed values, owners, quality checks, sensitivity.
- Layer 2: Semantic layer (Business Glossary) — shared meaning: term definitions, formulas, scope, exclusions, synonyms, and source-of-truth datasets.
Map terms in the glossary to columns and datasets in the dictionary. That creates traceability from concept to implementation.
Key components
Data Dictionary — include these fields
- System/Table/Column names
- Business-friendly label
- Data type and format (e.g., varchar(36) UUID)
- Nullability, uniqueness, primary key
- Allowed values / reference table
- Example values
- Calculation logic (if derived)
- Lineage (upstream tables, transformations)
- Data quality checks (constraints, tests)
- Security classification (e.g., PII, Confidential)
- Owner and steward
- Retention policy
- Last updated
Business Glossary — include these fields
- Term name and concise definition
- Business context/scope (what’s included/excluded)
- Formula (for metrics) with exact logic
- Time window and grain (e.g., daily, monthly)
- Source of truth (dataset/view name)
- Related terms and synonyms
- Approval status and steward
- Change history / last review date
Standards and good practices
- Naming: use clear, lowercase_snake_case for columns; use Title Case for terms.
- Be unambiguous: include inclusion/exclusion rules; avoid vague words like "recent" or "significant" without exact ranges.
- Keep versions: when definitions change, version and date them to protect historical reports.
- Link terms to columns and datasets: every metric should reference the exact dataset and columns used.
- Classify sensitivity: tag PII/PHI/PCI and define handling rules.
- Review cadence: set quarterly reviews for top terms and critical columns.
Worked examples
Example 1 — Column dictionary entry
System: analytics_warehouse Table: dim_customer Column: customer_id Label: Customer ID Type: string (UUID), format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx Nullability: NOT NULL Uniqueness: unique per row (primary key) Allowed values: must match UUID format Example values: "a3b72e84-1c9a-4f11-9a7e-f4c1b2e91e33" Calculation logic: n/a (ingested from crm.customer) Lineage: crm.customer.customer_id -> stg_crm__customer -> dim_customer.customer_id Quality checks: not_null, unique, valid_uuid Security classification: Low Owner: Data Platform Team Steward: CRM Operations Retention: 7 years Last updated: 2025-09-15
Example 2 — Metric term in glossary
Term: Active Customer Definition: A customer with at least one billable transaction in the last 90 days. Scope: Excludes refunded-only transactions; test accounts excluded. Formula: count_distinct(customer_id) where exists transaction with amount > 0 and transaction_date >= today - 90 days. Time grain: Daily snapshot Source of truth: mart_metrics.active_customers_daily Synonyms: Engaged Customer (deprecated) Owner: VP Sales Operations Steward: Analytics Engineering Approval status: Approved (v1.2 on 2025-10-01) Related terms: Churned Customer
Example 3 — Handling synonyms and conflicts
Situation: Marketing says "Active User" = logged in in last 30 days. Finance says "Active Customer" = billable transaction in last 90 days. Resolution:
- Keep both terms with different names and scopes.
- Add cross-references and a warning: "Not interchangeable."
- Link each term to its source-of-truth dataset and dashboards.
Example 4 — PII tagging in dictionary
Table: dim_customer Column: email Type: string Security classification: PII (contact) Policy: Hash for analytics; restrict raw values to role: customer_support Quality checks: not_null for active customers, valid_email_pattern Owner: Data Protection Officer Last updated: 2025-07-08
Build your first entries (step-by-step)
Step 1 — Pick a small scope
Select 1–2 core tables (e.g., dim_customer, fct_orders) and 2–3 key business terms (Active Customer, Monthly Recurring Revenue).
Step 2 — Draft column entries
For each critical column, fill: label, type, nullability, uniqueness, allowed values, lineage, quality checks, owner, sensitivity.
Step 3 — Draft term entries
Write a one-sentence definition, then detail scope, exact formula, time grain, source-of-truth dataset, synonyms, approvals.
Step 4 — Link and review
Map terms to datasets/columns. Ask a business stakeholder and a data engineer to review for clarity and feasibility.
Step 5 — Publish and version
Publish to your catalog. Record version and last updated. Add a reminder to review quarterly.
Practical projects
- Create a minimal dictionary for two tables with 15 columns fully documented.
- Define 5 glossary terms including one with a non-trivial formula (e.g., MRR with proration rules).
- Add PII classification to all customer attributes and state handling rules.
- Set up a weekly checklist review with one stakeholder from Sales or Finance.
Common mistakes and self-checks
- Mistake: Vague definitions ("recent"). Self-check: Replace with a hard number (e.g., 30 days).
- Mistake: No owner. Self-check: Every entry lists owner and steward.
- Mistake: Missing formula details (e.g., how to treat refunds). Self-check: Add inclusions/exclusions and edge cases.
- Mistake: Not linking terms to datasets. Self-check: Each term references a dataset and key columns.
- Mistake: Ignoring versioning. Self-check: Include version and last updated date.
- Mistake: Unlabeled PII. Self-check: Tag sensitivity and access policy.
Exercises
The exercises below are available to everyone. If you log in, your progress is saved so you can resume later.
Exercise 1: Column dictionary entry (id: ex1)
Write a complete data dictionary entry for a column customer_status in dim_customer.
- Include: label, data type, allowed values, nullability, default, example values, lineage, quality checks, sensitivity, owner, retention, last updated.
- Decide if it’s derived or ingested, and document logic if derived.
Template hint
Table: dim_customer Column: customer_status Label: Type: Nullability: Allowed values: Default: Example values: Calculation logic: Lineage: Quality checks: Security classification: Owner: Retention: Last updated:
- [ ] Allowed values are explicit and mutually exclusive.
- [ ] Quality checks include at least not_null or accepted_values.
- [ ] Lineage lists upstream sources or states "authoritative source".
Exercise 2: Term definition (id: ex2)
Define the business term Monthly Recurring Revenue (MRR).
- Provide: definition, scope, formula, proration/discount treatment, time grain, source of truth, synonyms, owner, steward, approval status.
- State how upgrades/downgrades mid-month are handled.
Formula hint
Sum over active subscriptions: normalized monthly price after discounts, pro-rated by active_days_in_month / days_in_month.
- [ ] The formula is executable (SQL-ready or pseudo-SQL).
- [ ] Exclusions (e.g., free trials) are stated.
- [ ] Source-of-truth dataset is named.
Quick test
Take the quick test below to check your understanding. Available to everyone; log in to save your progress and see it on your dashboard.
Learning path
- Step 1: Document 2 critical tables in the data dictionary.
- Step 2: Define top 5 business terms and link them to datasets.
- Step 3: Add PII classifications and access notes.
- Step 4: Set a review cadence and version your entries.
- Step 5: Expand coverage to the next domain (e.g., billing).
Next steps
- Run a 30-minute review with a business stakeholder to validate one term.
- Add two data quality checks to a critical column and record them in the dictionary.
- Schedule quarterly reviews for your top 10 terms.
Mini challenge
Pick one ambiguous term used in your org (e.g., "Churn"). Draft two competing definitions from different teams’ perspectives. Merge them into one approved definition with precise scope and formula, and link it to specific columns and a source-of-truth dataset.