Why this matters
As an ETL Developer, your pipelines outlive you. Clear data dictionaries and field definitions make your work discoverable, auditable, and safe to change. They help analysts pick the right fields, reduce support tickets, and prevent breaking changes in production.
- Handover: new team members can understand fields without meetings.
- Governance: business definitions align across reports and models.
- Quality: constraints and valid values catch bad data early.
- Change control: knowing lineage and owners speeds up approvals.
Concept explained simply
What is a data dictionary?
A data dictionary is a catalog describing your datasets and each field: what it means, how it is calculated, where it comes from, how to use it, and who owns it.
Mental model: A contract
Treat each field definition as a contract between data producers and consumers. It states the meaning, allowed values, calculation rules, and reliability guarantees. If the contract changes, you version it and notify consumers.
What to include in a data dictionary
At minimum for each dataset/table and field:
- Table: name, business purpose, refresh schedule, source systems, owner/steward, access level.
- Field: name, business definition, technical data type, nullability, default, units, valid values/ranges, semantic type (e.g., PII, currency, date), calculation/transform rule, lineage/source mapping, constraints and quality checks, sample values, notes.
- Governance: classification (PII/PHI/confidential), retention policy, version and change log.
Tip: naming conventions
- Use snake_case and consistent prefixes/suffixes (e.g., _id, _dt, _amt, _pct).
- Names should reflect business concepts: order_id, customer_tenure_days.
Worked examples
Example 1: Raw table field
- Table: sales_order (Purpose: ecommerce orders; daily snapshot)
- Field: order_id
- Business definition: Unique identifier of a customer order.
- Type: STRING/VARCHAR(36); Null: No; Default: None
- Constraints: Primary key; unique; non-empty.
- Valid values: UUID format.
- Lineage: source.mysql.orders.id
- Sample: "9e0a2d7a-3f1f-4d4c-b5a7-7d2c2ad0a2f9"
- Owner: Commerce Data Steward
Example 2: Derived metric
- Table: f_order_finance (Purpose: financial facts per order)
- Field: gross_margin_pct
- Business definition: Gross margin as a percentage of net revenue.
- Rule: gross_margin_pct = (net_revenue_amt - cogs_amt) / NULLIF(net_revenue_amt, 0)
- Type: DECIMAL(5,4); Range: 0 to 1; Null: Yes (if net_revenue_amt = 0)
- Units: Ratio (0–1)
- Quality checks: net_revenue_amt >= 0; cogs_amt >= 0; gross_margin_pct between 0 and 1.
- Lineage: net_revenue_amt from f_order_finance; cogs_amt from f_order_finance (sourced from ERP.gl_cost)
- Notes: Do not multiply by 100; reporting layer formats as percent.
Example 3: PII handling
- Table: dim_customer
- Field: email_hash
- Business definition: SHA-256 hash of lowercase trimmed customer email.
- Type: STRING(64); Null: Yes (if email absent)
- Classification: Pseudonymized (derived from PII)
- Rule: email_hash = SHA256(LOWER(TRIM(email)))
- Access: allowed for analytics; raw email restricted
- Quality checks: consistent hash length = 64 hex chars.
How to create and maintain your dictionary
- Inventory: list your datasets produced by each pipeline.
- Field pass: for each table, draft field rows with business-first definitions.
- Add rules: document calculation formulas and source mappings.
- Governance: classify PII, set owners, define refresh and retention.
- Quality: define constraints and checks per field.
- Review: validate with a business stakeholder and another engineer.
- Version: stamp v1.0; record changes in a change log section.
- Automate: where possible, sync types and nullability from schemas.
Change log template
- Date | Version | Field | Change | Reason | Impact | Owner
Practice exercises
Do these before the quick test. You can compare with the solutions below.
-
Exercise 1 — Define a derived field: customer_age_years
- Context: Table dim_customer; inputs birth_date (DATE), as_of_date (DATE, daily load date).
- Task: Write the business definition, rule/formula, type, nullability, valid range, quality checks, and lineage.
-
Exercise 2 — Document a KPI: active_subscribers
- Context: Monthly subscription analytics.
- Task: Provide definition, inclusion criteria, rule, source mapping, refresh cadence, owner, and a sample value.
Show solutions
Solutions are also available under each exercise below.
Checklist: a solid field definition
- Clear business meaning in one sentence.
- Exact technical type, units, nullability, and defaults.
- Formula/rule and source lineage are explicit.
- Constraints and valid ranges listed.
- Classification (e.g., PII) and access guidance.
- Owner/steward and refresh cadence noted.
- Sample value provided.
- Version/change log updated.
Common mistakes and self-checks
- Mistake: Business definition copies the formula only. Fix: Start with the what, then the how.
- Mistake: Units not specified (is pct 0–1 or 0–100?). Fix: Always state units and ranges.
- Mistake: Missing null behavior. Fix: Say when values become NULL and why.
- Mistake: Vague lineage. Fix: Name exact source tables/fields and transformations.
- Mistake: No quality checks. Fix: Add constraints that reflect business logic.
- Mistake: Hidden PII. Fix: Classify fields and state access rules.
Self-check mini audit
- Pick 3 fields. Can a new analyst compute them from your rules?
- Do any definitions contradict report dashboards?
- If you change a formula today, do you know who to notify?
Practical projects
- Project 1: Document one production table fully (20–30 fields). Share with an analyst and collect feedback.
- Project 2: Add automated tests for 5 constraints (ranges, uniqueness, referential integrity) referenced in your dictionary.
- Project 3: Introduce versioning for a KPI; record a change log entry and a deprecation notice for old fields.
Who this is for and prerequisites
Who this is for
- ETL Developers creating or maintaining analytical datasets.
- Data engineers preparing handover packages for analysts and BI teams.
Prerequisites
- Basic SQL and understanding of your warehouse types (e.g., INT, DECIMAL, STRING, DATE).
- Knowledge of your source systems and ETL/ELT tools.
Learning path
- Learn naming conventions and typing standards.
- Practice writing business-first field definitions.
- Add lineage and rules; align with stakeholders.
- Integrate quality checks; automate schema metadata where possible.
- Version and maintain change logs.
Next steps
- Finish the exercises below and run the Quick Test at the end.
- Adopt the checklist for every new field you ship.
- Schedule quarterly reviews of key KPIs with business owners.
Mini challenge
Pick any KPI your team argues about. Write a one-paragraph business definition and a precise calculation rule. Ask one analyst and one product manager to review it. Iterate until both agree. Add it to your dictionary with an owner and refresh cadence.
Exercises with solutions
Exercise 1 — Define customer_age_years
Context: dim_customer has birth_date (DATE) and your pipeline sets as_of_date (DATE) on daily runs.
- Write a business definition.
- Specify rule/formula.
- State type, nullability, valid range, quality checks.
- Provide lineage and a sample value.
Show solution
- Business: Customer age in whole years at as_of_date.
- Rule: customer_age_years = FLOOR(DATEDIFF(day, birth_date, as_of_date)/365.2425). If birth_date is NULL, result is NULL.
- Type: INT; Null: Yes
- Range: 0–120
- Quality: birth_date <= as_of_date; customer_age_years between 0 and 120
- Lineage: dim_customer.birth_date; pipeline parameter as_of_date
- Sample: 34
Exercise 2 — Document active_subscribers
Context: Monthly subscription analytics table f_subscriptions_monthly.
- Write business definition and inclusion criteria.
- Write rule, sources, refresh cadence, owner, sample value.
Show solution
- Business: Count of subscribers with an active paid subscription on month_end_date.
- Criteria: status in ('active','grace') AND plan_type != 'trial'.
- Rule: active_subscribers = COUNT(DISTINCT subscriber_id) filtered by criteria for the month.
- Sources: billing.subscriptions.status, plan_type; calendar.month_end_date
- Refresh: Monthly (T+1 day); Owner: Subscriptions Data Steward
- Sample: 128,452
Quick test info
Take the quick test below to check your understanding. Available to everyone. Only logged-in users get saved progress.