luvv to helpDiscover the Best Free Online Tools
Topic 3 of 8

Column Level Documentation

Learn Column Level Documentation for free with explanations, exercises, and a quick test (for Analytics Engineer).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

Column level documentation is the fastest way to reduce confusion and rework in analytics. It tells everyone exactly what each field means, how it is calculated, and how to use it safely. As an Analytics Engineer, you will often be asked questions like: "What does order_total include?" or "Is signup_date the marketing signup or the product activation date?" Column docs resolve these questions without meetings.

  • Real task: define a trusted definition for revenue columns so Finance, Sales, and Product report the same numbers.
  • Real task: mark PII columns so downstream users know how to handle them.
  • Real task: standardize boolean and timestamp columns for consistent modeling and BI usage.

Concept explained simply

Think of a table as a form, and each column as a field on that form. Column documentation is the label plus the usage note on each field. It prevents wrong inputs and wrong expectations.

Mental model

Use the 3C model: Clarity, Constraints, Context.

  • Clarity: What is it, in one sentence (non-technical first).
  • Constraints: Type, allowed values, nullability, uniqueness, data quality tests.
  • Context: Where it comes from, how calculated, unit/timezone, examples, steward/owner.

What to document at column level

  • Name (final, user-facing)
  • Data type (and unit if numeric, timezone if timestamp)
  • Nullability and default (and meaning of NULL)
  • Description (concise, business first; tech detail second)
  • Allowed values / enum semantics
  • Constraints (unique, primary key component, foreign key references)
  • Sensitivity / PII classification
  • Source lineage (raw fields, upstream model)
  • Calculation rules (formula, filters, windowing, rounding)
  • Freshness / update cadence (if relevant)
  • Examples (1–2 realistic values)
  • Owner/steward and quality tests present
Template (copy/paste)
# Minimal YAML snippet (adapt as needed)
columns:
  - name: column_name
    description: One-sentence business definition.
    data_type: string|integer|numeric|boolean|timestamp
    unit_or_timezone: USD|EUR|kg|UTC|local_tz (optional)
    nullable: true|false
    default: null|0|'' (optional)
    allowed_values: [list, of, values] (optional)
    constraints:
      unique: false
      primary_key_part: false
      foreign_key: null # e.g., dim_user.user_id
    sensitivity: non-sensitive|pii|restricted
    lineage:
      source_columns: [raw.col_a, raw.col_b]
      upstream_models: [stg_orders]
    calculation: description of formula/filters
    examples: ['value1', 'value2']
    tests: [not_null, accepted_values, unique]
    owner: analytics-eng@company.com

Standards and naming conventions

  • Booleans: use is_ or has_ prefix (is_active, has_discount). Document true and false meanings explicitly.
  • Timestamps: suffix with _at for event time (ordered_at) and _ts for load/processing time (loaded_ts). Always state timezone.
  • Amounts: suffix with _amount and state currency and tax/discount inclusion.
  • Counts: suffix with _count for integers; document whether distinct and the entity counted.
  • Identifiers: suffix with _id; state uniqueness and source system.

Worked examples

Example 1: customers.email
name: email
data_type: string
nullable: false
description: Primary contact email provided by the user at account creation; used for notifications.
constraints:
  unique: true
  primary_key_part: false
sensitivity: pii
lineage:
  source_columns: [raw_users.email]
  upstream_models: [stg_users]
calculation: Lowercased and trimmed; invalid formats filtered out by regex.
examples: ['alex@example.com']
tests: [not_null, unique]
owner: data-platform@company.com

Note the explicit uniqueness and PII classification.

Example 2: orders.order_total_amount
name: order_total_amount
data_type: numeric
unit_or_timezone: USD
nullable: false
description: Final amount charged to the customer for the order after discounts; tax included; shipping excluded.
constraints:
  unique: false
sensitivity: non-sensitive
lineage:
  source_columns: [raw_orders.subtotal, raw_orders.discount, raw_orders.tax]
  upstream_models: [stg_orders]
calculation: (subtotal - discount) + tax; rounded to 2 decimals using bankers rounding.
examples: ['49.99', '0.00']
tests: [not_null]
owner: finance-analytics@company.com
Example 3: events.event_time_at vs events.created_at
name: event_time_at
data_type: timestamp
unit_or_timezone: UTC
nullable: false
description: When the event actually happened on the client device (client-side timestamp).
calculation: Parsed from client payload; adjusted to UTC.

name: created_at
data_type: timestamp
unit_or_timezone: UTC
nullable: false
description: When the event record was first stored by the ingestion service.
calculation: Server write time in UTC.

Two timestamps: one business event time, one ingestion time. Document both and when to use each.

Before/After wording improvement

Before: "Revenue." After: "Revenue after discounts, tax included, shipping excluded, in USD."

Before: "Active flag." After: "is_active: true if the user completed email verification and has not churned in last 30 days; false otherwise."

How to collect column info quickly

Step 1: Inspect schema (types, nulls, constraints) using your warehouse information schema or modeling tool.
Step 2: Ask domain owners short targeted questions: meaning, units, edge cases.
Step 3: Trace lineage: which raw fields and transforms produce the column.
Step 4: Add examples and tests; confirm boolean semantics and timestamp timezone.
Step 5: Peer review with a consumer (analyst/PM) for clarity.
Helpful SQL snippets
-- Find columns without descriptions (sample pattern)
select table_schema, table_name, column_name
from information_schema.columns c
left join my_doc_store d
  on d.table_name = c.table_name and d.column_name = c.column_name
where d.description is null;

-- Check null ratios to inform documentation
select column_name,
       100.0 * sum(case when col is null then 1 else 0 end)/count(*) as null_pct
from my_table
unpivot(col for column_name in (col1, col2, col3))
group by column_name;

Quality gates and automation

  • Not null tests for keys and required business fields.
  • Accepted values for enums (status, country_code lists).
  • Uniqueness for identifiers and natural keys.
  • Freshness checks when columns rely on upstream feeds.
  • CI rule: block merges if new columns lack description, type, and nullability.
Automation tips
  • Generate initial stubs from warehouse schema; fill description and context manually.
  • Annotate models so BI tools inherit column docs automatically.
  • Use a linter to enforce naming prefixes (is_, _at, _amount).

Exercises

Do these now. The Quick Test is available to everyone; log in to save your progress later.

Exercise 1: Write column docs from a model

Given a model user_metrics with columns:

user_id (integer, not null, unique)
email (string, not null)
signup_date (date, not null)
is_active (boolean, not null)
lifetime_value (numeric, nullable)

Write column-level docs for email, signup_date, is_active, lifetime_value using the template above.

Need a hint?
  • Define units for numeric fields and the meaning of NULL.
  • For is_active, state exactly what true and false mean.
  • For signup_date, clarify if it is marketing signup or product activation.

Exercise 2: Fix ambiguous definitions

Given current docs:

orders.status: 'order status'
orders.discount_amount: 'discount on order'
orders.ordered_at: 'order time'

Rewrite with precise business meaning, units/timezone, and edge cases.

Need a hint?
  • List allowed status values and transitions.
  • Say if discount includes promo codes and gift cards.
  • State timezone for ordered_at and whether it is customer event time or server time.

Checklist before you publish

  • One-sentence business definition is clear and specific.
  • Type, nullability, unit/timezone documented.
  • Boolean semantics spelled out (true means…, false means…).
  • Enum values and meanings listed.
  • Source lineage and calculation equation provided.
  • Examples realistic and unambiguous.
  • Sensitivity/PII classification present.
  • Tests attached for critical constraints.

Common mistakes and self-check

  • Vague descriptions. Self-check: can a non-technical stakeholder choose the right field without asking you?
  • Missing timezone/units. Self-check: if someone changes currency, would your doc still be clear?
  • Not defining NULL meaning. Self-check: write one sentence: "NULL means …" If you cannot, clarify with owners.
  • Boolean ambiguity. Self-check: write two bullet points, one for true, one for false.
  • Drift after changes. Self-check: PR checklist includes "Update column docs" whenever SQL changes output.

Practical projects

  • Take your top 3 BI dashboards and backfill complete column docs for their source models.
  • Create an enum dictionary for all status-like columns (orders, shipments, invoices) with accepted values and meanings.
  • Add or fix data tests that enforce your documented constraints for at least 5 critical columns.

Who this is for, prerequisites, and learning path

Who this is for

  • Analytics Engineers owning models consumed by BI and stakeholders.
  • Data Analysts who need consistent field meanings.
  • BI Developers curating semantic layers.

Prerequisites

  • Basic SQL and understanding of warehouse data types.
  • Familiarity with your modeling tool (e.g., transformations, tests).

Learning path

  • Start: Column Level Documentation (this page).
  • Next: Table-level and model contracts (constraints and tests).
  • Then: Semantic layer definitions and BI field descriptions.

Next steps

  • Document 10 highest-impact columns today using the template.
  • Add CI rules to fail builds when new columns lack docs.
  • Run the Quick Test below to check your understanding.

Mini challenge

Pick one messy model and improve 5 column descriptions so a new analyst can compute revenue and churn KPIs without asking you questions. Timebox: 30 minutes.

Practice Exercises

2 exercises to complete

Instructions

Using the template, write column docs for these fields in user_metrics:

  • email
  • signup_date
  • is_active
  • lifetime_value

Include: description, data_type, nullable, unit/timezone (if any), boolean semantics, meaning of NULL, lineage, calculation, examples, tests, sensitivity.

Expected Output
YAML-like documentation blocks for four columns that include clear business definitions, types, nullability, unit/timezone where relevant, boolean semantics, NULL meaning, lineage and calculation notes, one or two examples, and suggested tests.

Column Level Documentation — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

8 questions70% to pass

Have questions about Column Level Documentation?

AI Assistant

Ask questions about this tool