Why this matters
Picking the right keys and declaring the correct grain is the foundation of stable ETL pipelines. It controls how you deduplicate, join tables, manage slowly changing dimensions (SCD), and aggregate facts. In real ETL work, this prevents broken reports, inflated metrics, and late-night data emergencies.
- Data quality: Unique, stable keys prevent duplicates and orphan records.
- Performance: A clear grain reduces unnecessary duplicates and join explosions.
- Maintainability: Grain statements guide new features and schema changes.
Concept explained simply
Key terms
- Business/Natural Key (NK): The real-world identifier in source data (e.g., employee_number, order_id). It may change over time and may not be globally unique.
- Surrogate Key (SK): A generated, system-controlled key (e.g., integer sequence or UUID) used for joins and SCDs.
- Primary Key (PK): The enforced unique key of a table. In warehouses, often the SK for dimensions and a composite for facts.
- Alternate/Unique Keys: Other columns or combinations that must be unique.
- Grain: The exact level of detail of a table. A one-sentence promise about what a single row represents.
Fact grain vs. dimension grain
- Fact table grain: Each row represents an event, transaction line, snapshot, or periodic summary. Examples: "one row per order line" or "one row per store per day".
- Dimension table grain: Each row represents a state of a business entity at a point in time. Examples: "one row per product version (Type 2)" or "one row per current customer (Type 1)".
Grain statement formula
Write it as: "One row represents [entity/event] at [level] per [time boundary if any]." Example: "One row represents a sensor reading per sensor per minute."
Mental model
Think of keys as the row's home address and the grain as the neighborhood zoning rule. The key must uniquely locate the house. The grain declares what kind of building is allowed on each lot. If either is wrong, you get duplicates (two houses at one address) or mixed levels (a warehouse in a residential lot).
How to identify keys and grain (step-by-step)
- List the business entity or event. What is the row about: an order line, a click event, a daily store summary, an employee state?
- Draft a grain statement. One sentence describing the row's level of detail. Keep it explicit.
- Propose a natural key. Which source columns uniquely identify a row at that grain? Consider composite keys.
- Validate uniqueness. Check real data for duplicates and nulls at the declared grain (a SQL sketch follows this list). If the key is unstable, add a surrogate key and dedup rules.
- Decide on surrogate keys. Required for dimensions (SCD) and recommended when natural keys are messy.
- Define deduplication logic. Exact match? Key + timestamp window? Hash of relevant columns? State it.
- Plan late-arriving/updated data handling. Upserts, Type 2 changes, audit columns (effective_from, effective_to).
- Document it. Write the grain statement, key columns, dedup rules, and exceptions into your mapping spec.
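To make the "validate uniqueness" step concrete, here is a minimal sketch of the duplicate and null checks, assuming a staging table named stg_order_lines with the natural key (order_id, line_number); the names are illustrative, and any SQL dialect with GROUP BY/HAVING works.

```sql
-- Any rows returned here violate the declared grain: the natural key repeats.
-- stg_order_lines is an assumed staging table name.
SELECT order_id, line_number, COUNT(*) AS row_count
FROM stg_order_lines
GROUP BY order_id, line_number
HAVING COUNT(*) > 1;

-- Count NULLs in each key column before enforcing uniqueness.
SELECT
    SUM(CASE WHEN order_id    IS NULL THEN 1 ELSE 0 END) AS null_order_id_count,
    SUM(CASE WHEN line_number IS NULL THEN 1 ELSE 0 END) AS null_line_number_count
FROM stg_order_lines;
```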
Edge cases and how to handle them
- Changing natural keys: Maintain a crosswalk table NK -> SK to preserve history (see the sketch after this list).
- Missing or recycled IDs: Combine with other attributes (e.g., business_date, line_number) or introduce an ingestion_id to isolate duplicates.
- Multi-source merges: Align grains first. Conform to the narrowest consistent grain across sources.
- Time rounding: Declare the rule (e.g., event_time truncated to minute) and store raw time too for traceability.
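As a sketch of the crosswalk idea from the first bullet above, one possible shape reuses the employee names from Example 3 below; the table name, column types, and date columns are assumptions, not a fixed standard.

```sql
-- Illustrative crosswalk: every natural key ever seen maps to one stable surrogate key.
-- Table and column names (employee_key_crosswalk, valid_from, valid_to) are assumptions.
CREATE TABLE employee_key_crosswalk (
    employee_number  VARCHAR(32) NOT NULL,  -- natural key as issued by the source system
    employee_dim_key BIGINT      NOT NULL,  -- surrogate key used in facts and dimensions
    valid_from       DATE        NOT NULL,  -- when this mapping became active
    valid_to         DATE,                  -- NULL while the mapping is current
    PRIMARY KEY (employee_number, valid_from)
);

-- During a load, resolve an incoming natural key to its current surrogate key
-- (:employee_number is a bind-variable placeholder).
SELECT employee_dim_key
FROM employee_key_crosswalk
WHERE employee_number = :employee_number
  AND valid_to IS NULL;
```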
Worked examples
Example 1: E-commerce orders
Goal: Build a fact table for order details.
- Grain statement: One row per order_id per line_number.
- Natural key: (order_id, line_number).
- Surrogate key: fact_order_line_key (optional), useful for faster joins.
- Dedup rule: Keep the latest record by source_updated_at for identical (order_id, line_number); see the sketch below.
- Why not header-level grain? Because quantity and price vary per line. Mixing grains causes double counting.
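A minimal sketch of that dedup rule with a window function, assuming the raw rows land in a staging table called stg_order_lines:

```sql
-- Keep only the latest version of each (order_id, line_number) by source_updated_at.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id, line_number
            ORDER BY source_updated_at DESC
        ) AS rn
    FROM stg_order_lines
)
SELECT *
FROM ranked
WHERE rn = 1;
```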
Try it
If cancellations are stored as separate rows with negative quantity, keep the same grain and let measures (quantity, revenue) go negative. Do not add a separate table unless business requires it.
Example 2: IoT sensor readings
Goal: Store periodic readings from sensors that may send multiple bursts per minute.
- Grain statement: One row per sensor_id per minute.
- Natural key: (sensor_id, reading_minute).
- Surrogate key: Optional. Keep raw_event_id for traceability.
- Dedup rule: Aggregate multiple events within the minute (avg_temperature, max_temperature) and keep an event_count, as sketched after this list.
- Note: Also store the raw events table at event-level grain if needed for audits.
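A sketch of the aggregation to the declared minute grain, assuming a raw table named raw_sensor_events with a temperature measure; DATE_TRUNC is PostgreSQL/Snowflake-style syntax and other warehouses spell it slightly differently (e.g., TIMESTAMP_TRUNC in BigQuery).

```sql
-- Collapse raw events to the declared grain: one row per sensor_id per minute.
SELECT
    sensor_id,
    DATE_TRUNC('minute', event_time) AS reading_minute,
    AVG(temperature)                 AS avg_temperature,
    MAX(temperature)                 AS max_temperature,
    COUNT(*)                         AS event_count
FROM raw_sensor_events
GROUP BY sensor_id, DATE_TRUNC('minute', event_time);
```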
Try it
If the device sends late data, allow updates to the same (sensor_id, minute). Use an upsert keyed by (sensor_id, minute).
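One way to express that upsert is a MERGE keyed on (sensor_id, reading_minute). This is only a sketch: fact_sensor_minute and minute_aggregates are assumed names, minute_aggregates is assumed to be recomputed from all events for the affected minutes (late ones included), and MERGE syntax varies slightly across warehouses.

```sql
-- Upsert minute-level rows so late data simply replaces the earlier aggregate.
MERGE INTO fact_sensor_minute AS tgt
USING minute_aggregates AS src
    ON  tgt.sensor_id      = src.sensor_id
    AND tgt.reading_minute = src.reading_minute
WHEN MATCHED THEN UPDATE SET
    avg_temperature = src.avg_temperature,
    max_temperature = src.max_temperature,
    event_count     = src.event_count
WHEN NOT MATCHED THEN INSERT
    (sensor_id, reading_minute, avg_temperature, max_temperature, event_count)
VALUES
    (src.sensor_id, src.reading_minute, src.avg_temperature, src.max_temperature, src.event_count);
```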
Example 3: HR employee dimension (Type 2)
Goal: Track changes in employee attributes over time.
- Grain statement: One row per employee_number per effective period.
- Natural key: employee_number; uniqueness is per period.
- Surrogate key: employee_dim_key (required for Type 2).
- Dedup rule: For the same employee_number and effective_from, keep the latest ingestion.
- Effective dating: effective_from, effective_to, is_current flag.
Try it
Title change? Close the prior row (set effective_to) and insert a new row with a new surrogate key.
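A sketch of that close-and-insert pattern, assuming a dim_employee table whose employee_dim_key is generated by an identity column or sequence; the :placeholders stand for the incoming values.

```sql
-- Close the current row for the employee whose title changed.
UPDATE dim_employee
SET effective_to = CURRENT_DATE,   -- convention choice: the end date could also be exclusive
    is_current   = FALSE
WHERE employee_number = :employee_number
  AND is_current = TRUE;

-- Insert the new state; a fresh employee_dim_key comes from the identity/sequence.
INSERT INTO dim_employee
    (employee_number, title, effective_from, effective_to, is_current)
VALUES
    (:employee_number, :new_title, CURRENT_DATE, NULL, TRUE);
```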
Exercises
These exercises mirror real ETL tasks. For each, write your grain statement, list the keys, and note your dedup rules.
Exercise 1: You have a clickstream source with columns: event_id, user_id, session_id, event_timestamp_ms, page_url, referrer, event_type. Build a fact table at the event level.
- State your grain precisely.
- Pick a natural key and justify it.
- Describe a dedup strategy for accidental duplicate events.
- Decide whether you need a surrogate key.
Exercise 2: A payments system has: payment_id, order_id, method, amount, currency, status, created_at. A refunds table has: refund_id, payment_id, amount, created_at. You want a unified fact for money movements.
- Choose a grain that avoids double counting.
- Define natural keys and any surrogate key needs.
- Describe how to join refunds without inflating totals.
- Write a dedup rule for replays from the source.
Exercise checklist
- Your grain statement is one sentence and unambiguous.
- Natural key columns are available in the source and validated for uniqueness.
- Dedup logic states exactly which columns define duplicates and which record to keep.
- You documented handling for late-arriving or updated records.
Common mistakes and self-check
- Mixing grains: Storing order headers and lines together. Self-check: Can any column have multiple values per key? If yes, you mixed grains.
- Assuming natural keys never change. Self-check: Review history—ever reissued IDs? If yes, add SK and crosswalk.
- Using timestamps alone as a key. Self-check: Can two events share a timestamp, or can clocks skew? If so, add an ID or widen the composite key.
- Ignoring nulls in key parts. Self-check: Count nulls in each key column before enforcing uniqueness.
- Not writing the grain statement. Self-check: If you cannot say it in one sentence, you are not done.
Quick self-audit before building
- Grain statement written and reviewed with stakeholders.
- Uniqueness tested on a sample and a full historical range.
- Dedup and late-arrival rules proven with real edge cases.
- Surrogate keys planned where needed (SCD, messy NKs).
Practical projects
- Build an order-line fact and customer dimension from a mock e-commerce CSV dump. Document grain and keys; implement upserts.
- Create an hourly sensor summary table from raw events, including late data handling and deduplication.
- Model an SCD Type 2 employee dimension with effective dating and SK management. Include a crosswalk for old employee IDs.
Who this is for
- ETL Developers and Data Engineers defining staging, warehouse facts, and dimensions.
- Analysts or Analytics Engineers aligning metrics to a single, consistent level of detail.
Prerequisites
- Basic SQL (joins, group by, window functions).
- Familiarity with dimension modeling concepts (facts, dimensions, SCD).
- Comfort reading source schemas and sample data.
Learning path
- Identify business processes and entities to model (orders, payments, sessions).
- Draft grain statements for each target table and validate with stakeholders.
- Select and test natural keys; decide on surrogate keys.
- Implement dedup rules and late-arrival handling in ETL.
- Backfill historical data and compare counts vs. source to confirm grain.
Mini challenge
You receive a sales summary table with columns: store_id, business_date, total_sales, last_updated_at, and a transaction table with sale_id, store_id, sold_at, amount. Design a daily store fact. Write:
- A grain statement.
- Natural key(s).
- Dedup and late-arrival rules.
- How you would reconcile with the transaction table.
A possible answer
- Grain: One row per store_id per business_date.
- Natural key: (store_id, business_date).
- Dedup: Keep the latest row by last_updated_at for that pair; optionally recompute from transactions and compare to the summary.
- Late arrivals: Upsert the (store_id, business_date) row when late transactions land; maintain a reconciliation delta.
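A sketch of the reconciliation step, assuming the summary lands in a table named daily_store_summary and the detail in transactions; the 0.01 tolerance is an arbitrary illustration.

```sql
-- Recompute daily totals from transactions and flag days where the summary disagrees.
WITH recomputed AS (
    SELECT
        store_id,
        CAST(sold_at AS DATE) AS business_date,
        SUM(amount)           AS recomputed_sales
    FROM transactions
    GROUP BY store_id, CAST(sold_at AS DATE)
)
SELECT
    s.store_id,
    s.business_date,
    s.total_sales,
    r.recomputed_sales,
    s.total_sales - COALESCE(r.recomputed_sales, 0) AS delta
FROM daily_store_summary AS s
LEFT JOIN recomputed AS r
    ON  r.store_id      = s.store_id
    AND r.business_date = s.business_date
WHERE ABS(s.total_sales - COALESCE(r.recomputed_sales, 0)) > 0.01;
```

Swapping the LEFT JOIN for a FULL OUTER JOIN would also surface days that appear only in the transaction table.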
Next steps
- Apply these steps to one of your real sources; write the grain statement and validate uniqueness.
- Share your grain and keys with a teammate and stress-test edge cases.
- Automate uniqueness checks and dedup logic in your ETL as data contracts.