Why this matters
As an Analytics Engineer, you turn raw events into trustworthy tables for BI and Data Science. If you choose the wrong grain or mix multiple business processes in one fact table, you create double-counts, broken KPIs, and painful debugging. Getting grain and process right is the foundation for clean star schemas and fast, reliable insights.
- Real tasks you will do: design fact tables for orders, subscriptions, and marketing; document the grain; prevent double-counting across joins; and align KPIs across teams.
- Impact: faster dashboards, consistent metrics, easier transformations, and simpler data quality checks.
Progress saving note
You can take the quick test without an account. If you log in, your lesson and test progress will be saved automatically.
Concept explained simply
What is a business process?
A business process is a repeatable activity that produces measurable events (e.g., an order placed, a payment collected, a support ticket resolved). In dimensional modeling, each core business process usually becomes one fact table.
What is grain?
Grain is the exact level of detail of a fact table. It answers: “What does one row represent?” Examples: one row per order, one row per order line, one row per daily subscription, one row per ad impression.
Mental model
Think of grain like the resolution of a photo. The finer the resolution, the more rows you have (more detail). Choose the resolution that exactly matches the questions you need to answer and the way dimensions attach. If you change the resolution mid-stream, the picture gets blurry or distorted.
Good vs. bad grain choices
- Good: One row per order line for revenue by product and promotions.
- Good: One row per daily active subscription for churn and MRR trends.
- Bad: Mixing orders and refunds in the same row with different timestamps.
- Bad: One row per session but also storing per-page-view metrics in the same table.
Step-by-step: Define grain and business processes
Step 1 — Name the business process
State the process in verbs and nouns: “Customer places order”, “System bills subscription”, “User views ad”.
Step 2 — List the decisions and KPIs this table must support
Examples: GMV by product, refund rate, CAC by channel, churn rate. This anchors your grain choice.
Step 3 — Inspect source events and entities
Map to events (order_created, order_line_added, payment_captured) and entities (customer, product, store, campaign).
Step 4 — Declare a precise grain sentence
Write it as a complete statement: “One row per order line item (product x in order y) at the time the line is finalized.”
Step 5 — Choose fact table(s) by process
Do not mix processes. Orders, payments, and shipments are separate processes, thus separate fact tables. Use keys to relate them if needed.
Step 6 — Attach dimensions that match the grain
Only join dimensions that can be unambiguously linked at that grain (e.g., product_dim joins to order_line grain, not order header-only).
Step 7 — Validate with sample questions
Run test queries mentally: “Revenue by product?” works at line grain; “Refund rate by store?” may require a refunds fact or a link table.
Step 8 — Document the grain and measures
Write the grain sentence, list of measures, and foreign keys in your model docs. This prevents accidental changes later.
Quick decision helper
- If you need product-level revenue and discounts, choose order line grain.
- If you need daily churn and MRR, choose daily subscription snapshot grain.
- If you need channel attribution at the impression or click, choose event-level grains.
- If two processes have different lifecycles (e.g., orders vs. shipments), separate fact tables.
Worked examples
E-commerce: Orders vs. Order Lines
Process: Purchasing.
Option A — Fact Orders (one row per order). Measures: order_subtotal, shipping_fee, order_discount, order_total. Dimensions: customer, store, order_date. Pros: compact. Cons: cannot do product-level analytics without exploding.
Option B — Fact Order Lines (one row per order_id + product_id). Measures: quantity, unit_price, line_discount, line_revenue. Dimensions: product, promotion, category, customer, store, order_date. Pros: fine-grained product analytics. Cons: more rows.
Choice: If product and promo analysis matters, choose order line grain. Often teams keep both: header-level for speed, line-level for detail (clearly documented to avoid double-counting).
Subscriptions: Daily Snapshot
Process: Billing/Entitlement lifecycle.
Grain: One row per subscription per day (only when active), capturing MRR, plan_id, status, start_date, end_date.
Why: Supports churn, retention, and cohort analysis across time. Attach dimensions: customer, plan, region, sales_rep.
Alternative grain: event-based
One row per subscription event (created, upgraded, canceled). Good for event audits but harder for daily KPIs. Often both coexist: fact_subscription_events and fact_subscription_daily.
Marketing: Ad Impressions and Clicks
Processes: Serving an impression, registering a click.
Grain choices:
- fact_impressions: one row per impression_id
- fact_clicks: one row per click_id
- fact_sessions_attributed: one row per session with attributed campaign (downstream)
Keep processes separate. Join via campaign and user/session keys when calculating CTR or attribution. Do not merge impressions and clicks into a single hybrid fact.
Patterns and edge cases
Many-to-many relationships
If a dimension can have multiple values at your chosen grain (e.g., multiple promotions per order), you may need a bridge table or choose a finer grain (order line). Avoid stuffing arrays in a fact table if they can’t be joined deterministically.
Late-arriving facts and dimensions
When a payment confirmation arrives later than the order, keep payments as their own process. Use foreign keys and timestamps to link them chronologically rather than merging them in the order fact.
Slowly changing dimensions
At a line grain, join to the correct dimension version (e.g., product_dim as of order_date) to preserve historical truth. Your grain choice influences how you implement time-aware joins.
Exercises
Complete these mini tasks to practice. Then check your answers below. Your progress saves if you are logged in.
Exercise 1 — Define grain from a mixed event log
Data snippet (conceptual):
- order_created(order_id, customer_id, store_id, created_at)
- order_line_added(order_id, product_id, qty, unit_price, promo_code)
- payment_captured(payment_id, order_id, amount, method, captured_at)
Task:
- Name the business process and declare an exact grain for a fact table supporting product revenue and discount analytics.
- List candidate measures and foreign keys.
- State which dimensions attach cleanly at that grain.
Show solution (Exercise 1)
Process: Purchasing (order line fulfillment).
Grain: One row per order_id + product_id line at the time the line is finalized.
Measures: quantity, unit_price, line_discount, line_revenue (quantity * (unit_price - discount)).
Foreign keys: order_id, product_id, customer_id, store_id, promotion_id (if resolved), order_date_id.
Dimensions: product_dim, promotion_dim, customer_dim, store_dim, date_dim. Payment is a separate process; do not embed payment fields here.
Exercise 2 — Payments and refunds
Data snippet (conceptual):
- payment_captured(payment_id, order_id, amount, method, captured_at)
- refund_issued(refund_id, payment_id, amount, reason, issued_at)
Task:
- Propose fact tables and grains to analyze net collected amount by day, method, and store, without double-counting.
- Explain how you would join to orders and stores.
Show solution (Exercise 2)
Two processes, two facts:
- fact_payments: one row per payment_id. Measures: gross_collected_amount. Dimensions via FKs: date_dim (captured_at), payment_method_dim, store_dim (via order_id -> order header), customer_dim.
- fact_refunds: one row per refund_id. Measures: refund_amount. Dimensions via FKs: date_dim (issued_at), payment_method_dim (from related payment), store_dim, customer_dim.
Net collected by day = sum(gross_collected_amount) minus sum(refund_amount) with consistent date dimensions. Join to store via order header (order_id -> order_dim -> store_id). Keep processes separate to avoid mis-timed netting.
Exercise checklist
- You wrote a clear grain sentence in a single line.
- You separated processes into different fact tables.
- Your chosen dimensions attach unambiguously at that grain.
- Your measures are additive or semi-additive as needed.
- You can answer the intended KPIs without awkward joins.
Common mistakes and self-check
- Mixing processes: merging orders and payments in one fact. Fix by splitting into separate facts and linking by keys.
- Unclear grain: “per order but also per product.” Fix by choosing order line grain or keeping a separate header fact.
- Ambiguous dimensions: attaching product_dim to an order-level fact. Fix by using line grain or a bridge.
- Leaky measures: counting refunds as negative payments in the same row. Fix by separate refund fact.
- Time misalignment: joining to the wrong date for trends. Fix by explicit date roles (order_date, ship_date, payment_date).
Self-check prompts
- Can I answer “What does one row represent?” in one sentence?
- Can I point to the exact events and keys that create one row?
- Do all dimensions have a single, deterministic join path at this grain?
- Are the measures clearly defined and additive where expected?
Practical projects
- Build a mini retail mart: fact_order_lines, fact_payments, fact_refunds, product_dim, customer_dim, date_dim. Document grain for each fact and produce 5 KPI queries.
- Create a subscription mart with fact_subscription_events and fact_subscription_daily. Validate churn, expansion, and MRR against sample scenarios.
- Model a marketing mart with fact_impressions and fact_clicks plus campaign_dim. Compute CTR and spend by channel and date.
Who this is for, Prerequisites, Learning path
Who this is for
Analytics Engineers, BI Developers, and Data Analysts who design or maintain star schemas and metric layers.
Prerequisites
- Basic SQL joins and aggregations
- Familiarity with star schemas (facts and dimensions)
- Understanding of your source events/entities
Learning path
- Start: Defining grain and business processes (this lesson)
- Next: Choosing dimensions and surrogate keys
- Then: Handling slowly changing dimensions and time
- Finally: Validating additivity and building metric layers
Mini challenge
Your company adds partial shipments and backorders. Propose facts and grains to support on-time delivery rate by product and warehouse, without breaking revenue reporting. Write:
- Process list (orders, shipments, backorders)
- Grain sentences for each fact
- Dimensions and measures per fact
- How to compute on-time delivery and revenue without double-counting
Hint
Keep shipments as their own process (one row per shipment line). Revenue stays in order line fact; delivery timeliness comes from shipments; link via order_id and product_id.
When you are ready, take the quick test below to lock in the concepts.