How to learn Defining Grain And Business Processes for Data Modeling Dimensional Star in Analytics Engineer for free

Why this matters

As an Analytics Engineer, you turn raw events into trustworthy tables for BI and Data Science. If you choose the wrong grain or mix multiple business processes in one fact table, you create double-counts, broken KPIs, and painful debugging. Getting grain and process right is the foundation for clean star schemas and fast, reliable insights.

Real tasks you will do: design fact tables for orders, subscriptions, and marketing; document the grain; prevent double-counting across joins; and align KPIs across teams.
Impact: faster dashboards, consistent metrics, easier transformations, and simpler data quality checks.

Progress saving note

You can take the quick test without an account. If you log in, your lesson and test progress will be saved automatically.

Concept explained simply

What is a business process?

A business process is a repeatable activity that produces measurable events (e.g., an order placed, a payment collected, a support ticket resolved). In dimensional modeling, each core business process usually becomes one fact table.

What is grain?

Grain is the exact level of detail of a fact table. It answers: “What does one row represent?” Examples: one row per order, one row per order line, one row per daily subscription, one row per ad impression.

Mental model

Think of grain like the resolution of a photo. The finer the resolution, the more rows you have (more detail). Choose the resolution that exactly matches the questions you need to answer and the way dimensions attach. If you change the resolution mid-stream, the picture gets blurry or distorted.

Good vs. bad grain choices

Good: One row per order line for revenue by product and promotions.
Good: One row per daily active subscription for churn and MRR trends.
Bad: Mixing orders and refunds in the same row with different timestamps.
Bad: One row per session but also storing per-page-view metrics in the same table.

Step-by-step: Define grain and business processes

Step 1 — Name the business process

State the process in verbs and nouns: “Customer places order”, “System bills subscription”, “User views ad”.

Step 2 — List the decisions and KPIs this table must support

Examples: GMV by product, refund rate, CAC by channel, churn rate. This anchors your grain choice.

Step 3 — Inspect source events and entities

Map to events (order_created, order_line_added, payment_captured) and entities (customer, product, store, campaign).

Step 4 — Declare a precise grain sentence

Write it as a complete statement: “One row per order line item (product x in order y) at the time the line is finalized.”

Step 5 — Choose fact table(s) by process

Do not mix processes. Orders, payments, and shipments are separate processes, thus separate fact tables. Use keys to relate them if needed.

Step 6 — Attach dimensions that match the grain

Only join dimensions that can be unambiguously linked at that grain (e.g., product_dim joins to order_line grain, not order header-only).

Step 7 — Validate with sample questions

Run test queries mentally: “Revenue by product?” works at line grain; “Refund rate by store?” may require a refunds fact or a link table.

Step 8 — Document the grain and measures

Write the grain sentence, list of measures, and foreign keys in your model docs. This prevents accidental changes later.

Quick decision helper

If you need product-level revenue and discounts, choose order line grain.
If you need daily churn and MRR, choose daily subscription snapshot grain.
If you need channel attribution at the impression or click, choose event-level grains.
If two processes have different lifecycles (e.g., orders vs. shipments), separate fact tables.

Worked examples

E-commerce: Orders vs. Order Lines

Process: Purchasing.

Option A — Fact Orders (one row per order). Measures: order_subtotal, shipping_fee, order_discount, order_total. Dimensions: customer, store, order_date. Pros: compact. Cons: cannot do product-level analytics without exploding.

Option B — Fact Order Lines (one row per order_id + product_id). Measures: quantity, unit_price, line_discount, line_revenue. Dimensions: product, promotion, category, customer, store, order_date. Pros: fine-grained product analytics. Cons: more rows.

Choice: If product and promo analysis matters, choose order line grain. Often teams keep both: header-level for speed, line-level for detail (clearly documented to avoid double-counting).

Subscriptions: Daily Snapshot

Process: Billing/Entitlement lifecycle.

Grain: One row per subscription per day (only when active), capturing MRR, plan_id, status, start_date, end_date.

Why: Supports churn, retention, and cohort analysis across time. Attach dimensions: customer, plan, region, sales_rep.

Alternative grain: event-based

One row per subscription event (created, upgraded, canceled). Good for event audits but harder for daily KPIs. Often both coexist: fact_subscription_events and fact_subscription_daily.

Marketing: Ad Impressions and Clicks

Processes: Serving an impression, registering a click.

Grain choices:

fact_impressions: one row per impression_id
fact_clicks: one row per click_id
fact_sessions_attributed: one row per session with attributed campaign (downstream)

Keep processes separate. Join via campaign and user/session keys when calculating CTR or attribution. Do not merge impressions and clicks into a single hybrid fact.

Patterns and edge cases

Many-to-many relationships

If a dimension can have multiple values at your chosen grain (e.g., multiple promotions per order), you may need a bridge table or choose a finer grain (order line). Avoid stuffing arrays in a fact table if they can’t be joined deterministically.

Late-arriving facts and dimensions

When a payment confirmation arrives later than the order, keep payments as their own process. Use foreign keys and timestamps to link them chronologically rather than merging them in the order fact.

Slowly changing dimensions

At a line grain, join to the correct dimension version (e.g., product_dim as of order_date) to preserve historical truth. Your grain choice influences how you implement time-aware joins.

Exercises

Complete these mini tasks to practice. Then check your answers below. Your progress saves if you are logged in.

Exercise 1 — Define grain from a mixed event log

Data snippet (conceptual):

order_created(order_id, customer_id, store_id, created_at)
order_line_added(order_id, product_id, qty, unit_price, promo_code)
payment_captured(payment_id, order_id, amount, method, captured_at)

Task:

Name the business process and declare an exact grain for a fact table supporting product revenue and discount analytics.
List candidate measures and foreign keys.
State which dimensions attach cleanly at that grain.

Show solution (Exercise 1)

Process: Purchasing (order line fulfillment).

Grain: One row per order_id + product_id line at the time the line is finalized.

Measures: quantity, unit_price, line_discount, line_revenue (quantity * (unit_price - discount)).

Foreign keys: order_id, product_id, customer_id, store_id, promotion_id (if resolved), order_date_id.

Dimensions: product_dim, promotion_dim, customer_dim, store_dim, date_dim. Payment is a separate process; do not embed payment fields here.

Exercise 2 — Payments and refunds

Data snippet (conceptual):

payment_captured(payment_id, order_id, amount, method, captured_at)
refund_issued(refund_id, payment_id, amount, reason, issued_at)

Task:

Propose fact tables and grains to analyze net collected amount by day, method, and store, without double-counting.
Explain how you would join to orders and stores.

Show solution (Exercise 2)

Two processes, two facts:

fact_payments: one row per payment_id. Measures: gross_collected_amount. Dimensions via FKs: date_dim (captured_at), payment_method_dim, store_dim (via order_id -> order header), customer_dim.
fact_refunds: one row per refund_id. Measures: refund_amount. Dimensions via FKs: date_dim (issued_at), payment_method_dim (from related payment), store_dim, customer_dim.

Net collected by day = sum(gross_collected_amount) minus sum(refund_amount) with consistent date dimensions. Join to store via order header (order_id -> order_dim -> store_id). Keep processes separate to avoid mis-timed netting.

Exercise checklist

You wrote a clear grain sentence in a single line.
You separated processes into different fact tables.
Your chosen dimensions attach unambiguously at that grain.
Your measures are additive or semi-additive as needed.
You can answer the intended KPIs without awkward joins.

Common mistakes and self-check

Mixing processes: merging orders and payments in one fact. Fix by splitting into separate facts and linking by keys.
Unclear grain: “per order but also per product.” Fix by choosing order line grain or keeping a separate header fact.
Ambiguous dimensions: attaching product_dim to an order-level fact. Fix by using line grain or a bridge.
Leaky measures: counting refunds as negative payments in the same row. Fix by separate refund fact.
Time misalignment: joining to the wrong date for trends. Fix by explicit date roles (order_date, ship_date, payment_date).

Self-check prompts

Can I answer “What does one row represent?” in one sentence?
Can I point to the exact events and keys that create one row?
Do all dimensions have a single, deterministic join path at this grain?
Are the measures clearly defined and additive where expected?

Practical projects

Build a mini retail mart: fact_order_lines, fact_payments, fact_refunds, product_dim, customer_dim, date_dim. Document grain for each fact and produce 5 KPI queries.
Create a subscription mart with fact_subscription_events and fact_subscription_daily. Validate churn, expansion, and MRR against sample scenarios.
Model a marketing mart with fact_impressions and fact_clicks plus campaign_dim. Compute CTR and spend by channel and date.

Who this is for, Prerequisites, Learning path

Who this is for

Analytics Engineers, BI Developers, and Data Analysts who design or maintain star schemas and metric layers.

Prerequisites

Basic SQL joins and aggregations
Familiarity with star schemas (facts and dimensions)
Understanding of your source events/entities

Learning path

Start: Defining grain and business processes (this lesson)
Next: Choosing dimensions and surrogate keys
Then: Handling slowly changing dimensions and time
Finally: Validating additivity and building metric layers

Mini challenge

Your company adds partial shipments and backorders. Propose facts and grains to support on-time delivery rate by product and warehouse, without breaking revenue reporting. Write:

Process list (orders, shipments, backorders)
Grain sentences for each fact
Dimensions and measures per fact
How to compute on-time delivery and revenue without double-counting

Hint

Keep shipments as their own process (one row per shipment line). Revenue stays in order line fact; delivery timeliness comes from shipments; link via order_id and product_id.

When you are ready, take the quick test below to lock in the concepts.

Menu

Defining Grain And Business Processes

Table of Contents