
Data Modeling For Analytics

Learn Data Modeling For Analytics for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Who this is for

Data Engineers and analytics-focused developers who build reliable data layers for BI dashboards, self-serve analysis, and ML features. If you work with warehouses, ELT/ETL pipelines, or metrics layers, this is for you.

Prerequisites

  • Comfort writing SQL (SELECT, JOINs, GROUP BY, window functions)
  • Basic understanding of data warehouses/lakes and ETL/ELT layers (staging, curated, marts)
  • Familiarity with business metrics (e.g., revenue, active users)

Why this matters

In real Data Engineer work you will:

  • Design star schemas so analysts can answer questions in one query
  • Define the grain and keys that make joins simple and performant
  • Handle Slowly Changing Dimensions (SCD) so historical reports remain accurate
  • Balance cost/performance using partitioning, clustering, and denormalization
  • Create analytic-ready marts that BI tools understand without complex SQL

Typical on-the-job tasks
  • Translate product metrics into facts/dimensions with a well-defined grain
  • Add new attributes to a dimension without breaking existing dashboards
  • Model event data (page views, clicks) for session and funnel analysis
  • Build a semantic layer that defines consistent metrics and time filters

Concept explained simply

Data modeling for analytics is organizing raw data into predictable shapes that make common questions fast and easy to answer. You pick a unit of analysis (the grain), store numbers that can be aggregated (facts), and add context (dimensions) that people use to slice and filter. Good models make the “right” query obvious and performant.

Mental model

Think of your data like a board game:

  • Board squares = facts (each square is one measurable event or state)
  • Cards = dimensions (describe who/what/when/where about the squares)
  • Rules = grain (what exactly one row represents)
  • Score = metrics (how you add up the numbers)

If the rules (grain) are unclear or the cards (dimensions) don’t match the squares (facts), the game breaks—queries become slow or wrong.

Core concepts quickly

Dimensional modeling

  • Fact tables: numeric measures at a defined grain (e.g., one row per order line)
  • Dimension tables: descriptive attributes (e.g., product, customer, date)
  • Grain: the most important decision; every column in a fact must match this level
  • Star schema: facts at center, dimensions around; simple joins, great for BI
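
To make this concrete, here is a minimal DDL sketch of a two-table star in generic SQL (exact types and constraint syntax vary by warehouse). It borrows the order-line example developed later on this page, trimmed to a few columns:

    -- Dimension: one row per product; surrogate key plus descriptive attributes
    CREATE TABLE dim_product (
      product_key INT,             -- surrogate key, referenced by facts
      sku         VARCHAR(64),     -- natural key from the source system
      name        VARCHAR(255),
      category    VARCHAR(100),
      brand       VARCHAR(100)
    );

    -- Fact: one row per order line (the grain); measures plus dimension keys
    CREATE TABLE fact_order_lines (
      order_line_id  BIGINT,         -- grain key
      order_date_key INT,            -- joins dim_date
      product_key    INT,            -- joins dim_product
      qty            INT,
      unit_price     DECIMAL(12,2),
      net_amount     DECIMAL(12,2)   -- the measure behind "revenue"
    );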

Surrogate keys and Slowly Changing Dimensions (SCD)

  • Use surrogate keys (integer IDs) for dimensions and reference them in facts
  • SCD Type 1: overwrite old attribute (no history)
  • SCD Type 2: create a new row with effective dates to preserve history
  • SCD Type 0: never change; Type 3: limited history in new columns
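
A minimal SCD Type 2 maintenance sketch, assuming a hypothetical stg_customers table holding today's source snapshot and a dim_customer with effective_from, effective_to, and is_current housekeeping columns. Production pipelines usually wrap this in a MERGE or a tool like dbt snapshots, and generate the surrogate key from a sequence or identity column:

    -- Step 1: expire the current row for customers whose tracked attribute changed
    UPDATE dim_customer
    SET effective_to = CURRENT_DATE,
        is_current   = FALSE
    WHERE is_current = TRUE
      AND EXISTS (
            SELECT 1 FROM stg_customers s
            WHERE s.customer_id = dim_customer.customer_id
              AND s.segment <> dim_customer.segment);

    -- Step 2 (run after step 1): open a new current row for changed and new customers
    INSERT INTO dim_customer
      (customer_id, segment, region, effective_from, effective_to, is_current)
    SELECT s.customer_id, s.segment, s.region,
           CURRENT_DATE, DATE '9999-12-31', TRUE
    FROM stg_customers s
    LEFT JOIN dim_customer d
      ON d.customer_id = s.customer_id AND d.is_current = TRUE
    WHERE d.customer_id IS NULL;   -- no current row left = changed or brand new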

Star vs Snowflake vs Wide tables

  • Star: denormalized dimensions; simplest joins and great performance
  • Snowflake: normalized dimensions; saves storage but adds joins
  • Wide tables: single flattened table; fastest for specific workloads but rigid

Kimball, Data Vault, and choosing

  • Kimball (dimensional): best for analytics consumption and BI
  • Data Vault: best for ingesting change from multiple systems with auditability; often followed by dimensional marts for analytics
  • Choose based on team skills, lineage/audit needs, and time-to-insight

Schema-on-write vs schema-on-read

  • Warehouses (schema-on-write): model first for consistent fast queries
  • Lakes (schema-on-read): flexible raw storage + curated modeled layers
  • ELT layering: staging (raw), core (cleaned/conformed), marts (analytics)
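
A sketch of that layering in SQL, assuming hypothetical raw_orders source data and one schema per layer (tools like dbt manage these layers declaratively):

    -- Staging: raw data, lightly renamed and typed, no business logic
    CREATE VIEW staging.stg_orders AS
    SELECT order_id, customer_id, CAST(ordered_at AS DATE) AS order_date, status
    FROM raw.raw_orders;

    -- Core: cleaned and conformed (dedupe, drop test data)
    CREATE VIEW core.orders AS
    SELECT DISTINCT order_id, customer_id, order_date, status
    FROM staging.stg_orders
    WHERE status <> 'test';

    -- Mart: analytics-ready rollup for BI
    CREATE TABLE marts.daily_orders AS
    SELECT order_date, COUNT(*) AS order_count
    FROM core.orders
    GROUP BY order_date;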

Partitioning, clustering, storage formats

  • Partition large facts by date or another low-cardinality column your queries filter on
  • Cluster/sort by frequently filtered columns to prune scans
  • Columnar formats (Parquet/ORC) and compression reduce cost
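
For example, in a BigQuery-style dialect (Snowflake uses CLUSTER BY alone; Redshift uses DISTKEY/SORTKEY), a partitioned and clustered event fact might look like this sketch:

    -- Partitioning lets date filters skip whole partitions;
    -- clustering prunes within a partition on common filter columns
    CREATE TABLE fact_events (
      event_id    STRING,
      occurred_at TIMESTAMP,
      user_key    INT64,
      event_name  STRING
    )
    PARTITION BY DATE(occurred_at)
    CLUSTER BY event_name, user_key;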

Semantic models and metrics

  • Define metrics once (e.g., revenue = SUM(net_amount))
  • Standardize dimensions (time, product, customer) for consistent slicing
  • Document filters and time boundaries (e.g., exclude test orders)
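
One lightweight way to define a metric once is a thin view that every dashboard reads, so filters live in exactly one place. A sketch, assuming a hypothetical is_test_order flag on the fact (semantic layers such as dbt metrics or LookML express the same idea declaratively):

    -- Canonical daily revenue: one definition, reused everywhere
    CREATE VIEW metrics_daily_revenue AS
    SELECT d.date,
           SUM(f.net_amount) AS revenue        -- revenue = SUM(net_amount)
    FROM fact_order_lines f
    JOIN dim_date d ON d.date_key = f.order_date_key
    WHERE f.is_test_order = FALSE              -- documented filter: exclude test orders
    GROUP BY d.date;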

Worked examples

Example 1: E-commerce sales star schema

Goal: Daily revenue by product category and channel.

  • Grain: one row per order_line_id
  • Fact: fact_order_lines(order_line_id, order_id, order_date_key, product_key, customer_key, channel_key, qty, unit_price, discount, net_amount)
  • Dimensions: dim_date(date_key, date, week, month, year); dim_product(product_key, sku, name, category, brand); dim_customer(customer_key, segment, region); dim_channel(channel_key, source, medium, campaign)
  • Metric: revenue = SUM(net_amount)
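
With this star in place, the goal question is one join per dimension:

    -- Daily revenue by product category and channel
    SELECT d.date,
           p.category,
           c.source AS channel,
           SUM(f.net_amount) AS revenue
    FROM fact_order_lines f
    JOIN dim_date    d ON d.date_key    = f.order_date_key
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_channel c ON c.channel_key = f.channel_key
    GROUP BY d.date, p.category, c.source;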

Why this works

The grain matches the business questions (line-level revenue). Dimensions provide slice-and-dice without changing the grain. Surrogate keys keep joins stable even if SKUs or campaign names change.

Example 2: Event analytics (web/app)

Goal: Sessions and conversion funnels.

  • Grain: one row per event_id
  • Fact: fact_events(event_id, occurred_at, user_key, session_key, event_name, page, device_key)
  • Dimensions: dim_user(user_key, signup_date, plan, country); dim_session(session_key, session_start_at, traffic_source_key); dim_device(device_key, os, browser, app_version)
  • Derived marts: fact_sessions (one row per session), fact_funnels (one row per user per funnel step)
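
If events arrive without a session id, a session_key can be derived with window functions. A sketch using a 30-minute inactivity timeout (interval syntax varies by engine):

    -- A new session starts after 30+ minutes of inactivity per user
    WITH ordered AS (
      SELECT user_key, event_id, occurred_at,
             LAG(occurred_at) OVER (PARTITION BY user_key ORDER BY occurred_at) AS prev_at
      FROM fact_events
    )
    SELECT user_key, event_id, occurred_at,
           SUM(CASE WHEN prev_at IS NULL
                      OR occurred_at > prev_at + INTERVAL '30' MINUTE
                    THEN 1 ELSE 0 END)
             OVER (PARTITION BY user_key ORDER BY occurred_at) AS session_seq
    FROM ordered;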

Key note

Events are noisy. Keep raw events, then derive sessions/funnels so metrics align. Partition by event date for performance.

Example 3: SCD Type 2 for product pricing

Goal: Keep historical price changes for accurate past revenue.

  • dim_product_price(product_price_key, product_key, list_price, effective_from, effective_to, is_current)
  • fact_order_lines references product_price_key captured at order time

Why SCD2 here

Past orders should use the price valid at the time of purchase. SCD2 records let reports reflect historical truth.
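
If the price key was not captured at order time, the same “as was” lookup can be done with a range join against the SCD2 rows. A sketch, assuming the fact carries an order_date column (or joins through dim_date to get one):

    -- Pick the price row that was valid on the order date
    SELECT f.order_line_id, f.qty, pp.list_price
    FROM fact_order_lines f
    JOIN dim_product_price pp
      ON pp.product_key = f.product_key
     AND f.order_date >= pp.effective_from
     AND f.order_date <  pp.effective_to;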

Try it: Exercises

Complete the exercise below. After you finish, compare with the provided solution and update your model if needed.

  • Exercise 1 (in this page's Exercises section): Design a subscription analytics model with churn and MRR.

Self-check before viewing solution
  • Is your fact table's grain explicit and consistent?
  • Can you compute MRR, churn, and upgrades without row duplication?
  • Did you choose appropriate SCD types for customer plan changes?

Common mistakes and how to self-check

  • Unclear grain: If two rows can represent the same event, you’ll double-count. Fix by writing the grain in one sentence at the top of each fact, and test it (see the grain-check query after this list).
  • Leaky dimensions: Joining a dimension at a different grain (e.g., daily) to a monthly fact causes duplicates. Fix by aligning on conformed keys and grain.
  • Forgetting surrogate keys: Natural keys change. Use surrogate keys in dimensions and persist them in facts.
  • Over-normalizing analytics marts: Snowflaking too early hurts usability. Prefer star unless you have a proven need.
  • Not capturing history: Type 1 updates break historical accuracy. Use SCD2 where downstream reports need “as was.”
  • Performance blind spots: No partitioning/clustering causes high cost. Partition large facts by date and cluster by common filters.
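
A quick grain audit for the first point above: if the declared grain key truly identifies one row, this query returns nothing:

    -- Any output row means duplicates at the declared grain
    SELECT order_line_id, COUNT(*) AS row_count
    FROM fact_order_lines
    GROUP BY order_line_id
    HAVING COUNT(*) > 1;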

Quick self-audit checklist
  • I can write the fact grain in one sentence
  • Every join path is clear and uses surrogate keys
  • Metrics definitions are documented and tested
  • Large facts are partitioned by date
  • Dimensions with mutable attributes use the right SCD type

Practical projects

  1. Sales Analytics Mart
    • Build staging tables for orders, products, customers
    • Create dim_date, dim_product (SCD2 for category), dim_customer
    • Create fact_order_lines with net_amount and tax
    • Validate revenue and order counts against source
  2. Event Funnel Mart
    • Ingest raw events, define sessions, derive funnel steps
    • Create dim_user, dim_device, fact_events, fact_sessions
    • Materialize a funnel table with step timestamps and outcomes
  3. Subscription Metrics Mart
    • Create dim_customer, dim_plan (SCD2), fact_subscriptions with MRR
    • Materialize a monthly snapshot for churn, upgrades, downgrades

Mini tasks inside projects
  • Add a surrogate key generator step and test idempotency
  • Write a query that returns your top 5 tables by total bytes scanned; then add clustering and re-run
  • Implement a small metrics YAML/JSON that defines revenue and active users, then validate across two different queries

Mini challenge

You ingest refunds 1–7 days after orders. Design how refunds should appear in your sales fact to keep daily revenue trustworthy. Write one sentence for the grain, list new columns needed, and describe how you’ll avoid double counting when refunds arrive late.

Learning path

  • Master SQL for analytics (window functions, CTEs)
  • Dimensional modeling (facts/dimensions, grain, SCD)
  • Performance tuning (partitions, clustering, file formats)
  • Semantic modeling and metric definitions
  • Data quality tests and documentation

Next steps

  • Do the exercise below and check your work
  • Take the Quick Test to confirm understanding
  • Note: The test is available to everyone; only logged-in users get saved progress

Practice Exercises

1 exercise to complete

Instructions

You work for a SaaS product. The business needs monthly MRR, churn rate, and upgrades/downgrades by plan and region.

  • Define the grain for the core fact(s). Choose either a daily subscription snapshot or one row per subscription event (start, upgrade, cancel).
  • List your fact and dimension tables. Include surrogate keys.
  • Specify SCD handling where attributes change (e.g., plan price, customer segment).
  • Provide column lists for each table (5–12 columns/table is fine). Mark keys and metrics.
  • Show how to compute MRR, churn, and upgrades in simple SQL or pseudocode.

Constraints:

  • Late-arriving events are possible (e.g., backfilled cancels)
  • A customer can hold multiple subscriptions over time
  • Performance matters for month-level dashboards

Expected Output
A clear schema proposal with: (1) grain statement, (2) a star schema of 1–2 facts and 3–5 dimensions, (3) SCD choices, (4) example SQL for MRR/churn/upgrades, (5) brief note on handling late-arriving data.

Data Modeling For Analytics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

