
Data Modeling For Analytics

Learn Data Modeling For Analytics for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Who this is for

Data Engineers and analytics-focused developers who build reliable data layers for BI dashboards, self-serve analysis, and ML features. If you work with warehouses, ELT/ETL pipelines, or metrics layers, this is for you.

Prerequisites

  • Comfort writing SQL (SELECT, JOINs, GROUP BY, window functions)
  • Basic understanding of data warehouses/lakes and ETL/ELT layers (staging, curated, marts)
  • Familiarity with business metrics (e.g., revenue, active users)

Why this matters

In real Data Engineer work you will:

  • Design star schemas so analysts can answer questions in one query
  • Define the grain and keys that make joins simple and performant
  • Handle Slowly Changing Dimensions (SCD) so historical reports remain accurate
  • Balance cost/performance using partitioning, clustering, and denormalization
  • Create analytic-ready marts that BI tools understand without complex SQL

Typical on-the-job tasks
  • Translate product metrics into facts/dimensions with a well-defined grain
  • Add new attributes to a dimension without breaking existing dashboards
  • Model event data (page views, clicks) for session and funnel analysis
  • Build a semantic layer that defines consistent metrics and time filters

Concept explained simply

Data modeling for analytics is organizing raw data into predictable shapes that make common questions fast and easy to answer. You pick a unit of analysis (the grain), store numbers that can be aggregated (facts), and add context (dimensions) that people use to slice and filter. Good models make the “right” query obvious and performant.

Mental model

Think of your data like a board game:

  • Board squares = facts (each square is one measurable event or state)
  • Cards = dimensions (describe who/what/when/where about the squares)
  • Rules = grain (what exactly one row represents)
  • Score = metrics (how you add up the numbers)

If the rules (grain) are unclear or the cards (dimensions) don’t match the squares (facts), the game breaks—queries become slow or wrong.

Core concepts quickly

Dimensional modeling

  • Fact tables: numeric measures at a defined grain (e.g., one row per order line)
  • Dimension tables: descriptive attributes (e.g., product, customer, date)
  • Grain: the most important decision; every column in a fact must match this level
  • Star schema: facts at center, dimensions around; simple joins, great for BI
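
To make this concrete, here is a minimal DDL sketch of a two-table star in generic SQL (exact types and constraint syntax vary by warehouse). It borrows the order-line example developed later on this page, trimmed to a few columns:

    -- Dimension: one row per product; surrogate key plus descriptive attributes
    CREATE TABLE dim_product (
      product_key INT,             -- surrogate key, referenced by facts
      sku         VARCHAR(64),     -- natural key from the source system
      name        VARCHAR(255),
      category    VARCHAR(100),
      brand       VARCHAR(100)
    );

    -- Fact: one row per order line (the grain); measures plus dimension keys
    CREATE TABLE fact_order_lines (
      order_line_id  BIGINT,         -- grain key
      order_date_key INT,            -- joins dim_date
      product_key    INT,            -- joins dim_product
      qty            INT,
      unit_price     DECIMAL(12,2),
      net_amount     DECIMAL(12,2)   -- the measure behind "revenue"
    );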

Surrogate keys and Slowly Changing Dimensions (SCD)

  • Use surrogate keys (integer IDs) for dimensions and reference them in facts
  • SCD Type 1: overwrite old attribute (no history)
  • SCD Type 2: create a new row with effective dates to preserve history
  • SCD Type 0: never change; Type 3: limited history in new columns
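
A minimal SCD Type 2 maintenance sketch, assuming a hypothetical stg_customers table holding today's source snapshot and a dim_customer with effective_from, effective_to, and is_current housekeeping columns. Production pipelines usually wrap this in a MERGE or a tool like dbt snapshots, and generate the surrogate key from a sequence or identity column:

    -- Step 1: expire the current row for customers whose tracked attribute changed
    UPDATE dim_customer
    SET effective_to = CURRENT_DATE,
        is_current   = FALSE
    WHERE is_current = TRUE
      AND EXISTS (
            SELECT 1 FROM stg_customers s
            WHERE s.customer_id = dim_customer.customer_id
              AND s.segment <> dim_customer.segment);

    -- Step 2 (run after step 1): open a new current row for changed and new customers
    INSERT INTO dim_customer
      (customer_id, segment, region, effective_from, effective_to, is_current)
    SELECT s.customer_id, s.segment, s.region,
           CURRENT_DATE, DATE '9999-12-31', TRUE
    FROM stg_customers s
    LEFT JOIN dim_customer d
      ON d.customer_id = s.customer_id AND d.is_current = TRUE
    WHERE d.customer_id IS NULL;   -- no current row left = changed or brand new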

Star vs Snowflake vs Wide tables

  • Star: denormalized dimensions; simplest joins and great performance
  • Snowflake: normalized dimensions; saves storage but adds joins
  • Wide tables: single flattened table; fastest for specific workloads but rigid

Kimball, Data Vault, and choosing

  • Kimball (dimensional): best for analytics consumption and BI
  • Data Vault: best for ingesting change from multiple systems with auditability; often followed by dimensional marts for analytics
  • Choose based on team skills, lineage/audit needs, and time-to-insight

Schema-on-write vs schema-on-read

  • Warehouses (schema-on-write): model first for consistent fast queries
  • Lakes (schema-on-read): flexible raw storage + curated modeled layers
  • ELT layering: staging (raw), core (cleaned/conformed), marts (analytics)
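
A sketch of that layering in SQL, assuming hypothetical raw_orders source data and one schema per layer (tools like dbt manage these layers declaratively):

    -- Staging: raw data, lightly renamed and typed, no business logic
    CREATE VIEW staging.stg_orders AS
    SELECT order_id, customer_id, CAST(ordered_at AS DATE) AS order_date, status
    FROM raw.raw_orders;

    -- Core: cleaned and conformed (dedupe, drop test data)
    CREATE VIEW core.orders AS
    SELECT DISTINCT order_id, customer_id, order_date, status
    FROM staging.stg_orders
    WHERE status <> 'test';

    -- Mart: analytics-ready rollup for BI
    CREATE TABLE marts.daily_orders AS
    SELECT order_date, COUNT(*) AS order_count
    FROM core.orders
    GROUP BY order_date;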

Partitioning, clustering, storage formats

  • Partition large facts by date or another low-cardinality column your queries filter on
  • Cluster/sort by frequently filtered columns to prune scans
  • Columnar formats (Parquet/ORC) and compression reduce cost
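
For example, in a BigQuery-style dialect (Snowflake uses CLUSTER BY alone; Redshift uses DISTKEY/SORTKEY), a partitioned and clustered event fact might look like this sketch:

    -- Partitioning lets date filters skip whole partitions;
    -- clustering prunes within a partition on common filter columns
    CREATE TABLE fact_events (
      event_id    STRING,
      occurred_at TIMESTAMP,
      user_key    INT64,
      event_name  STRING
    )
    PARTITION BY DATE(occurred_at)
    CLUSTER BY event_name, user_key;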

Semantic models and metrics

  • Define metrics once (e.g., revenue = SUM(net_amount))
  • Standardize dimensions (time, product, customer) for consistent slicing
  • Document filters and time boundaries (e.g., exclude test orders)
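
One lightweight way to define a metric once is a thin view that every dashboard reads, so filters live in exactly one place. A sketch, assuming a hypothetical is_test_order flag on the fact (semantic layers such as dbt metrics or LookML express the same idea declaratively):

    -- Canonical daily revenue: one definition, reused everywhere
    CREATE VIEW metrics_daily_revenue AS
    SELECT d.date,
           SUM(f.net_amount) AS revenue        -- revenue = SUM(net_amount)
    FROM fact_order_lines f
    JOIN dim_date d ON d.date_key = f.order_date_key
    WHERE f.is_test_order = FALSE              -- documented filter: exclude test orders
    GROUP BY d.date;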

Worked examples

Example 1: E-commerce sales star schema

Goal: Daily revenue by product category and channel.

  • Grain: one row per order_line_id
  • Fact: fact_order_lines(order_line_id, order_id, order_date_key, product_key, customer_key, channel_key, qty, unit_price, discount, net_amount)
  • Dimensions: dim_date(date_key, date, week, month, year); dim_product(product_key, sku, name, category, brand); dim_customer(customer_key, segment, region); dim_channel(channel_key, source, medium, campaign)
  • Metric: revenue = SUM(net_amount)
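
With this star in place, the goal question is one join per dimension:

    -- Daily revenue by product category and channel
    SELECT d.date,
           p.category,
           c.source AS channel,
           SUM(f.net_amount) AS revenue
    FROM fact_order_lines f
    JOIN dim_date    d ON d.date_key    = f.order_date_key
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_channel c ON c.channel_key = f.channel_key
    GROUP BY d.date, p.category, c.source;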

Why this works

The grain matches the business questions (line-level revenue). Dimensions provide slice-and-dice without changing the grain. Surrogate keys keep joins stable even if SKUs or campaign names change.

Example 2: Event analytics (web/app)

Goal: Sessions and conversion funnels.

  • Grain: one row per event_id
  • Fact: fact_events(event_id, occurred_at, user_key, session_key, event_name, page, device_key)
  • Dimensions: dim_user(user_key, signup_date, plan, country); dim_session(session_key, session_start_at, traffic_source_key); dim_device(device_key, os, browser, app_version)
  • Derived marts: fact_sessions (one row per session), fact_funnels (one row per user per funnel step)
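
If events arrive without a session id, a session_key can be derived with window functions. A sketch using a 30-minute inactivity timeout (interval syntax varies by engine):

    -- A new session starts after 30+ minutes of inactivity per user
    WITH ordered AS (
      SELECT user_key, event_id, occurred_at,
             LAG(occurred_at) OVER (PARTITION BY user_key ORDER BY occurred_at) AS prev_at
      FROM fact_events
    )
    SELECT user_key, event_id, occurred_at,
           SUM(CASE WHEN prev_at IS NULL
                      OR occurred_at > prev_at + INTERVAL '30' MINUTE
                    THEN 1 ELSE 0 END)
             OVER (PARTITION BY user_key ORDER BY occurred_at) AS session_seq
    FROM ordered;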

Key note

Events are noisy. Keep raw events, then derive sessions/funnels so metrics align. Partition by event date for performance.

Example 3: SCD Type 2 for product pricing

Goal: Keep historical price changes for accurate past revenue.

  • dim_product_price(product_price_key, product_key, list_price, effective_from, effective_to, is_current)
  • fact_order_lines references product_price_key captured at order time

Why SCD2 here

Past orders should use the price valid at the time of purchase. SCD2 records let reports reflect historical truth.
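
If the price key was not captured at order time, the same “as was” lookup can be done with a range join against the SCD2 rows. A sketch, assuming the fact carries an order_date column (or joins through dim_date to get one):

    -- Pick the price row that was valid on the order date
    SELECT f.order_line_id, f.qty, pp.list_price
    FROM fact_order_lines f
    JOIN dim_product_price pp
      ON pp.product_key = f.product_key
     AND f.order_date >= pp.effective_from
     AND f.order_date <  pp.effective_to;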

Try it: Exercises

Complete the exercise below. After you finish, compare with the provided solution and update your model if needed.

  • Exercise 1 (in this page's Exercises section): Design a subscription analytics model with churn and MRR.

Self-check before viewing solution
  • Is your fact table's grain explicit and consistent?
  • Can you compute MRR, churn, and upgrades without row duplication?
  • Did you choose appropriate SCD types for customer plan changes?

Common mistakes and how to self-check

  • Unclear grain: If two rows can represent the same event, you’ll double-count. Fix by writing the grain in one sentence at the top of each fact, and test it (see the grain-check query after this list).
  • Leaky dimensions: Joining a dimension at a different grain (e.g., daily) to a monthly fact causes duplicates. Fix by aligning on conformed keys and grain.
  • Forgetting surrogate keys: Natural keys change. Use surrogate keys in dimensions and persist them in facts.
  • Over-normalizing analytics marts: Snowflaking too early hurts usability. Prefer star unless you have a proven need.
  • Not capturing history: Type 1 updates break historical accuracy. Use SCD2 where downstream reports need “as was.”
  • Performance blind spots: No partitioning/clustering causes high cost. Partition large facts by date and cluster by common filters.
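
A quick grain audit for the first point above: if the declared grain key truly identifies one row, this query returns nothing:

    -- Any output row means duplicates at the declared grain
    SELECT order_line_id, COUNT(*) AS row_count
    FROM fact_order_lines
    GROUP BY order_line_id
    HAVING COUNT(*) > 1;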

Quick self-audit checklist
  • I can write the fact grain in one sentence
  • Every join path is clear and uses surrogate keys
  • Metrics definitions are documented and tested
  • Large facts are partitioned by date
  • Dimensions with mutable attributes use the right SCD type

Practical projects

  1. Sales Analytics Mart
    • Build staging tables for orders, products, customers
    • Create dim_date, dim_product (SCD2 for category), dim_customer
    • Create fact_order_lines with net_amount and tax
    • Validate revenue and order counts against source
  2. Event Funnel Mart
    • Ingest raw events, define sessions, derive funnel steps
    • Create dim_user, dim_device, fact_events, fact_sessions
    • Materialize a funnel table with step timestamps and outcomes
  3. Subscription Metrics Mart
    • Create dim_customer, dim_plan (SCD2), fact_subscriptions with MRR
    • Materialize a monthly snapshot for churn, upgrades, downgrades

Mini tasks inside projects
  • Add a surrogate key generator step and test idempotency
  • Write a query that returns your top 5 tables by total bytes scanned; then add clustering and re-run
  • Implement a small metrics YAML/JSON that defines revenue and active users, then validate across two different queries

Mini challenge

You ingest refunds 1–7 days after orders. Design how refunds should appear in your sales fact to keep daily revenue trustworthy. Write one sentence for the grain, list new columns needed, and describe how you’ll avoid double counting when refunds arrive late.

Learning path

  • Master SQL for analytics (window functions, CTEs)
  • Dimensional modeling (facts/dimensions, grain, SCD)
  • Performance tuning (partitions, clustering, file formats)
  • Semantic modeling and metric definitions
  • Data quality tests and documentation

Next steps

  • Do the exercise below and check your work
  • Take the Quick Test to confirm understanding
  • Note: The test is available to everyone; only logged-in users get saved progress

Practice Exercises

1 exercise to complete

Instructions

You work for a SaaS product. The business needs monthly MRR, churn rate, and upgrades/downgrades by plan and region.

  • Define the grain for the core fact(s). Choose either a daily subscription snapshot or one row per subscription event (start, upgrade, cancel).
  • List your fact and dimension tables. Include surrogate keys.
  • Specify SCD handling where attributes change (e.g., plan price, customer segment).
  • Provide column lists for each table (5–12 columns/table is fine). Mark keys and metrics.
  • Show how to compute MRR, churn, and upgrades in simple SQL or pseudocode.

Constraints:

  • Late-arriving events are possible (e.g., backfilled cancels)
  • A customer can hold multiple subscriptions over time
  • Performance matters for month-level dashboards

Expected Output
A clear schema proposal with: (1) grain statement, (2) a star schema of 1–2 facts and 3–5 dimensions, (3) SCD choices, (4) example SQL for MRR/churn/upgrades, (5) brief note on handling late-arriving data.

Data Modeling For Analytics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

