Why this matters
Well-modeled data makes BI dashboards fast, accurate, and easy to extend. As a Data Analyst, you will routinely need to: diagnose why a KPI looks wrong, add a new field without breaking existing visuals, join multiple sources while keeping numbers consistent, and make dashboards refresh quickly for stakeholders. Solid data modeling skills turn scattered tables into a reliable, analytical dataset.
Who this is for
- Data Analysts building or maintaining BI dashboards
- Business Analysts who need trustworthy metrics
- Anyone preparing to collaborate with data engineers on analytics-ready datasets
Prerequisites
- Basic SQL (SELECT, JOIN, GROUP BY)
- Comfort with spreadsheet pivot logic (rows, columns, values)
- Familiarity with a BI tool (any) is helpful but not required
Concept explained simply
Data modeling is arranging tables and relationships so analysis is simple and correct. Most BI models use a star schema: one big fact table of events or numeric records, linked to smaller dimension tables that describe those events (like Date, Customer, Product).
Mental model
Imagine a worksheet of transactions (fact) and several lookup sheets (dimensions). You keep the facts tall and skinny (many rows, few computed columns), and the dimensions short and wide (fewer rows, lots of descriptive attributes). You connect lookups to facts with keys so you can slice and aggregate reliably.
Key analogy
Fact table = receipts; Dimension tables = catalogs that describe things on those receipts. The receipt lines store quantities and amounts; catalogs store names, categories, and attributes.
Core building blocks
- Grain: a precise statement of what a single row in your fact represents (for example, one order line).
- Fact table: numeric events at a chosen grain (for example, sales_amount, quantity, cost).
- Dimension tables: descriptive attributes (for example, product_category, customer_segment, region).
- Keys: surrogate keys (integers) link facts to dimensions. Natural keys (like product_code) are kept as attributes.
- Relationships: many-to-one from fact to each dimension; filtering generally flows from dimensions to facts.
- Measures: aggregations defined on fact columns (for example, SUM(sales_amount), AVG(discount)).
- Slowly Changing Dimensions (SCD): preserve history when attributes change; Type 2 adds new rows with effective dates.
Worked examples
Example 1: Retail sales dashboard
- Grain: one row per order line.
- Fact: fact_sales with columns order_id, order_line_id, date_key, store_key, product_key, salesperson_key, quantity, sales_amount, discount_amount.
- Dimensions: dim_date, dim_store, dim_product, dim_salesperson.
- Why this works: You can slice sales by store, product, day, or salesperson, and add new dimensions later without changing the fact’s grain.
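The retail star above can be sketched end to end with a tiny in-memory database. This is a minimal sketch: `sqlite3` stands in for whatever warehouse or BI engine you actually use, and the table names, keys, and sample rows are invented for illustration (a trimmed-down `fact_sales` with two dimensions instead of four).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimensions are short and wide: one surrogate key per row, descriptive attributes.
# The fact is tall and skinny: one row per order line (the grain), with numeric
# measures and foreign keys out to each dimension.
cur.executescript("""
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_code TEXT, category TEXT);

CREATE TABLE fact_sales (
    order_id      INTEGER,
    order_line_id INTEGER,
    store_key     INTEGER REFERENCES dim_store(store_key),
    product_key   INTEGER REFERENCES dim_product(product_key),
    quantity        INTEGER,
    sales_amount    REAL,
    discount_amount REAL
);
""")

cur.executemany("INSERT INTO dim_store VALUES (?,?,?)",
                [(1, "Downtown", "East"), (2, "Mall", "West")])
cur.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "P-100", "Coffee"), (2, "P-200", "Tea")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?,?,?)", [
    (1001, 1, 1, 1, 2,  9.00, 0.00),
    (1001, 2, 1, 2, 1,  4.50, 0.50),
    (1002, 1, 2, 1, 3, 13.50, 1.00),
])

# Slice a fact measure by a dimension attribute: total sales by region.
# Filtering flows from the dimension (region) down to the fact rows.
rows = cur.execute("""
    SELECT s.region, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
print(rows)  # [('East', 13.5), ('West', 13.5)]
```

Adding a new dimension later (say, `dim_salesperson`) only means adding one foreign key column to the fact; the grain, and every existing query, stays the same.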
Example 2: Support tickets performance
- Goal: Track time to resolution by priority and team.
- Grain: one row per ticket status change (to analyze the lifecycle), or, more simply, one row per ticket.
- Fact: fact_tickets with ticket_key, created_date_key, resolved_date_key, team_key, priority_key, status, time_to_resolve_hours.
- Dimensions: dim_date, dim_team, dim_priority.
- Tip: If you need lifecycle details (created, assigned, resolved), consider a status-change fact with one row per status transition.
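The ticket-grain option can be sketched the same way. This is a hedged simplification of the `fact_tickets` design above: it stores raw created/resolved timestamps instead of date keys, and derives the resolution-hours measure at query time with SQLite's `julianday`; the priorities and timestamps are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_priority (priority_key INTEGER PRIMARY KEY, priority_name TEXT);

-- Grain: one row per ticket.
CREATE TABLE fact_tickets (
    ticket_key   INTEGER PRIMARY KEY,
    priority_key INTEGER REFERENCES dim_priority(priority_key),
    created_at   TEXT,
    resolved_at  TEXT
);
""")

cur.executemany("INSERT INTO dim_priority VALUES (?,?)",
                [(1, "High"), (2, "Low")])
cur.executemany("INSERT INTO fact_tickets VALUES (?,?,?,?)", [
    (1, 1, "2024-01-01 09:00", "2024-01-01 13:00"),  # 4 hours
    (2, 1, "2024-01-02 10:00", "2024-01-02 16:00"),  # 6 hours
    (3, 2, "2024-01-03 08:00", "2024-01-04 08:00"),  # 24 hours
])

# Measure: average resolution time in hours, sliced by priority.
# julianday() returns fractional days, so the difference * 24 gives hours.
rows = cur.execute("""
    SELECT p.priority_name,
           ROUND(AVG((julianday(f.resolved_at) - julianday(f.created_at)) * 24), 1)
               AS avg_hours
    FROM fact_tickets f
    JOIN dim_priority p ON p.priority_key = f.priority_key
    GROUP BY p.priority_name
    ORDER BY p.priority_name
""").fetchall()
print(rows)  # [('High', 5.0), ('Low', 24.0)]
```

If you switch to the status-change grain, the same query pattern still works; you would just filter to the transitions you care about before aggregating.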
Example 3: Marketing funnel
- Goal: Analyze conversions by campaign and day.
- Option A Grain: one row per session. Measures: session_count, conversion_count (for example, a SUM over a converted_flag column).
- Option B Grain: one row per event (page_view, sign_up). Then build measures that count specific events.
- Trade-off: Session grain is simpler; event grain is more flexible for multi-step funnels.
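Option B's flexibility can be seen in a short sketch. Assumptions here are mine, not from the source: a two-column `dim_campaign`, an event-grain `fact_events` carrying a `session_id` and an `event_type`, and invented sample events. The key move is that session-level measures are derived from event rows with conditional counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_campaign (campaign_key INTEGER PRIMARY KEY, campaign_name TEXT);

-- Grain: one row per event.
CREATE TABLE fact_events (
    event_id     INTEGER PRIMARY KEY,
    campaign_key INTEGER REFERENCES dim_campaign(campaign_key),
    session_id   TEXT,
    event_type   TEXT   -- e.g. 'page_view' or 'sign_up'
);
""")

cur.executemany("INSERT INTO dim_campaign VALUES (?,?)",
                [(1, "Spring Sale"), (2, "Newsletter")])
cur.executemany("INSERT INTO fact_events VALUES (?,?,?,?)", [
    (1, 1, "s1", "page_view"),
    (2, 1, "s1", "sign_up"),
    (3, 1, "s2", "page_view"),
    (4, 2, "s3", "page_view"),
])

# Measures built from event counts: sessions, and sessions that converted.
# CASE returns NULL for non-matching rows, which COUNT(DISTINCT ...) ignores.
rows = cur.execute("""
    SELECT c.campaign_name,
           COUNT(DISTINCT f.session_id) AS sessions,
           COUNT(DISTINCT CASE WHEN f.event_type = 'sign_up'
                               THEN f.session_id END) AS converted_sessions
    FROM fact_events f
    JOIN dim_campaign c ON c.campaign_key = f.campaign_key
    GROUP BY c.campaign_name
    ORDER BY c.campaign_name
""").fetchall()
print(rows)  # [('Newsletter', 1, 0), ('Spring Sale', 2, 1)]
```

Notice that the session-grain answer (Option A) falls out of the event-grain model for free, which is exactly the trade-off described above: more rows, more flexibility.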
Common mistakes and how to self-check
- Unclear grain: If you cannot describe what one fact row represents in one sentence, stop and define it.
- Snowflaked dimensions in BI: Over-normalizing dimensions can slow filters and complicate the model. Prefer a denormalized star for analysis.
- Mixing periodic snapshots with transactions: Do not sum balances across days; use last-known balance or changes, depending on metric.
- Missing surrogate keys: Relying only on natural keys can break joins when source systems change values.
- Incorrect SCD handling: Overwriting attributes instead of preserving history causes point-in-time reports to lie.
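The SCD mistake is easiest to see in code. Below is a minimal Type 2 sketch in plain Python rather than a specific ETL tool: the dimension is a list of row dicts, and the column names (`effective_from`, `effective_to`, `is_current`) follow the common SCD2 convention. The product and dates are hypothetical.

```python
from datetime import date

# SCD Type 2 product dimension: the surrogate key identifies a *version*
# of a product, while product_code is the natural key shared by all versions.
dim_product = [
    {"product_key": 1, "product_code": "P-100", "category": "Coffee",
     "effective_from": date(2023, 1, 1), "effective_to": None, "is_current": True},
]

def apply_scd2_change(dim, product_code, new_attrs, change_date, next_key):
    """Close out the current row and append a new version.

    The wrong (Type 1) move would be to overwrite 'category' in place,
    destroying the history that point-in-time reports depend on.
    """
    for row in dim:
        if row["product_code"] == product_code and row["is_current"]:
            row["effective_to"] = change_date
            row["is_current"] = False
    dim.append({"product_key": next_key, "product_code": product_code,
                **new_attrs,
                "effective_from": change_date, "effective_to": None,
                "is_current": True})

apply_scd2_change(dim_product, "P-100", {"category": "Beverages"},
                  date(2024, 6, 1), next_key=2)

# History survives: the old category is still queryable by date range.
assert len(dim_product) == 2
assert dim_product[0]["category"] == "Coffee" and not dim_product[0]["is_current"]
assert dim_product[1]["category"] == "Beverages" and dim_product[1]["is_current"]
```

Facts created before the change keep pointing at surrogate key 1 and report "Coffee"; facts created after it point at key 2 and report "Beverages".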
Self-check checklist
- I have a one-sentence grain statement.
- All fact-to-dimension links are many-to-one.
- Dimensions contain descriptive attributes; facts contain numeric measures and foreign keys.
- Surrogate keys exist and are unique per dimension row.
- Historical attribute changes are handled (SCD) where needed.
- Aggregations produce expected totals on sample data.
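Several of these checks are mechanical enough to script. A minimal sketch on plain Python rows, with invented sample tables: surrogate keys must be unique per dimension, every fact foreign key must hit exactly one dimension row (many-to-one), and totals on sample data should match a hand calculation.

```python
# Hypothetical sample tables for the self-check.
dim_store = [{"store_key": 1, "region": "East"},
             {"store_key": 2, "region": "West"}]
fact_sales = [{"store_key": 1, "sales_amount": 9.0},
              {"store_key": 1, "sales_amount": 4.5},
              {"store_key": 2, "sales_amount": 13.5}]

def surrogate_keys_unique(dim, key):
    """Every dimension row must have its own surrogate key."""
    keys = [row[key] for row in dim]
    return len(keys) == len(set(keys))

def facts_many_to_one(fact, dim, key):
    """Every fact foreign key must resolve to exactly one dimension row."""
    dim_keys = {row[key] for row in dim}
    return all(row[key] in dim_keys for row in fact)

assert surrogate_keys_unique(dim_store, "store_key")
assert facts_many_to_one(fact_sales, dim_store, "store_key")

# Totals on sample data match a hand calculation: 9.0 + 4.5 + 13.5 = 27.0.
assert sum(r["sales_amount"] for r in fact_sales) == 27.0
```

In practice you would run the same assertions as SQL against your warehouse (a `GROUP BY ... HAVING COUNT(*) > 1` on the key, and a left anti-join from fact to dimension), but the logic is identical.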
Exercises
Work through these on your own, then compare with the solutions in the Exercises section below.
- ex1 — Define the grain and dimensions for a retail sales dataset with fields: date, store, product, salesperson, quantity, revenue, discount, return_flag. Write your grain statement and list necessary dimensions and fact columns.
- ex2 — Sketch a star schema for subscriptions (customers on plans). Identify fact_subscriptions columns and dimension tables (customer, plan, date). Note keys and cardinalities.
- ex3 — Design a Product dimension with SCD Type 2. Specify surrogate key and the columns needed to track history.
Hints
- Start from the reporting questions: what will you slice by? Those are dimensions.
- One row = one something. Lock that down first.
- For SCD2, think: effective_from, effective_to, is_current.
Practice checklist
- I wrote a precise grain statement for each fact.
- I separated numeric facts from descriptive attributes.
- I introduced surrogate keys for each dimension.
- I noted which attributes require historical tracking.
- I validated that each relationship is many-to-one from fact to dimension.
Practical projects
- Store Sales Mini-Mart: Build fact_sales and dimensions (date, store, product, cashier). Create measures: Total Sales, Avg Discount %, Returns Count. Validate filters across all dims.
- Ticket SLA Model: Build fact_tickets (one row per ticket) and dims (date, team, priority). Create measures: Avg Resolution Hours, % Breached SLA. Add a calculated column for business hours if needed.
- Marketing Events Model: Build event-grain fact_events with dims (date, campaign, channel, device). Measures: Sessions, Sign-ups, Conversion Rate. Compare with a session-grain version and note pros/cons.
Learning path
- Data Modeling Basics (this lesson)
- Joins and Cardinality in BI models
- Measures and Calculations (for example, DAX or calculated fields)
- Performance tuning: partitions, aggregations, and model size
- Governance: conformed dimensions, data quality checks, documentation
Next steps
- Finish the exercises and confirm with the checklist.
- Take the Quick Test to check understanding.
- Apply the model to a small real dataset at work or a sample CSV.
Mini challenge
Design a model for subscription revenue with upgrades and downgrades. Choose a fact grain (for example, one row per subscription change). List required dimensions (date, customer, plan, reason). Define measures: MRR, Net MRR Change, Churn MRR. Write your grain statement in one sentence.
Quick Test
The quick test below checks your understanding of grain, facts, dimensions, and keys.