Why Data Modeling matters for a BI Developer
Great BI is impossible without a clean data model. As a BI Developer, your dashboards, metrics, and self-serve exploration all depend on a model that is consistent, predictable, and fast. Good models reduce ad-hoc fixes, prevent double counting, and let business users answer their own questions safely.
- Unlocks trustworthy KPIs (one definition, reused everywhere)
- Speeds up dashboards and queries (star schemas, proper grain)
- Enables self-serve analytics (semantic models with business names)
- Simplifies maintenance (clear relationships and change handling)
What you’ll learn
- Designing star schemas for reporting
- Defining the correct grain of fact tables
- Building conformed dimensions and hierarchies
- Handling Slowly Changing Dimensions (SCD) basics
- Designing semantic models for self-serve analytics
- Separating measures from dimensions cleanly
- Using surrogate keys and designing relationships
- Mapping business logic into the model safely
Who this is for
- BI Developers building dashboards, reports, and semantic layers
- Analysts moving from ad-hoc SQL toward robust, reusable models
- Data practitioners who support self-serve analytics in a BI tool
Prerequisites
- Comfort with SQL SELECT, JOIN, GROUP BY, and basic window functions
- Understanding of your business domains (sales, marketing, support, etc.)
- Basic knowledge of your BI tool’s semantic layer and data sources
Learning path (roadmap)
- Capture reporting needs: list core KPIs, supporting dimensions (who/what/when/where), and how users slice results.
- Choose a star schema: identify fact tables (events, transactions, snapshots) and conformed dimensions (Date, Customer, Product, etc.).
- Define fact grain: write one sentence per fact ("One row per <thing> per <time unit>") and validate it against the KPIs.
- Build dimensions and hierarchies: add surrogate keys, natural keys, attributes, and hierarchies (e.g., Country → Region).
- Handle changes (SCD): start with SCD Type 2 for business-critical attributes; add valid_from/valid_to/is_current.
- Separate measures vs dimensions: make numeric aggregations measures; keep descriptors as attributes. Avoid mixed-use fields.
- Design the semantic model: expose friendly names, set default aggregations, hide technical columns, define relationships.
- Validate and iterate: recalculate KPIs, reconcile to source-of-truth, and add tests for counts, uniqueness, and referential integrity.
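As a concrete example of the validation step, a minimal grain/uniqueness test checks that the declared grain key is unique in the fact table. The table and column names below follow the fact_sales example used later in this guide; adapt them to your own model.
-- Grain test: expect zero rows; any result is a grain violation
SELECT order_line_id, COUNT(*) AS row_count
FROM fact_sales
GROUP BY order_line_id
HAVING COUNT(*) > 1;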
Quick checks you can do at each milestone
- Can you describe the grain out loud in one sentence?
- Is every measure tied to the correct fact table?
- Are dimensions conformed across facts (one Customer dimension, reused)?
- Are changing attributes handled (SCD) where needed?
- Do default aggregations match business expectations?
Worked examples
Example 1: Sales star schema from requirements
Requirement: Report Revenue, Orders, Units by Product, Customer, and Day. Drill down by Category and Region.
Grain choice: FactSales = one row per order_line_id.
-- Dimensions (surrogate keys, conformed for reuse)
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- e.g., 20250131
date_value DATE NOT NULL,
day_of_week INT,
week_of_year INT,
month INT,
quarter INT,
year INT
);
CREATE TABLE dim_product (
product_sk BIGINT PRIMARY KEY,
product_nk VARCHAR(50) UNIQUE, -- natural key from source
product_name VARCHAR(255),
category VARCHAR(100),
subcategory VARCHAR(100)
);
CREATE TABLE dim_customer (
customer_sk BIGINT PRIMARY KEY,
customer_nk VARCHAR(50) UNIQUE,
customer_name VARCHAR(255),
region VARCHAR(100)
);
-- Fact at order line grain
CREATE TABLE fact_sales (
order_line_id BIGINT PRIMARY KEY,
order_nk VARCHAR(50),
date_key INT NOT NULL REFERENCES dim_date(date_key),
product_sk BIGINT NOT NULL REFERENCES dim_product(product_sk),
customer_sk BIGINT NOT NULL REFERENCES dim_customer(customer_sk),
quantity INT,
unit_price NUMERIC(12,2),
revenue NUMERIC(14,2) GENERATED ALWAYS AS (quantity * unit_price) STORED -- computed once, stored with the row
);
Measures: Revenue (SUM), Units (SUM), Orders (COUNT DISTINCT order_nk).
Why order line grain?
It avoids double counting when multiple products exist per order and supports product-level analysis while still enabling order-level KPIs with COUNT DISTINCT.
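For example, line-level and order-level KPIs can be read from the same line-grain fact; a minimal sketch against the tables above:
-- Revenue, Units, and Orders by day from the order-line grain
SELECT
  d.date_value,
  SUM(fs.revenue) AS revenue,             -- line-level measure, safe to sum
  SUM(fs.quantity) AS units,
  COUNT(DISTINCT fs.order_nk) AS orders   -- order-level KPI without double counting
FROM fact_sales fs
JOIN dim_date d ON d.date_key = fs.date_key
GROUP BY d.date_value;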
Example 2: SCD Type 2 basics for Customer
CREATE TABLE dim_customer_scd (
customer_sk BIGINT PRIMARY KEY,
customer_nk VARCHAR(50),
customer_name VARCHAR(255),
region VARCHAR(100), -- may change over time
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP,
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
-- Pseudocode for an upsert pattern
-- 1) End current record when a change is detected
UPDATE dim_customer_scd d
SET valid_to = NOW(), is_current = FALSE
FROM staging_customer s
WHERE d.customer_nk = s.customer_nk
AND d.is_current = TRUE
AND (d.region IS DISTINCT FROM s.region
OR d.customer_name IS DISTINCT FROM s.customer_name);
-- 2) Insert a new current record
INSERT INTO dim_customer_scd (
customer_sk, customer_nk, customer_name, region, valid_from, valid_to, is_current)
SELECT nextval('customer_sk_seq'), s.customer_nk, s.customer_name, s.region, NOW(), NULL, TRUE
FROM staging_customer s
LEFT JOIN dim_customer_scd d
ON d.customer_nk = s.customer_nk AND d.is_current = TRUE
WHERE d.customer_nk IS NULL
OR (d.region IS DISTINCT FROM s.region
OR d.customer_name IS DISTINCT FROM s.customer_name);
Querying with SCD2
Join facts to the current dimension version with is_current = TRUE, or to the historically accurate version by joining on the natural key and filtering the fact's transaction date BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31').
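A minimal sketch of both patterns, assuming the fact also carries the customer natural key (customer_nk) and that transaction dates come from dim_date; the fact_sales DDL above does not include customer_nk, so adjust to however your fact stores it:
-- As-is reporting: every sale attributed to the customer's current region
SELECT c.region, SUM(fs.revenue) AS revenue
FROM fact_sales fs
JOIN dim_customer_scd c
  ON c.customer_nk = fs.customer_nk   -- assumed natural key on the fact
 AND c.is_current = TRUE
GROUP BY c.region;
-- As-was reporting: every sale attributed to the region valid at the time of sale
SELECT c.region, SUM(fs.revenue) AS revenue
FROM fact_sales fs
JOIN dim_date d ON d.date_key = fs.date_key
JOIN dim_customer_scd c
  ON c.customer_nk = fs.customer_nk
 AND d.date_value BETWEEN c.valid_from AND COALESCE(c.valid_to, TIMESTAMP '9999-12-31')
GROUP BY c.region;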
Example 3: Measures vs dimensions separation
-- View to expose clean measures
CREATE VIEW mart_sales_measures AS
SELECT
fs.product_sk,
fs.customer_sk,
fs.date_key,
SUM(fs.quantity) AS units,
SUM(fs.revenue) AS revenue,
COUNT(DISTINCT fs.order_nk) AS orders,
CASE WHEN SUM(fs.quantity) = 0 THEN NULL
ELSE SUM(fs.revenue) / SUM(fs.quantity) END AS avg_price
FROM fact_sales fs
GROUP BY 1,2,3;
Keep descriptors (like product_name, region) in dimensions; keep aggregations in a measures view or semantic model. This avoids mixed-use fields and keeps default aggregations in the BI tool predictable.
Example 4: Semantic model for self-serve
Expose business-friendly names, hide technical keys, and set default aggregations.
-- Example: final semantic view for BI
CREATE VIEW bi_sales AS
SELECT
d.date_value AS "Date",
p.product_name AS "Product",
p.category AS "Category",
c.customer_name AS "Customer",
c.region AS "Region",
m.units AS "Units", -- SUM default
m.revenue AS "Revenue", -- SUM default
m.orders AS "Orders", -- COUNT default
m.avg_price AS "Avg Price" -- AVERAGE default (precomputed)
FROM mart_sales_measures m
JOIN dim_date d ON d.date_key = m.date_key
JOIN dim_product p ON p.product_sk = m.product_sk
JOIN dim_customer c ON c.customer_sk = m.customer_sk;
Tip: role-playing dates
If you need Order Date and Ship Date, add two foreign keys in the fact pointing to the same dim_date and alias the columns as Order Date and Ship Date in the semantic view.
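A minimal sketch of the pattern, assuming the fact carries an additional ship_date_key column (not in the fact_sales DDL above):
-- Role-playing dates: join the same dim_date twice under different aliases
SELECT
  od.date_value AS "Order Date",
  sd.date_value AS "Ship Date",
  SUM(fs.revenue) AS "Revenue"
FROM fact_sales fs
JOIN dim_date od ON od.date_key = fs.date_key       -- order date role
JOIN dim_date sd ON sd.date_key = fs.ship_date_key  -- ship date role (assumed column)
GROUP BY od.date_value, sd.date_value;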
Example 5: Many-to-many with a bridge
Customers can belong to multiple Segments; you want Segment analysis on sales.
CREATE TABLE dim_segment (
segment_sk BIGINT PRIMARY KEY,
segment_nk VARCHAR(50) UNIQUE,
segment_name VARCHAR(100)
);
CREATE TABLE bridge_customer_segment (
customer_sk BIGINT NOT NULL REFERENCES dim_customer(customer_sk),
segment_sk BIGINT NOT NULL REFERENCES dim_segment(segment_sk),
PRIMARY KEY (customer_sk, segment_sk)
);
-- Segment-level revenue
SELECT s.segment_name, SUM(fs.revenue) AS revenue
FROM fact_sales fs
JOIN bridge_customer_segment b ON b.customer_sk = fs.customer_sk
JOIN dim_segment s ON s.segment_sk = b.segment_sk
GROUP BY 1;
Avoid double counting
When combining Segment with other many-to-many attributes, restrict to one bridge at a time or pre-aggregate to the appropriate grain to avoid duplication.
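A minimal sketch of the pre-aggregation approach: collapse the fact to the customer grain first, then attach segments, so combining the bridge with other attributes cannot multiply fact rows any further. A customer in N segments still contributes its full revenue to each of those segments, which is expected for this kind of analysis.
-- Pre-aggregate to customer grain, then join the bridge
WITH revenue_by_customer AS (
  SELECT customer_sk, SUM(revenue) AS revenue
  FROM fact_sales
  GROUP BY customer_sk
)
SELECT s.segment_name, SUM(r.revenue) AS revenue
FROM revenue_by_customer r
JOIN bridge_customer_segment b ON b.customer_sk = r.customer_sk
JOIN dim_segment s ON s.segment_sk = b.segment_sk
GROUP BY s.segment_name;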
Drills and exercises
- Write the grain statement for two facts you own (e.g., “one row per invoice line”).
- List 5 measures and map each to its correct fact table.
- Identify 3 attributes that change slowly; decide if they need SCD2.
- Create a small Date dimension with year/quarter/month/day and a surrogate key.
- Design a simple bridge table for one many-to-many relationship in your domain.
- Rename columns in a semantic view to business-friendly labels.
- Add a default aggregation for every numeric field in your BI tool.
Common mistakes and how to debug
- Undefined grain: Symptoms include inconsistent counts. Fix by writing a one-sentence grain and aligning all joins to it.
- Mixing measures and attributes: Causes wrong aggregations. Separate into measure fields and dimensional attributes.
- No conformed dimensions: Different Customer tables across facts. Create a single dim_customer reused by all facts.
- Ignoring history: Overwrites break historical reports. Use SCD2 for critical attributes (region, segment).
- Many-to-many double counting: Duplicated totals when joining bridges. Aggregate at the right grain or pre-aggregate first.
- Leaking technical keys: Confuses users. Hide surrogate keys in the semantic layer; expose business names.
- Missing default aggregations: Averages where sums are expected. Set explicit defaults (SUM, COUNT, NONE).
Debugging checklist
- Recalculate a KPI three ways (SQL, BI visual, exported CSV) and compare.
- Check referential integrity: any fact rows with missing dimension keys? (Query sketches follow this checklist.)
- Validate today’s numbers against a trusted baseline (finance close, source app).
- Scan for non-unique natural keys in dimensions; dedupe if needed.
- Test historical joins with BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31').
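Two of the checks above as runnable sketches against the example schema (adapt table and key names to your model):
-- Referential integrity: fact rows whose customer key has no matching dimension row (expect 0 rows)
SELECT fs.*
FROM fact_sales fs
LEFT JOIN dim_customer c ON c.customer_sk = fs.customer_sk
WHERE c.customer_sk IS NULL;
-- SCD sanity: more than one current row per natural key (expect 0 rows; dedupe if any appear)
SELECT customer_nk, COUNT(*) AS current_rows
FROM dim_customer_scd
WHERE is_current = TRUE
GROUP BY customer_nk
HAVING COUNT(*) > 1;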
Mini project: Sales Analytics Mart
Goal: Build a star schema for sales with a self-serve semantic view that supports Revenue, Orders, Units by Product, Customer, and Date.
- Define grain: one row per order line in fact_sales.
- Create dim_date, dim_product, dim_customer (with SCD2 for region).
- Load fact_sales and compute revenue.
- Add a bridge for customer-to-segment (optional).
- Create a measures view with revenue, orders, units, and avg_price.
- Design a semantic view with friendly names and default aggregations.
- Validate totals against the source system for a chosen period.
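A reconciliation sketch for the validation step, assuming a source-of-truth table such as source_orders with order_date and order_total columns (illustrative names; use whatever your source system exposes):
-- Compare mart revenue to the source for one month; aim for a difference under 0.5%
WITH mart AS (
  SELECT SUM(fs.revenue) AS revenue
  FROM fact_sales fs
  JOIN dim_date d ON d.date_key = fs.date_key
  WHERE d.year = 2025 AND d.month = 1
),
src AS (
  SELECT SUM(order_total) AS revenue   -- source_orders is an assumed table
  FROM source_orders
  WHERE order_date >= DATE '2025-01-01'
    AND order_date < DATE '2025-02-01'
)
SELECT m.revenue AS mart_revenue,
       s.revenue AS source_revenue,
       ABS(m.revenue - s.revenue) / NULLIF(s.revenue, 0) AS pct_diff
FROM mart m CROSS JOIN src s;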
Acceptance criteria
- All keys resolve; no orphaned fact rows.
- Revenue total matches within 0.5% of the source for last month.
- Filters by Category and Region work without double counting.
- Both current and historical region analysis are possible.
Subskills
- Star Schema For Reporting — Design facts and conformed dimensions for simple, fast reporting.
- Defining Grain For Facts — Make one clear statement per fact; drive all joins and measures by it.
- Building Dimensions And Hierarchies — Create attributes and drill paths that match how the business thinks.
- Handling Slowly Changing Dimensions Basics — Track attribute history (e.g., region changes) safely.
- Designing Semantic Models For Self Serve — Expose business-friendly fields with correct default aggregations.
- Measures Versus Dimensions Separation — Keep numeric aggregations separate from descriptive attributes.
- Surrogate Keys And Relationship Design — Use stable keys and clear relationships across tables.
- Mapping Business Logic Into The Model — Implement metric definitions and rules once, reuse everywhere.
Next steps
- Practice by modeling one domain end-to-end (orders, subscriptions, or support tickets).
- Take the skill exam to validate your understanding.
- Iterate with stakeholder feedback and add tests for grain, uniqueness, and relationships.