Who this is for
Analytics Engineers and BI developers who want to build clean, reliable, and reusable SQL models that power dashboards, metrics, and downstream analyses.
Prerequisites
- Comfort writing SELECT queries with JOINs, GROUP BY, and WHERE
- Basic knowledge of CTEs (WITH), views, and indexes
- Familiarity with your warehouse dialect (e.g., casting, date functions)
Why this matters
In real projects, you rarely write one-off SQL. You create models others will reuse: clean customer tables, order facts, daily metrics. Modular, reusable SQL saves hours, reduces defects, and makes changes safer. Examples of real tasks:
- Standardize raw data into staging models so all teams join on the same clean keys.
- Refactor repeated logic (like revenue calculation) into one trusted model.
- Create intermediate models that aggregate and enrich for BI dashboards.
- Ensure models run fast and deterministically during daily refreshes.
Concept explained simply
Modular SQL means splitting complex logic into smaller models with one clear responsibility. Reusable means these models are stable, named consistently, and safe for others to build on.
Mental model
Think in layers:
- Staging (stg_): Clean and standardize raw tables. Cast types, trim strings, fix nulls, rename columns.
- Intermediate (int_): Combine and enrich. Deduplicate, compute metrics, apply business rules.
- Marts (dim_/fact_): Final analytics-friendly tables for BI. Dimensions and facts with documentation-ready columns.
Each layer depends only on earlier layers. Avoid jumping over layers.
Design principles
- DRY (Don't Repeat Yourself): Put shared logic in one place (CTEs, views, or upstream models).
- Single responsibility: One primary purpose per model (e.g., stg_orders only cleans orders).
- Deterministic schemas: Explicitly select and alias columns; avoid SELECT *.
- Predictable naming: Prefix by layer and purpose (stg_, int_, dim_, fact_).
- Idempotent builds: Use create or replace for views; avoid side effects.
- Performance aware: Pre-aggregate where appropriate; ensure keys are unique before joins.
Self-check: Is my model modular?
- Can I describe the model’s single purpose in one sentence?
- Is any complex logic shared moved upstream?
- Are column names explicit and stable?
- Can another analyst reuse this without reading 5 other files?
Useful patterns and building blocks
1) CTEs to organize logic
WITH cleaned AS (
SELECT
order_id,
CAST(order_date AS DATE) AS order_date,
customer_id,
CAST(amount AS NUMERIC(12,2)) AS amount
FROM raw.orders
),
order_totals AS (
SELECT order_id, customer_id, order_date, SUM(amount) AS order_value
FROM cleaned
GROUP BY 1,2,3
)
SELECT * FROM order_totals;
2) Views for reuse
Turn frequently used logic into a view (or materialized view if supported and the query is heavy). This avoids copying logic into many places.
3) Date spine for complete time series
WITH date_spine AS (
SELECT DATEADD(day, seq, (SELECT MIN(order_date) FROM stg_orders)) AS d
FROM generate_series(0, DATEDIFF(day, (SELECT MIN(order_date) FROM stg_orders), (SELECT MAX(order_date) FROM stg_orders))) AS t(seq)
)
SELECT d FROM date_spine;
Join your facts to the spine to include zero-activity days.
4) Safe joins
- Clean keys in staging: TRIM, UPPER, CAST
- Check uniqueness before joining (using COUNT DISTINCT and anti-joins for duplicates)
Quick duplicate check snippet
SELECT customer_id, COUNT(*) AS cnt
FROM stg_customers
GROUP BY 1
HAVING COUNT(*) > 1;
Worked examples
Example 1: From messy raw to reusable stg_orders
Goal: Create a clean staging model for orders, with typed columns and totals per order.
Step-by-step
- Clean and cast raw data
- Aggregate to one row per order
- Expose only stable, well-named columns
WITH cleaned AS (
SELECT
o.order_id,
CAST(o.order_date AS DATE) AS order_date,
o.customer_id,
CAST(oi.item_amount AS NUMERIC(12,2)) AS item_amount
FROM raw.orders o
JOIN raw.order_items oi ON o.order_id = oi.order_id
),
order_totals AS (
SELECT
order_id,
customer_id,
order_date,
SUM(item_amount) AS total_order_value
FROM cleaned
GROUP BY 1,2,3
)
SELECT
order_id,
customer_id,
order_date,
total_order_value
FROM order_totals;
Example 2: Intermediate model for daily revenue (int_daily_revenue)
Goal: Provide a complete daily time series, including zero-revenue days.
Show example
WITH date_spine AS (
SELECT DATEADD(day, seq, (SELECT MIN(order_date) FROM stg_orders)) AS d
FROM generate_series(0, DATEDIFF(day, (SELECT MIN(order_date) FROM stg_orders), (SELECT MAX(order_date) FROM stg_orders))) t(seq)
),
revenue_by_day AS (
SELECT order_date AS d, SUM(total_order_value) AS revenue
FROM stg_orders
GROUP BY 1
)
SELECT
s.d AS date,
COALESCE(r.revenue, 0) AS revenue
FROM date_spine s
LEFT JOIN revenue_by_day r ON r.d = s.d;
Example 3: Dimensions and facts
Goal: Build dim_customers and fact_orders from staging.
Show example
-- dim_customers
WITH base AS (
SELECT
customer_id,
INITCAP(TRIM(customer_name)) AS customer_name,
UPPER(TRIM(country)) AS country,
LOWER(TRIM(email)) AS email
FROM stg_customers
)
SELECT * FROM base;
-- fact_orders
SELECT
o.order_id,
o.customer_id,
o.order_date,
o.total_order_value
FROM stg_orders o;
Exercises
Try these without peeking. Then compare with the solutions.
Exercise 1: Refactor order totals into a reusable staging model
Take repeated revenue logic used across dashboards and move it into a single, clean model. Use CTEs to organize steps and produce one row per order with a stable schema.
Instructions
Start from this messy pattern used in multiple places:
SELECT o.order_id, o.customer_id, o.order_date, SUM(oi.item_amount) AS total
FROM raw.orders o
JOIN raw.order_items oi ON o.order_id = oi.order_id
GROUP BY 1,2,3;
- Refactor into a model with two CTEs: cleaned and order_totals.
- Cast numeric amounts to NUMERIC(12,2) and order_date to DATE.
- Alias the final columns as: order_id, customer_id, order_date, total_order_value.
Expected outcome
A single query returning one row per order with total_order_value, with explicit casts and column aliases. Suitable to become a view named stg_orders.
Hints
- Put all type casting in the cleaned CTE.
- Aggregate only once in order_totals.
- Final SELECT should expose only four columns in stable order.
Show solution
WITH cleaned AS (
SELECT
o.order_id,
CAST(o.order_date AS DATE) AS order_date,
o.customer_id,
CAST(oi.item_amount AS NUMERIC(12,2)) AS item_amount
FROM raw.orders o
JOIN raw.order_items oi ON o.order_id = oi.order_id
),
order_totals AS (
SELECT
order_id,
customer_id,
order_date,
SUM(item_amount) AS total_order_value
FROM cleaned
GROUP BY 1,2,3
)
SELECT
order_id,
customer_id,
order_date,
total_order_value
FROM order_totals;
Exercise 2: Build a daily revenue model using a date spine
Produce a model that outputs one row per day with total revenue, including days with zero revenue.
Instructions
- Create a date_spine CTE generating all dates between MIN and MAX order_date in stg_orders.
- Aggregate revenue from stg_orders to daily level.
- Left join the spine to the revenue and coalesce nulls to zero.
Expected outcome
A query that returns columns date and revenue, covering every date in the range with 0 where applicable.
Hints
- Use generate_series or your warehouse’s equivalent to build the spine.
- COALESCE handles missing days.
- Group by the date column only once.
Show solution
WITH date_spine AS (
SELECT DATEADD(day, seq, (SELECT MIN(order_date) FROM stg_orders)) AS d
FROM generate_series(0, DATEDIFF(day, (SELECT MIN(order_date) FROM stg_orders), (SELECT MAX(order_date) FROM stg_orders))) t(seq)
),
rev AS (
SELECT order_date AS d, SUM(total_order_value) AS revenue
FROM stg_orders
GROUP BY 1
)
SELECT s.d AS date, COALESCE(r.revenue, 0) AS revenue
FROM date_spine s
LEFT JOIN rev r ON r.d = s.d
ORDER BY 1;
Exercise checklist
- All casts are in staging, not mixed into final SELECTs.
- No SELECT * in final models.
- Each model has one clear responsibility.
- Joins use clean keys from staging.
Common mistakes and how to self-check
- Problem: SELECT * makes schemas fragile. Fix: Explicitly select and alias columns.
- Problem: Repeating logic across models. Fix: Move shared logic to a single upstream model or view.
- Problem: Joining on dirty keys. Fix: Standardize in staging (TRIM/UPPER/CAST) and assert uniqueness.
- Problem: Overloaded models doing cleanup + business rules + metrics. Fix: Split into staging and intermediate layers.
- Problem: Missing zero-activity days in trends. Fix: Use a date spine and left join.
Self-audit snippet pack
-- Column drift check
SELECT column_name FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'your_model';
-- Dupes before joining
SELECT key_col, COUNT(*) FROM stg_table GROUP BY 1 HAVING COUNT(*) > 1;
-- Row-level determinism
SELECT COUNT(*) FROM your_model;
Practical projects
- Project 1: Build stg_customers and stg_orders from raw, with documented, typed columns.
- Project 2: Create int_customer_ltv that aggregates customer lifetime value and the order of first purchase.
- Project 3: Publish dim_customers and fact_orders for BI, and a daily KPI model that feeds a dashboard.
Learning path
- Staging fundamentals: typing, trimming, renaming columns.
- Intermediate modeling: aggregations, deduplication, date spines.
- Dimensions and facts: star-schema friendly models.
- Performance tuning: pruning columns, pre-aggregation, sensible materialization.
Next steps
- Add basic SQL assertions as separate validation queries (e.g., key uniqueness, null checks).
- Document every model with purpose, grain, and column definitions.
- Review dependency graph to ensure clean layering (stg_ → int_ → dim_/fact_).
Mini challenge
You have an int_daily_revenue model and a dim_customers model. Create a model daily_revenue_by_country that outputs date, country, revenue. Requirements: join keys are clean, includes zero-revenue days per country, and column names are explicit. Try to solve it in 25 minutes.
Take the quick test
Everyone can take the test. If you log in, your progress is saved and results are stored on your profile.