Why this matters
As a BI Analyst, you need dashboards and ad‑hoc queries to respond quickly, even on large datasets. Pre‑aggregation turns expensive group-bys over billions of rows into near-instant lookups. This is critical for:
- Daily/weekly dashboards for execs (revenue, conversion, retention).
- Recurring queries with the same dimensions (date, region, product).
- High concurrency (many users hitting the same metrics).
- Cost control by reducing repeated heavy scans.
Real-world task examples
- Prepare daily category-level sales so KPI tiles load in under 2 seconds.
- Precompute DAU/WAU/MAU to avoid re-counting users each time.
- Create weekly rollups by market and channel for finance close.
Concept explained simply
Pre‑aggregation means computing summaries ahead of time and storing them for quick reuse. Instead of summing millions of rows every time, you store totals by a chosen grain (e.g., day x region x product_category).
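For example (table names here are illustrative), the same dashboard question can be answered two ways:
-- Without pre-aggregation: every dashboard load scans and groups the raw fact table
SELECT order_date::date AS day, marketing_channel, SUM(revenue) AS revenue
FROM fact_orders
GROUP BY 1, 2;
-- With pre-aggregation: the same numbers come from a small, pre-summarized table
SELECT day, marketing_channel, revenue
FROM agg_sales_day_channel;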
Mental model
Think of it as packing snacks before a hike. You could cook on the trail (raw tables), but it’s slow and costly. Pre‑aggregation is your prepped snack: a ready-to-eat summary table. The more you pre‑pack (coarser grain), the faster you move—but you have less flexibility if you need something different later.
When to use pre‑aggregation
- Repeated queries share the same GROUP BY dimensions.
- Stable dimensions (date, product hierarchy, geography).
- Some freshness delay is acceptable (minutes to hours).
- High concurrency or cost pressure from heavy scans.
Core strategies
- Summary tables: Precompute metrics by a fixed grain (e.g., day x product_category).
- Materialized views: Engine-managed pre-aggregates with refresh logic (see the sketch after this list).
- Rollups: Multiple levels of granularity (day → week → month; city → region → country).
- Incremental refresh: Only recompute affected partitions (e.g., last 1–7 days).
- Approximate distincts: Use sketches (e.g., HyperLogLog) for fast unique counts.
- Top‑N precomputation: Store top categories/channels per period for instant ranking queries.
- Denormalized aggregates: Join stable dimensions once so queries avoid repeated joins.
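As an illustration of the materialized-view strategy, here is a sketch in PostgreSQL-style syntax (object names are illustrative; other warehouses manage refresh automatically or on a schedule):
-- Engine-managed pre-aggregate: define the query once, refresh on a schedule
CREATE MATERIALIZED VIEW mv_sales_day_category AS
SELECT
    f.order_date::date AS day,
    d.product_category_id,
    SUM(f.revenue) AS revenue
FROM fact_orders f
JOIN dim_product d ON f.product_id = d.product_id
GROUP BY 1, 2;
-- Re-run the stored query and swap in fresh results (scheduled hourly or daily)
REFRESH MATERIALIZED VIEW mv_sales_day_category;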
Trade-offs to remember
- Speed vs flexibility: Coarser grain is faster but less flexible.
- Freshness: Pre-aggregates can be slightly stale; define SLA (e.g., hourly).
- Storage: Aggregates consume space; prune old partitions.
- Complexity: More rollups mean more maintenance; automate and document.
Design checklist (use before building)
- Questions: What exact metrics and dimensions do users need repeatedly?
- Grain: Choose the coarsest grain that still covers every dashboard slice (e.g., day x region x category); anything finer gives away the speed benefit.
- Measures: SUM, COUNT, AVG, MIN/MAX, distinct users (approx if needed).
- Dimensions: Use stable IDs; include hierarchies you will roll up on.
- Refresh: Incremental window + full rebuild cadence; define freshness SLA.
- Partitioning: Typically by date for fast refresh and pruning.
- Validation: Compare sample outputs to raw queries; set tolerance thresholds.
- Naming: Consistent prefixes (agg_, mv_, summary_); add metadata columns (data_start, data_end, last_refreshed_at).
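Pulling the grain, partitioning, and metadata items together, a minimal table skeleton could look like this (PostgreSQL-style declarative partitioning is assumed; table and column names are illustrative):
CREATE TABLE agg_sales_day_region_category (
    day                  date        NOT NULL,  -- grain: day x region x product_category
    region               text        NOT NULL,
    product_category_id  integer     NOT NULL,
    revenue              numeric,
    units                bigint,
    orders               bigint,
    last_refreshed_at    timestamptz            -- metadata column for freshness monitoring
) PARTITION BY RANGE (day);  -- date partitions keep incremental refresh and pruning cheap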
Worked examples
Example 1 — E‑commerce daily revenue by category and channel
Goal: Speed up dashboard widgets aggregating by day, product_category, and marketing_channel.
-- Grain: day x product_category_id x marketing_channel
CREATE TABLE agg_sales_day_cat_channel AS
SELECT
    f.order_date::date AS day,
    d.product_category_id,
    f.marketing_channel,
    SUM(f.revenue) AS revenue,
    SUM(f.qty) AS units,
    COUNT(DISTINCT f.order_id) AS orders
FROM fact_orders f
JOIN dim_product d ON f.product_id = d.product_id
WHERE f.order_date >= CURRENT_DATE - INTERVAL '180 days'
GROUP BY 1, 2, 3;
Refresh: Hourly incremental (recompute last 3 days), plus a weekly full rebuild to correct late-arriving data.
Why it works: Most dashboards slice exactly by these dimensions. Aggregate size drops from hundreds of millions of rows to a few tens of thousands.
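The hourly incremental refresh described above can be sketched as a delete-and-reload of the recent window (a minimal pattern; MERGE or an orchestration tool would also work):
-- Recompute only the last 3 days to pick up late-arriving orders
DELETE FROM agg_sales_day_cat_channel
WHERE day >= CURRENT_DATE - INTERVAL '3 days';
INSERT INTO agg_sales_day_cat_channel
SELECT
    f.order_date::date AS day,
    d.product_category_id,
    f.marketing_channel,
    SUM(f.revenue) AS revenue,
    SUM(f.qty) AS units,
    COUNT(DISTINCT f.order_id) AS orders
FROM fact_orders f
JOIN dim_product d ON f.product_id = d.product_id
WHERE f.order_date >= CURRENT_DATE - INTERVAL '3 days'
GROUP BY 1, 2, 3;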
Example 2 — SaaS DAU/WAU/MAU rollups
Goal: Fast user activity metrics across day, week, and month.
-- Daily active users (DAU) by product_tier
CREATE TABLE agg_dau_tier AS
SELECT
    a.activity_date::date AS day,
    u.product_tier,
    COUNT(DISTINCT u.user_id) AS dau
FROM fact_user_activity a
JOIN dim_user u ON a.user_id = u.user_id
GROUP BY 1, 2;
-- Rollups (week, month): recount distinct users at each level;
-- summing daily DAU would double-count users active on multiple days.
CREATE TABLE agg_wau_tier AS
SELECT date_trunc('week', a.activity_date) AS week, u.product_tier, COUNT(DISTINCT u.user_id) AS wau
FROM fact_user_activity a JOIN dim_user u ON a.user_id = u.user_id
GROUP BY 1, 2;
CREATE TABLE agg_mau_tier AS
SELECT date_trunc('month', a.activity_date) AS month, u.product_tier, COUNT(DISTINCT u.user_id) AS mau
FROM fact_user_activity a JOIN dim_user u ON a.user_id = u.user_id
GROUP BY 1, 2;
Refresh: Refresh the daily table incrementally (last 2 days); after each daily load, rebuild the affected weekly and monthly periods so all three tables stay in sync. Because WAU/MAU must deduplicate users across days, derive them from the activity fact (or a user-grain daily table) rather than by summing DAU.
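If exact distinct counts over long windows get expensive, an approximate variant is a common compromise. A sketch assuming Snowflake-style APPROX_COUNT_DISTINCT (the function name and accuracy vary by warehouse):
-- Approximate MAU: trades a small error (typically a few percent) for a much cheaper computation
CREATE TABLE agg_mau_tier_approx AS
SELECT
    date_trunc('month', a.activity_date) AS month,
    u.product_tier,
    APPROX_COUNT_DISTINCT(a.user_id) AS mau_approx
FROM fact_user_activity a
JOIN dim_user u ON a.user_id = u.user_id
GROUP BY 1, 2;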
Example 3 — Top‑N channels per week
Goal: Show top 5 acquisition channels per region each week instantly.
CREATE TABLE agg_top_channels_week AS
WITH weekly AS (
    SELECT
        date_trunc('week', signup_date) AS week,
        region,
        channel,
        COUNT(*) AS signups
    FROM fact_signups
    GROUP BY 1, 2, 3
)
SELECT * FROM (
    SELECT week, region, channel, signups,
           ROW_NUMBER() OVER (PARTITION BY week, region ORDER BY signups DESC) AS rn
    FROM weekly
) t
WHERE rn <= 5;
Refresh: Weekly rebuild or incremental weekly window. This reduces heavy ranking work at query time.
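At query time, the dashboard then runs a trivial filtered read against the aggregate, for example:
-- Instant ranking lookup: no window function or scan of fact_signups at query time
SELECT week, region, channel, signups
FROM agg_top_channels_week
WHERE region = 'EMEA'   -- example filter value
ORDER BY week DESC, rn;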
Who this is for
- BI Analysts who own dashboards and recurring KPI reports.
- Analytics Engineers optimizing warehouse cost and speed.
- Data-savvy PMs seeking faster self-serve analytics.
Prerequisites
- Comfort with SQL (joins, GROUP BY, window functions).
- Basic understanding of star schemas (facts and dimensions).
- Familiarity with your warehouse partitioning and refresh options.
Try it: Exercises
Do these before the quick test. Use the checklist to self‑review.
- Design a daily sales aggregate — See Exercise 1 below.
- Plan a multi‑level rollup — See Exercise 2 below.
Self‑review checklist
- Grain supports all required dashboard slices.
- Measures are additive or correctly handled (distincts approximated if needed).
- Refresh plan covers late-arriving data.
- Partitions enable fast incremental loads.
- Validation queries compare aggregates to raw for a sample period.
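As an example, a validation query comparing the Example 1 aggregate to the raw fact for yesterday might look like this (the date window and zero-tolerance check are assumptions to adapt):
-- Compare aggregate totals to raw totals for yesterday; the difference should be zero
-- (or within an agreed tolerance for approximate measures)
SELECT
    a.total_revenue AS agg_revenue,
    r.total_revenue AS raw_revenue,
    a.total_revenue - r.total_revenue AS diff
FROM (
    SELECT SUM(revenue) AS total_revenue
    FROM agg_sales_day_cat_channel
    WHERE day = CURRENT_DATE - 1
) a
CROSS JOIN (
    SELECT SUM(revenue) AS total_revenue
    FROM fact_orders
    WHERE order_date::date = CURRENT_DATE - 1
) r;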
Common mistakes and how to self‑check
- Too granular or too coarse: If queries still scan huge aggregates, you picked grain too fine. If users constantly ask for missing breakdowns, you picked too coarse.
- Ignoring late data: Without a reprocessing window, numbers drift. Self‑check: Recompute last N days and compare deltas.
- Unbounded growth: Not pruning old partitions. Self‑check: Track table size; enforce retention.
- Incorrect distinct counts: Summing counts or flags when deduplication is needed. Self‑check: Spot‑check against raw data with an exact COUNT(DISTINCT) for a small time window.
- Double counting after joins: Pre‑join only stable one‑to‑one dimensions. Self‑check: Validate row counts after joins on a sample.
Practical projects
- Build a revenue cube: day x region x product_category with weekly and monthly rollups, plus a freshness dashboard.
- Design a user engagement aggregate (DAU/WAU/MAU) with incremental refresh and anomaly alerts on refresh lag.
- Create a top‑N marketing sources per week aggregate and wire it to a dashboard ranking component.
Learning path
- Foundation: Identify repeated queries and required dimensions.
- Design: Choose grain, measures, and refresh policy; define validation.
- Build: Implement summary tables/materialized views with partitions.
- Scale: Add rollups and approximate distincts; document SLAs.
- Operate: Monitor freshness, size, and query latency; iterate.
Next steps
- Finish the exercises below and validate your outputs.
- Run the Quick Test to confirm understanding.
- Apply one pre‑aggregation to a real dashboard this week and measure latency improvement.
Mini challenge
You’re told a KPI tile is slow: “Monthly revenue by country and device.” Sketch a minimal aggregate that fixes it without overbuilding. Define: grain, measures, refresh window, and how you would validate correctness in 10 minutes.
Exercises (Detail)
Exercise 1 — Daily sales aggregate for the KPI dashboard
Use a fact table like fact_orders(order_id, order_ts, product_id, store_id, marketing_channel, qty, revenue) and dimensions dim_product(product_id, product_category_id) and dim_store(store_id, region). Design a pre‑aggregation that supports dashboard tiles: daily revenue, units, and orders by product category and region, filterable by marketing channel.
Hints
- Pick grain: day x product_category_id x region x marketing_channel (or decide if channel belongs as a filter vs dimension in the table).
- Partition by day; incremental refresh last 3–7 days.
- Validate by comparing a recent day against raw queries.
Exercise 2 — Multi‑level rollup plan
Create daily DAU by product_tier, then roll up to week and month. Add a plan for refresh order and staleness monitoring.
Hints
- Build daily first; derive week and month by recounting distinct users (from the fact or a user-grain daily table), since summing daily DAU double-counts repeat users.
- Incrementally refresh the last 2 days for daily; regenerate downstream rollups after daily updates.
- Expose last_refreshed_at in each table.
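A simple staleness check, assuming each aggregate exposes last_refreshed_at as suggested, could look like this (the 25-hour threshold is an illustrative SLA for a daily job):
-- Flag the table as stale when the last refresh is older than the agreed SLA
SELECT
    'agg_dau_tier' AS table_name,
    MAX(last_refreshed_at) AS last_refreshed_at,
    MAX(last_refreshed_at) < CURRENT_TIMESTAMP - INTERVAL '25 hours' AS is_stale
FROM agg_dau_tier;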