Why this matters
Indexing and its warehouse equivalents (clustering, partitioning, sort keys, data skipping) can make analytical queries dramatically faster (often 10x–100x), reduce compute costs, and keep dashboards snappy. As a Data Architect, you decide how facts and dimensions are organized so that joins, filters, and time-window aggregations run efficiently.
- Real tasks you will face:
- Designing partitioning and clustering for large fact tables by date and key columns.
- Choosing when to add a B-tree or bitmap index to dimension tables for fast lookups.
- Mapping vendor features: indexes vs sort keys, distribution/partitioning, clustering keys, and data skipping/zone maps.
- Balancing read speed with load/CDC performance and storage cost.
Who this is for and prerequisites
- Who this is for: Data Architects, Analytics Engineers, and Data Engineers designing star/snowflake schemas in analytical warehouses.
- Prerequisites:
- Comfort with SQL joins, filters, aggregations.
- Basic understanding of fact/dimension modeling.
- High-level familiarity with at least one warehouse (e.g., columnar or MPP system) or a row-store database.
Concept explained simply
An index is a fast lookup structure to find rows without scanning the whole table. In analytic warehouses, you often get similar speedups through partitioning, clustering/sort keys, and data skipping rather than traditional indexes on huge fact tables.
- Row-store systems (e.g., classic OLTP/OLAP RDBMS): use indexes like B-tree, hash, or bitmap to accelerate joins and filters.
- Columnar warehouses (e.g., many cloud warehouses): rely on partitioning (by date), clustering/sort keys (ordering data), distribution keys, and metadata (zone maps, micro-partitions) to skip reading irrelevant blocks.
Mental model
- Think of the table as a large library:
- Partitioning = shelving by year/month, so you only walk to the right shelf.
- Clustering/Sort keys = sorting books within a shelf by author/title so you grab a tight range instead of flipping every book.
- Indexes (row-store) = a detailed card catalog for fast lookups.
- Data skipping/zone maps = sticky notes on shelves listing the min/max values stored there, so you know which shelves to ignore.
Core concepts and when to use them
- B-tree index: best for equality and range queries on selective columns. Great on dimension tables (e.g., dim_customer.email).
- Bitmap index: efficient for low-cardinality columns (e.g., active_flag). Common in some analytical engines; avoid on heavy-write tables.
- Composite/covering index: order columns to match your common WHERE and JOIN clauses, leading with equality predicates, then ranges; include (if supported) projected columns so the index covers the query. See the sketch after this list.
- Partitioning: split large tables by date or another monotonic key to enable pruning. Essential for very large fact tables.
- Clustering/Sort keys: physically order data on frequently filtered/joined columns (often date + foreign keys) to improve locality and data skipping.
- Distribution/Shard keys: co-locate data to minimize cross-node shuffles on joins.
- Zone maps/Data skipping/Bloom filters: metadata that helps skip blocks or rows that cannot match a predicate.
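As a quick illustration of the composite-index rule above, here is a minimal sketch; the table and column names (orders, status, created_at) are illustrative, not from a specific schema:
-- Equality column (status) leads; range column (created_at) follows
CREATE INDEX ix_orders_status_created ON orders(status, created_at);
-- A query like WHERE status = 'OPEN' AND created_at >= DATE '2024-01-01'
-- can seek to the matching index prefix instead of scanning the table.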
Platform mapping cheatsheet
- Row-store (e.g., traditional RDBMS): create indexes on dims; be cautious indexing very large facts due to write cost.
- Columnar MPP (generic): prefer partition by date on facts; add clustering/sort keys on (date, join_key); use distribution/shard keys to reduce shuffles; leverage zone maps/bloom/data skipping.
- Some warehouses support optional clustering keys on large tables, which you can define or adjust without changing the table schema, as sketched below.
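To make that last point concrete: in Snowflake, for example, a clustering key can be set or dropped with ALTER TABLE. The table and key names here are illustrative:
-- Define or change the clustering key without rebuilding the table
ALTER TABLE fact_orders CLUSTER BY (order_date, customer_id);
-- Drop it later if reclustering cost outweighs the benefit
ALTER TABLE fact_orders DROP CLUSTERING KEY;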
Worked examples
Example 1: Dimension lookup index (row-store style)
Problem: BI joins facts to dim_customer by customer_id and sometimes filters by email.
-- Dimension table index for fast lookup by business key
CREATE INDEX ix_dim_customer_email ON dim_customer(email);
-- Composite index to support common access pattern
CREATE INDEX ix_dim_customer_country_city ON dim_customer(country_code, city);
Result: quick lookups by email and improved locality for country/city filters. Keep indexes lean to avoid slowing down loads.
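Sample queries these indexes are meant to serve; the literal values and the full_name column are illustrative:
-- Point lookup served by ix_dim_customer_email
SELECT customer_id, full_name FROM dim_customer WHERE email = 'a@example.com';
-- Prefix match served by ix_dim_customer_country_city
SELECT customer_id FROM dim_customer WHERE country_code = 'US' AND city = 'Austin';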
Example 2: Large fact with partition + clustering (columnar)
Problem: Queries filter by order_date and customer_id. You want pruning and tight ranges.
-- Create table partitioned by order_date and clustered by (customer_id, product_id)
-- Syntax varies by platform; conceptually:
CREATE TABLE fact_orders (
order_id BIGINT,
order_date DATE,
customer_id BIGINT,
product_id BIGINT,
quantity INT,
amount NUMERIC(12,2)
)
PARTITION BY (order_date)
CLUSTER BY (customer_id, product_id);
Effect: date pruning scans only relevant partitions; clustering improves data skipping on customer/product filters.
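A representative query against this layout; the date bounds and customer ID are placeholders:
-- Partition pruning limits the scan to the March partitions;
-- clustering narrows the blocks read to the requested customer
SELECT product_id, SUM(amount)
FROM fact_orders
WHERE order_date BETWEEN DATE '2024-03-01' AND DATE '2024-03-31'
  AND customer_id = 42
GROUP BY product_id;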
Example 3: Sort/distribution keys for frequent joins (MPP)
Problem: You often join fact_events to dim_user on user_id and filter last 30 days.
-- Conceptual MPP setup (Redshift-style DISTKEY/SORTKEY; adjust for your platform)
CREATE TABLE fact_events (
event_ts TIMESTAMP,
event_date DATE,
user_id BIGINT,
event_type VARCHAR,
payload SUPER -- semi-structured type; name varies by platform (SUPER, VARIANT, JSON)
)
DISTKEY(user_id) -- co-locate by user
SORTKEY(event_date, user_id); -- order for pruning & range scans
Effect: Less network shuffle during joins, better pruning by recent dates, and tighter scans around user ranges.
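For the full benefit, the dimension side of the join can be distributed on the same key. A minimal sketch in the same conceptual syntax; the column list is illustrative:
-- Co-locating dim_user on user_id lets the join run node-locally
CREATE TABLE dim_user (
user_id BIGINT,
signup_date DATE,
country_code VARCHAR
)
DISTKEY(user_id)
SORTKEY(user_id);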
Example 4: Covering index for dashboard (row-store)
Problem: A dashboard repeatedly runs: WHERE status='PAID' AND order_date BETWEEN ... GROUP BY customer_id.
-- Composite index ordered by filter columns, including group-by
CREATE INDEX ix_fact_orders_paid_date_customer
ON fact_orders(status, order_date, customer_id)
INCLUDE (amount);
Effect: The query can be satisfied via index-only access in engines that support INCLUDE/covering, reducing table reads.
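For reference, the dashboard query this index targets; the date bounds are placeholders:
-- Every referenced column lives in the index, so engines with covering
-- support can answer this without touching the base table
SELECT customer_id, SUM(amount)
FROM fact_orders
WHERE status = 'PAID'
  AND order_date BETWEEN DATE '2024-01-01' AND DATE '2024-03-31'
GROUP BY customer_id;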
Design steps you can follow
- List top queries: common filters, joins, ranges, and group-bys.
- Choose partition column (usually date on big facts).
- Pick clustering/sort keys to align with filter + join columns.
- Add selective indexes to small/medium dimension tables.
- Validate with EXPLAIN/plan output and scan metrics (see the sketch below); iterate.
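A validation sketch for the last step; exact plan output varies by engine, so treat this as illustrative (PostgreSQL-style shown):
-- Look for index/index-only scans and partition pruning in the plan
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM fact_orders
WHERE status = 'PAID' AND order_date >= DATE '2024-03-01'
GROUP BY customer_id;
-- In columnar warehouses, compare bytes scanned and partitions pruned
-- in the query profile instead of reading a row-store plan.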
Exercises
Note: Everyone can access the quick test and exercises for free. Only logged-in users will have their progress saved.
Exercise 1 — Design partitioning/clustering for query patterns
You have a star schema: fact_sales (200B rows), dim_customer (10M rows), dim_product (500K rows). Typical queries:
- Q1: WHERE sale_date BETWEEN ? AND ? AND customer_id IN (...)
- Q2: WHERE sale_date = ? AND product_id = ?
- Q3: Join fact_sales to dim_customer on customer_id and filter dim_customer.country_code = 'US'
Propose DDL that enables partition pruning and data skipping. Then add a supporting index on dim_customer for the country filter.
Exercise 2 — Create a covering index (row-store)
Given a dashboard query: SELECT customer_id, SUM(amount) FROM fact_payments WHERE status='SETTLED' AND paid_at BETWEEN ? AND ? GROUP BY customer_id;
Design a composite/covering index to speed it up. Justify column order and any included columns.
- Work through the tasks above, then compare with the solution sketches below:
Exercise 1 solution sketch
-- Fact table: partition by date, cluster/sort by (customer_id, product_id)
CREATE TABLE fact_sales (
sale_id BIGINT,
sale_date DATE,
customer_id BIGINT,
product_id BIGINT,
amount NUMERIC(12,2),
tax NUMERIC(12,2)
)
PARTITION BY (sale_date)
CLUSTER BY (customer_id, product_id);
-- Dimension index to support country filters and joins
CREATE INDEX ix_dim_customer_country
ON dim_customer(country_code, customer_id);
-- Rationale:
-- - Partition by sale_date enables pruning for Q1/Q2.
-- - Clustering by (customer_id, product_id) improves data skipping for Q1/Q2.
-- - Country filter is in dim; indexing (country_code, customer_id) speeds join/filter for Q3.
Exercise 2 solution sketch
-- Equality filter (status) first, then the range (paid_at), then the group-by column
CREATE INDEX ix_fact_payments_status_paidat_cust
ON fact_payments(status, paid_at, customer_id)
INCLUDE (amount);
-- Rationale:
-- - status first for equality; paid_at second for range; customer_id supports grouping.
-- - INCLUDE(amount) helps engines perform index-only scans for the SUM.
Self-check checklist
- Did you partition large facts by a column that appears in nearly all time-based filters?
- Do clustering/sort keys reflect the common filter + join columns?
- Are indexes placed on dimensions (not giant facts) where they deliver the most value?
- Did you test with EXPLAIN/scan metrics and confirm partition pruning and data skipping?
- Are write/load costs acceptable after adding these structures?
Common mistakes and how to self-check
- Indexing huge fact tables like OLTP systems. Better: partition + cluster; use indexes primarily on dimensions.
- Choosing the wrong partition key. Symptom: queries still scan most partitions. Fix: partition by the most common time filter (often date).
- Over-indexing. Symptom: slow loads and higher storage. Fix: keep only indexes that measurably reduce scan time.
- Misordered composite indexes. Rule: put equality filters first, then range, then join/group-by columns.
- Ignoring distribution/shard keys in MPP. Symptom: massive shuffles. Fix: co-locate by frequent join keys.
- Not updating stats or reclustering when data skews. Fix: schedule maintenance jobs appropriate to your platform (examples below).
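Hedged examples of such maintenance commands; which one applies depends on your platform:
-- PostgreSQL/Redshift-style: refresh optimizer statistics after large loads
ANALYZE fact_orders;
-- Redshift: re-sort rows into sort-key order (recent versions automate much of this)
VACUUM fact_orders;
-- Snowflake reclusters automatically once a clustering key is defined.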
Practical projects
- Star schema accelerator: Take a medium dataset (fact + 2 dims). Implement two designs: (a) no partition/cluster, (b) partition by date + cluster by common join key. Compare scan bytes and runtime for 5 queries.
- Dimension lookup tuning: Add and remove different dimension indexes; measure query performance and load impact; write a short tuning report.
- Skew audit: Detect data skew on a key; redesign distribution/clustering to reduce shuffles; validate with execution plans.
Learning path
- Before this: Logical data modeling, star schemas, and query profiling basics.
- Now: Indexing concepts for warehouses (this page).
- Next: Table partitioning strategies, data distribution/sharding, and storage layout tuning per platform.
Next steps
- Apply the checklist to one of your existing tables; measure performance before/after.
- Complete the exercises above and take the quick test below.
- Document your chosen partition, clustering, and indexing standards for your team.
Mini challenge
You inherit a warehouse where a 90-day sales dashboard scans the entire fact table daily. In one paragraph, propose a minimal, low-risk change to reduce scan cost by 80% within a day. Mention which structures you would add or adjust and how you would validate success.
Quick Test
Take the quick test below to check your understanding. Available to everyone for free. Only logged-in users will have results saved to their profile.