Table Optimization And Statistics

Learn Table Optimization And Statistics for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Who this is for

Data Platform Engineers and Data Engineers who design, maintain, and tune data warehouse tables so analytics queries run fast and reliably.

Prerequisites

  • Comfortable with SQL (SELECT, JOIN, GROUP BY).
  • Basic understanding of columnar warehouses and file formats (e.g., Parquet/ORC) conceptually.
  • Familiar with ETL/ELT data flows and batch ingestion.

Why this matters

Real tasks you will face:

  • Deciding partition and clustering keys to cut scan costs.
  • Compacting thousands of small files into optimal sizes.
  • Refreshing table statistics so the optimizer picks good join strategies.
  • Reducing skew and hotspots that slow queries and pipelines.
  • Keeping materialized aggregates fresh and appropriately sized.

Fast mental check: Are we I/O-bound or planning-bound?

If queries read huge amounts of data, focus on partitioning, clustering, and file sizes. If the plan looks odd (wrong joins, huge shuffles), refresh statistics and verify histograms/NDV quality.

Concept explained simply

Table optimization ensures data is stored and organized so the warehouse can skip as much as possible and compress the rest well. Statistics tell the optimizer how much data to expect; good stats lead to good plans.

Mental model

Think of your table like a library:

  • Partitioning = putting books into labeled rooms by date/region.
  • Clustering/Sorting/Z-order = arranging shelves so ranges are quick to find.
  • File sizing/compaction = making each shelf neither too tiny nor too heavy.
  • Compression/encodings = packing books efficiently.
  • Statistics = a catalog that says how many books per topic/author.

Key techniques

1) Partitioning (a high-impact first cut)

  • Choose a column commonly filtered or used for data retention (often a date).
  • Avoid over-partitioning: too many tiny partitions slow planning and listing.
  • Typical guidance: daily partitioning for large, continuously ingested fact tables; monthly if daily volume is low.

How to sanity-check a partition key
  • High selectivity in filters (e.g., last 7 days).
  • Even distribution across values (avoid all data in one partition).
  • Predictable data arrival (ingestion aligns with partitions).
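
A minimal sketch of a daily-partitioned fact table, assuming Spark-style SQL and hypothetical table/column names; the exact PARTITIONED BY syntax and storage options differ per engine:

-- Hypothetical daily-partitioned events table (Spark-style SQL; adapt to your engine)
CREATE TABLE my_db.events (
  event_id   BIGINT,
  user_id    BIGINT,
  event_type STRING,
  event_date DATE
)
USING PARQUET
PARTITIONED BY (event_date);

Queries that filter on event_date can then prune whole partitions instead of scanning the full table.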

2) Clustering/Sort keys (and Z-order-like data skipping)

  • After partitioning, cluster by high-cardinality columns used in range filters, joins, or common group-bys.
  • Aim for data skipping: keep similar values together so the engine can prune chunks.
  • Update clustering periodically as data grows (re-cluster/optimize).
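
The mechanics are engine-specific; two common variants, shown with hypothetical table names, are Snowflake-style clustering keys and Delta Lake-style Z-ordering:

-- Snowflake-style: declare a clustering key; the service re-clusters in the background
ALTER TABLE my_db.events CLUSTER BY (user_id, event_type);

-- Delta Lake-style: rewrite data files so similar values are co-located
OPTIMIZE my_db.events ZORDER BY (user_id, event_type);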

3) File size and compaction

  • Small files hurt performance (excess metadata, file listing, and per-task overhead).
  • Target medium-large files (e.g., 128–1024 MB per file, engine-dependent).
  • Schedule compaction/OPTIMIZE jobs to merge small files and rebuild clustering.
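
A sketch of a scheduled compaction job, assuming Delta Lake-style OPTIMIZE and a hypothetical events table; some warehouses (e.g., BigQuery, Snowflake) compact automatically and need no such job:

-- Compact only recent partitions to limit rewrite cost (Delta Lake-style syntax)
OPTIMIZE my_db.events
WHERE event_date >= date_sub(current_date(), 7);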

4) Compression and column encodings

  • Use columnar formats and let the warehouse choose default encodings, but verify the largest and hottest columns.
  • Low-cardinality columns compress well; pre-sort can improve compression.
  • Re-write tables when changing encodings/ordering for best effect.
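
A common pattern is rewriting the table with an explicit sort so runs of similar values compress better; a sketch with hypothetical names, assuming Spark-style CREATE TABLE AS SELECT:

-- Rewrite into a sorted copy (then swap it in); pre-sorting improves run-length and dictionary encoding
CREATE TABLE my_db.sales_sorted
USING PARQUET
PARTITIONED BY (order_date)
AS
SELECT * FROM my_db.sales
ORDER BY customer_id, sku_id;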

5) Table statistics: what and how

  • Row counts, null counts, min/max, NDV (number of distinct values), histograms.
  • These guide the optimizer to pick joins, scans, and parallelism wisely.
  • Refresh after large loads, schema changes, or data reorganization.

Typical stat refresh operations (generic SQL)
-- After major load/compaction
ANALYZE TABLE my_db.sales COMPUTE STATISTICS;
ANALYZE TABLE my_db.sales COMPUTE STATISTICS FOR COLUMNS customer_id, order_date, region;

Names differ per engine, but the intent is the same: recompute table and column stats.
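
To verify what the optimizer actually sees, most engines expose computed stats through a describe or show command; two hedged examples (names differ per engine):

-- Spark-style: show column-level stats (min/max, NDV, null count) if they have been computed
DESCRIBE EXTENDED my_db.sales customer_id;

-- Trino-style equivalent
SHOW STATS FOR my_db.sales;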

6) Index-like structures and data skipping

  • Warehouses avoid heavy indexes; they rely on metadata (min/max, bloom filters) and clustering.
  • Ensure data skipping metadata is present by compacting and maintaining clustering.

7) Materialized aggregates

  • Use materialized views or summary tables for repetitive, expensive aggregations.
  • Keep them small and well-clustered; refresh on a cadence aligned to SLAs.
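
A minimal sketch of a materialized daily summary, assuming generic SQL and hypothetical table/column names; refresh semantics (automatic vs. manual REFRESH) differ per engine:

-- Pre-aggregated revenue by day and region for dashboard queries
CREATE MATERIALIZED VIEW my_db.daily_revenue AS
SELECT order_date, region, SUM(amount) AS revenue
FROM my_db.sales
GROUP BY order_date, region;

If materialized views are unavailable, a scheduled CREATE TABLE AS SELECT into a summary table achieves the same effect.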

8) Handling skew and hotspots

  • Use composite clustering keys (e.g., date + customer_id) if a single key is skewed.
  • Consider salting/bucketing to distribute heavy keys.
  • Validate with percentiles and top-N distributions.
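
A hedged sketch of salting with hypothetical column names: the heavy key is split into a fixed number of buckets so one customer's rows spread across many tasks (joins and aggregations must then account for the salt):

-- Derive a salted key with 8 buckets for heavy customers
SELECT
  t.*,
  CONCAT(CAST(t.customer_id AS STRING), '_',
         CAST(ABS(HASH(t.ticket_id)) % 8 AS STRING)) AS salted_customer_id
FROM my_db.support_tickets AS t;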

Worked examples

Example 1: Event table with time-based queries

Scenario: 1B rows/month, most queries filter last 7–30 days. Current table is unpartitioned; scans are huge.

  • Action: Partition by event_date (daily). Cluster by user_id, then event_type.
  • Compaction weekly to maintain ~512 MB files.
  • Refresh stats after each backfill or compaction job.
  • Result: Queries restricted to last 7–30 days scan only a few partitions with strong data skipping.

Example 2: Sales fact joins are slow after big load

Scenario: A large weekly batch doubled table size; join to customers became nested-loop-like and slow.

  • Action: Compute table and column stats (especially customer_id and order_date).
  • If available, re-cluster by order_date, customer_id after compaction.
  • Result: Optimizer picks a better hash join and prunes more data by date.

Example 3: Skewed customer_id in support tickets

Scenario: 5 large enterprise customers produce 40% of records; queries often filter by ticket_date and customer_id.

  • Action: Partition by ticket_date (daily). Cluster by customer_tier, customer_id to spread heavy hitters.
  • Introduce salting for the very largest customers if needed (e.g., customer_id || salt).
  • Result: Balanced distribution, improved parallelism, fewer straggler tasks.

Step-by-step optimization playbook

Step 1 — Profile usage

  • List top filters (columns, time windows).
  • Identify join keys and group-by columns.
  • Check query scan sizes and durations.
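
Most warehouses expose a query history or jobs view you can mine for the heaviest scans; a sketch assuming Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view (BigQuery's INFORMATION_SCHEMA.JOBS and other engines differ in names and columns):

-- Last week's most expensive queries by bytes scanned (Snowflake-style)
SELECT query_text, bytes_scanned, total_elapsed_time
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY bytes_scanned DESC
LIMIT 20;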

Step 2 — Pick partitioning

  • Prefer date/ingest_time for facts if time-based filters dominate.
  • Avoid high-cardinality partition keys with millions of values.

Step 3 — Choose clustering/sort keys

  • Use high-cardinality, frequently filtered columns.
  • Order keys by selectivity and common query patterns.

Step 4 — Fix file sizes

  • Compact to merge small files; target 128–1024 MB per file (engine-dependent).
  • Schedule periodic optimize jobs.

Step 5 — Refresh stats

  • Compute table and column stats after bulk loads and table rewrites.
  • Verify histograms/NDV are reasonable for key columns.

Step 6 — Validate

  • Re-run key queries; compare bytes scanned and elapsed time before vs after.
  • Capture explain plans and ensure expected joins are chosen.
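
A simple way to capture the plan, assuming a hypothetical sales table; EXPLAIN output formats vary, but look for the join strategy and estimated rows/bytes:

-- Compare the plan (join type, estimated scan) before and after the layout/stats change
EXPLAIN
SELECT region, SUM(amount) AS revenue
FROM my_db.sales
WHERE order_date >= DATE '2026-01-01'
GROUP BY region;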

Exercises

Exercise 1 — Pick partition and clustering keys

You have a fact_sales table with stats:

  • Rows: 2.4B
  • order_date min/max: 2022-01-01 to 2026-01-10
  • NDV customer_id: ~8.5M
  • NDV region: 12
  • NDV sku_id: ~450k
  • Common filters: WHERE order_date BETWEEN last_30_days; WHERE region IN (...); WHERE customer_id = ?

Decide on partitioning and clustering. Explain trade-offs.

Tip

Favor a date partition if most queries are time-bound. Choose clustering for high-cardinality filters used often.

Exercise 2 — Small files and stale stats

Your ingestion produced 200k files of 5 MB each in the last month’s partitions. Recent queries slowed down and scans increased.

  • Propose an optimize/compact plan (how often, target size).
  • Specify when to refresh stats and which columns to include.

Hint

Compact first, then refresh stats so the metadata reflects the new layout.

Exercise 3 — Skewed key mitigation

support_tickets has these observations:

  • Top 10 customers = 38% of rows
  • ticket_date filters are common; customer_id filters are common
  • Straggler tasks appear on big customers

Propose a layout (partition and clustering), and one method to reduce skew. How will you validate the improvement?

Hint

Composite clustering (date + a tier/id), or salting on the heaviest keys. Validate with percentile task time and bytes-per-task distributions.

Checklist for your answers

  • Partition choice aligns with common filters and data arrival.
  • Clustering keys target high-cardinality, commonly filtered columns.
  • Compaction settings produce fewer, larger files.
  • Stats refresh includes key columns and happens after layout changes.
  • Validation uses before/after scans, plan changes, and task metrics.

Common mistakes and self-check

  • Mistake: Over-partitioning by high-cardinality keys. Self-check: Count partitions; do most queries touch dozens or hundreds of them unnecessarily?
  • Mistake: Never refreshing stats. Self-check: After major loads, do plans suddenly choose poor joins?
  • Mistake: Ignoring small-file problems. Self-check: Are there thousands of tiny files in hot partitions?
  • Mistake: Clustering on low-cardinality columns (e.g., region with only 12 distinct values). Self-check: Does clustering actually improve data skipping?
  • Mistake: Fixing layout but not validating. Self-check: Compare bytes scanned and duration before/after on representative queries.

Practical projects

  • Project A: Take one large fact table, implement date partitioning and a clustering key, compact files, refresh stats, and document before/after query metrics.
  • Project B: Identify a skewed dimension, introduce salting or composite clustering, and measure task time variance reduction.
  • Project C: Build a small materialized summary table for a top dashboard and define a refresh and stats policy.

Learning path

  • Before this: Columnar storage fundamentals, query planning basics.
  • This topic: Table optimization (partitioning, clustering), file management, and statistics.
  • Next: Materialized views, caching strategies, and cost-aware query design.

Mini challenge

Pick a slow, high-impact query. Suggest one change to table layout and one stats refresh you would perform. Predict the effect on scanned bytes, join type, and runtime.

Answer guide

Recommend a date partition if time-filtered, add a clustering key aligned to the main filter or join, compact files to medium-large sizes, and refresh stats on key columns. Expect reduced scan bytes, a better hash join choice, and lower runtime.

Next steps

  • Run the exercises on a staging environment.
  • Capture explain plans and scan metrics for 3 representative queries.
  • Schedule periodic compaction and stats refresh aligned with ingestion.

Practice Exercises

Instructions

Using the stats provided in Exercise 1, decide on an appropriate partition column and clustering keys for fact_sales. Explain why you did not choose the other candidates and how you will validate the improvement.

Expected Output
A clear choice (e.g., partition by order_date daily; cluster by customer_id, sku_id) with reasoning and a validation plan comparing bytes scanned, runtime, and plan changes.
