Who this is for
- Junior and mid-level data engineers building batch or streaming pipelines.
- Analytics engineers optimizing warehouse tables.
- Anyone writing data to files (e.g., Parquet on object storage) or warehouse tables who wants faster queries and cheaper scans.
Prerequisites
- Basic SQL (SELECT, WHERE, GROUP BY).
- Familiarity with columnar file formats (e.g., Parquet) or analytics warehouses.
- Understanding of batch vs. streaming ingestion at a high level.
Why this matters
Partitioning is how you split large datasets into smaller, organized chunks. Good partitioning reduces scan cost and speeds up queries. In real data engineering tasks, you will:
- Design tables so common filters (e.g., date, region) prune data quickly.
- Avoid tiny file explosions by tuning partition counts and compaction.
- Handle late-arriving data and backfills without corrupting downstream tables.
- Keep storage costs predictable with retention and lifecycle rules per partition.
Concept explained simply
Partitioning splits a big table into labeled folders or blocks using a key like event_date or country. When queries filter on that key, the engine reads only matching partitions instead of the whole dataset.
Mental model
Imagine a library. Without partitioning, every time you search, you scan every shelf. With partitioning, shelves are labeled by year or subject; you walk straight to the right aisle and read fewer books.
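The same idea in code, as a minimal sketch. PySpark, the /tmp/events path, and the tiny sample rows are all illustrative assumptions, not something prescribed by this lesson:

# Minimal partition-pruning demo (illustrative paths and data).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

events = spark.createDataFrame(
    [("2024-05-01", "US", 1), ("2024-05-02", "CA", 2)],
    ["event_date", "country", "value"],
)

# One folder per event_date value, e.g. /tmp/events/event_date=2024-05-01/
events.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events")

# Filtering on the partition key lets the engine skip the other folders entirely.
day = spark.read.parquet("/tmp/events").where(F.col("event_date") == "2024-05-01")
day.explain()  # the physical plan shows PartitionFilters on event_date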
Key benefits and trade-offs
- Pros: less data scanned, faster queries, simpler retention per partition.
- Cons: too many partitions cause small files; wrong keys cause little pruning; managing late data and schema changes gets trickier.
Core ideas and options
- Horizontal partitioning: split rows by a key (e.g., by day).
- Partition key choices: time (ingest or event time), geography, customer/account, or composite (e.g., dt, country).
- Partition types:
- Range (e.g., daily, monthly time ranges).
- List (e.g., country in [US, CA, MX]).
- Hash (evenly distributes by a key; used to avoid skew).
- Composite (combine, like date + country).
- Related concepts: bucketing/sharding (hash-based sub-splits), clustering/sorting (organize within partitions), compaction (merge small files).
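To make the four partition types concrete, here is a small Python sketch that only builds the folder names each type would produce; the column values and the 16-bucket count are assumptions for illustration:

# Illustrative partition "paths" for range, list, hash, and composite keys.
import hashlib

row = {"event_date": "2024-05-01", "country": "US", "user_id": "u-123"}

range_part = f"event_month={row['event_date'][:7]}"        # range: monthly time range
list_part = f"country={row['country']}"                    # list: enumerated values
bucket = int(hashlib.md5(row["user_id"].encode()).hexdigest(), 16) % 16
hash_part = f"bucket={bucket:02d}"                         # hash: spreads rows evenly, avoids skew
composite = f"event_date={row['event_date']}/country={row['country']}"  # composite: date + country

print(range_part, list_part, hash_part, composite)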
Choosing a partition key
- Pick keys that appear in WHERE clauses often.
- Keep partition cardinality reasonable. Daily partitions work well at millions of rows per day or more; go hourly only if queries need it and the volume supports it.
- Avoid very high-cardinality keys (e.g., user_id) as partitions; consider hash bucketing instead.
- Expect skew. If a few partitions get most of the data, add a secondary key or hash-split hot partitions (e.g., dt + hash(user_id) % 16); see the sketch after this list.
- Plan for late data. Choose event_time vs. ingest_time based on accuracy and lateness behavior; define how to reprocess.
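Sketch for the skew bullet above: a hedged PySpark example that derives a 16-way hash bucket from user_id and uses it as a secondary partition key next to dt. The paths, column names, and bucket count are assumptions:

# Composite dt + hash(user_id) % 16 split for hot partitions (illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("/data/raw/events")   # assumed input location

# Derive a stable 0-15 bucket from user_id; abs() guards against negative hashes.
bucketed = events.withColumn("user_bucket", F.abs(F.hash("user_id")) % 16)

(bucketed.write
    .mode("overwrite")
    .partitionBy("dt", "user_bucket")   # folders like dt=2024-05-01/user_bucket=7/
    .parquet("/data/curated/events"))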
Quick sizing rule of thumb
- Aim for files of ~128–1024 MB each (for columnar formats). Files much smaller than this hurt performance.
- Partitions should contain enough rows to produce several large files, not thousands of tiny ones.
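A quick back-of-the-envelope check of that rule in plain Python; every number here (rows per day, compressed bytes per row, target file size) is an assumption you would replace with your own measurements:

# Rough sizing: how many target-sized files does one daily partition need?
rows_per_day = 2_000_000_000            # e.g., a large web-events feed
bytes_per_row_compressed = 40           # assumed compressed size in a columnar format
target_file_bytes = 512 * 1024**2       # aim inside the 128 MB-1 GB window

daily_bytes = rows_per_day * bytes_per_row_compressed
files_per_daily_partition = max(1, round(daily_bytes / target_file_bytes))
print(f"~{daily_bytes / 1024**3:.0f} GiB/day -> ~{files_per_daily_partition} files per daily partition")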
Worked examples
Example 1 — Web events
Dataset: 2B rows/day, queries filter by event_date and sometimes by country. Choose daily partitions by event_date, with an optional sub-split by country if countries are balanced. Add bucketing by user_id to spread data within partitions if needed.
Why
- Daily is a natural access pattern and keeps partition count manageable.
- Country filter prunes further; if skewed (e.g., US dominant), consider hash splits for US.
Example 2 — Payments
Dataset: 50M rows/day, audits query monthly. Choose monthly partitions by event_month. For fraud analytics that need recency, add clustering by merchant_id within each month rather than partitioning by merchant.
Why
- Audits scan one or a few months; monthly partitions prune most data.
- Merchant is high-cardinality; partitioning by it would explode partitions and create small files.
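A hedged PySpark sketch of this layout, assuming raw payments arrive as Parquet under /data/raw/payments with an event_time column; "clustering" is approximated here by sorting rows within each monthly write:

# Monthly partitions with rows sorted by merchant_id inside each month (illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
payments = (spark.read.parquet("/data/raw/payments")
                 .withColumn("event_month", F.date_format("event_time", "yyyy-MM")))

(payments
    .repartition("event_month")                 # one write group per month
    .sortWithinPartitions("merchant_id")        # helps min/max file skipping on merchant_id
    .write.mode("append")
    .partitionBy("event_month")
    .parquet("/data/fact_payments"))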
Example 3 — IoT telemetry
Dataset: 500M rows/day, late data up to 48 hours. Choose daily partitions by event_date with a late-arrival window overwrite (e.g., reprocess last 2 days hourly). If some devices are extremely chatty, add a hash split inside each day (e.g., hash(device_id)%8).
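One possible shape for that late-arrival reprocessing, sketched in PySpark; the paths, the schema, and the dynamic-partition-overwrite setting are assumptions about a Spark-on-Parquet setup, not requirements:

# Sliding-window backfill: rewrite the last 2 days to catch 48h-late events (illustrative).
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Replace only the partitions present in the incoming data, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

LATE_WINDOW_DAYS = 2   # covers up to 48 hours of lateness

for offset in range(LATE_WINDOW_DAYS + 1):
    day = (date.today() - timedelta(days=offset)).isoformat()
    fresh = (spark.read.parquet("/data/raw/telemetry")          # assumed raw feed with event_date
                  .where(F.col("event_date") == day)
                  .withColumn("device_bucket", F.abs(F.hash("device_id")) % 8))
    (fresh.write
          .mode("overwrite")
          .partitionBy("event_date", "device_bucket")
          .parquet("/data/curated/telemetry"))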
Checklists
Before you write a partitioned table
- Identify top 3 query filters and their selectivity.
- Estimate daily rows and file target size (e.g., 256 MB).
- Choose time zone standard (UTC recommended).
- Define late data policy (how long to backfill).
- Decide retention per partition (e.g., keep 400 days).
During/after the first load
- Validate partition pruning (queries only scan needed partitions).
- Check small files count; add compaction if needed.
- Verify nulls don’t pile into a catch-all partition.
- Document the partition scheme in the table description.
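A hedged PySpark sketch of the pruning, small-files, and null checks; the path, column name, and date literal are assumptions:

# Post-load partition health checks (illustrative path and columns).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/curated/events")

# Rows per partition: spot skew and a suspicious NULL/catch-all bucket.
df.groupBy("event_date").count().orderBy(F.desc("count")).show(20)

# Approximate file count per partition from the file each row came from.
(df.select("event_date", F.input_file_name().alias("file"))
   .groupBy("event_date")
   .agg(F.countDistinct("file").alias("files"))
   .orderBy(F.desc("files"))
   .show(20))

# Pruning check: the plan for a filtered read should show PartitionFilters.
df.where(F.col("event_date") == "2024-05-01").explain()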
Common mistakes (and self-check)
- Over-partitioning: Hourly partitions with low volume. Self-check: Does each partition end up with many tiny files?
- Wrong key: Partition by a column rarely used in filters. Self-check: Do queries still scan most partitions?
- Skew: One partition holds most data. Self-check: Compare row counts per partition; hot partitions should be mitigated by hash splits or different grain.
- Time zone drift: Day boundaries misaligned. Self-check: Are boundaries based on UTC and consistent across pipelines?
- Late data ignored: New rows never land in older partitions. Self-check: Do scheduled backfills rewrite a sliding window?
Hands-on: Exercises
Complete these in order, then compare your answers with the solutions below each exercise.
Exercise 1 — Design a partition plan for Orders
Dataset: ~120M orders/month, steady daily volume, top queries filter by order_date and country. Analytics often look at last 90 days; long-term retention is 2 years.
- Propose a partitioning strategy (keys and grain).
- Estimate how many partitions will exist after 2 years.
- State your late-arriving data policy and compaction approach.
Hints
- Prefer time-based primary partition key.
- Consider whether country is balanced or skewed.
- Think about monthly vs. daily trade-offs.
Solution
Plan: Primary partition by order_date at daily grain. Optional secondary split by country where balanced; if one country dominates, hash-split that country (e.g., country=US/hash(order_id)%8).
Partition count: ~365 × 2 = 730 daily partitions. If you also split by country (~20 countries), create country subfolders only where data exists to avoid an explosion of empty partitions.
Late data: Backfill sliding window of 3 days hourly; overwrite affected partitions with compaction to target ~256–512 MB file sizes.
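The arithmetic behind that plan, sketched in Python with an assumed ~100 compressed bytes per order row; under those numbers a daily partition still yields a single healthy-sized file:

# Sizing check for the Orders plan (byte figures are assumptions).
orders_per_day = 120_000_000 // 30        # ~4M orders/day
bytes_per_row = 100                       # assumed compressed bytes per order row
day_bytes = orders_per_day * bytes_per_row
target_file_bytes = 512 * 1024**2
files_per_day = max(1, round(day_bytes / target_file_bytes))
print(f"{day_bytes / 1024**2:.0f} MB/day -> {files_per_day} file(s) per daily partition")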
Exercise 2 — Write a table DDL (generic)
Create a partitioned fact table for page views with columns: event_time TIMESTAMP, event_date DATE, country STRING, user_id STRING, url STRING. Use daily partitions by event_date. Add a note on compaction.
Hints
- Make event_date a first-class partition column.
- Document expectations: UTC, late data window, target file size.
Solution
-- Option A: Warehouse-style table with date partition column
CREATE TABLE fact_page_views (
event_time TIMESTAMP,
event_date DATE, -- UTC date derived from event_time
country STRING,
user_id STRING,
url STRING
)
-- Partitioning conceptually by event_date (vendor syntax varies)
;
-- Load guidance (platform-agnostic):
-- 1) Write data grouped by event_date (one batch per day).
-- 2) Compact after load to ~256–512 MB files per partition.
-- 3) Backfill last 2 days every hour to catch late events.
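The load guidance above, sketched as a hedged PySpark write that groups rows by day before writing; the raw JSON path, the UTC session time zone, and the output location are assumptions:

# Daily-partitioned write for fact_page_views (illustrative paths).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

page_views = (spark.read.json("/data/raw/page_views")                    # assumed raw feed
                   .withColumn("event_date", F.to_date("event_time")))   # assumes a UTC session time zone

(page_views
    .repartition("event_date")          # one write group per day -> fewer, larger files
    .write
    .mode("append")
    .partitionBy("event_date")
    .parquet("/data/fact_page_views"))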
Exercise 3 — Fix a small-files problem
You wrote hourly partitions for a low-volume dataset. Each hour has 5–20 small files (~5 MB). Queries are slow.
- Propose a new partitioning scheme.
- Describe a compaction job to fix existing data.
- Explain how to prevent regression.
Hints
- Consider moving to daily partitions.
- Use merge/compact jobs to rewrite per partition.
- Control writer parallelism or use coalesce before write.
Solution
Scheme: Switch to daily partitions by event_date. Hourly partitioning is too fine for the volume.
Compaction: For each day, read all files, sort/cluster as needed, and write larger files (256–512 MB). Replace atomically per day.
Prevention: Batch writes by day, cap parallel writers, and run a scheduled compaction for yesterday/today.
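A hedged PySpark sketch of the per-day compaction; the paths, the assumed 5 GiB/day volume, and the staging-then-swap approach are illustrative choices, not the only way to do it:

# Compact one day's partition into ~512 MB files (illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

day = "2024-05-01"
src = spark.read.parquet("/data/events").where(F.col("event_date") == day)

# Size the output: assume ~5 GiB for this day (measure in practice) and ~512 MB files.
day_bytes = 5 * 1024**3
n_files = max(1, day_bytes // (512 * 1024**2))

staging = f"/tmp/events_compacted/event_date={day}"
(src.drop("event_date")             # the value lives in the folder name after the swap
    .coalesce(n_files)
    .write
    .mode("overwrite")
    .parquet(staging))
# Finally, swap the staging folder in for /data/events/event_date=<day>;
# how atomic that swap is depends on the storage layer.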
Mini challenge
You maintain a 3-year clickstream table. Analysts filter by event_date and sometimes utm_source. utm_source has 5 popular values and 5,000 rare ones. Propose a scheme that gives good pruning without exploding partitions. Write it in one sentence.
Sample answer
Daily partitions by event_date; within each day, optionally list-partition or cluster only on the top 5 utm_source values and treat the long tail via hashing or clustering (not as partitions).
Learning path
- Start here: basics of partition keys, pruning, and file sizing.
- Learn about bucketing and clustering to complement partitioning.
- Master compaction strategies and late data backfills.
- Explore incremental modeling (bronze/silver/gold) and how partitions propagate through layers.
- Practice with streaming windows and partitioned sinks.
Practical projects
- Build a daily-partitioned Parquet dataset for a public log file, including a compaction job and a 3-day backfill loop.
- Design a payment fact table partitioned monthly with a retention policy and a data-quality check that validates partition completeness.
- Migrate an hourly-partitioned dataset to daily partitions; measure query scan reduction and file count before/after.
Next steps
- Refine: Introduce secondary hash splits for hot partitions.
- Automation: Add scheduled compaction and partition-health dashboards (row counts, file counts, skew).
- Reliability: Define clear SLAs for late data and backfills.
Quick Test
Take the quick test below to check your understanding.