Why this matters
As a Data Platform Engineer, you design how data lands and is read at scale. Partitioning reduces the amount of data scanned, and compaction keeps files healthy and fast to read. Together they cut query costs, prevent small-files overload, and ensure predictable performance.
- Real tasks you will do: choose partition columns for new tables, set target file sizes, schedule compaction jobs, tune read performance, and fix skewed partitions.
- Impact: minutes saved per query across many users, lower storage and compute bills, and fewer incidents due to overloaded metadata or tiny file storms.
Concept explained simply
Partitioning splits a table into folders by key columns (for example, dt=2026-01-11/hr=09). Engines skip partitions that do not match query filters, so they scan less data. Compaction merges many small files into fewer, larger files, so metadata listing is faster and readers do less work per query.
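To make this concrete, here is a minimal PySpark sketch, assuming Spark and Parquet on object storage; the bucket paths, column names, and the app name are illustrative, not a reference implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

# Raw events with a 'ts' timestamp column (illustrative source path).
events = spark.read.json("s3://my-bucket/raw/events/")
events = (events
          .withColumn("dt", F.to_date("ts"))                 # partition column: date
          .withColumn("hr", F.date_format("ts", "HH")))      # partition column: hour

# Writing with partitionBy produces folders like dt=2026-01-11/hr=09/.
(events.write
       .partitionBy("dt", "hr")
       .mode("append")
       .parquet("s3://my-bucket/tables/events/"))

# Filtering on dt/hr lets the engine skip folders that cannot match (partition pruning).
one_hour = (spark.read.parquet("s3://my-bucket/tables/events/")
                 .filter("dt = '2026-01-11' AND hr = '09'"))
```

The pruning happens because the filter columns are the same columns encoded in the folder names; a filter on a non-partition column would still scan every folder.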
Mental model
Imagine a library:
- Partitioning is shelving by topic and date so you walk straight to the right aisle.
- Compaction is binding loose pages into proper books so you don’t shuffle through scraps.
Key terms
- Partition pruning: the engine excludes irrelevant partitions based on your WHERE clause.
- Small-files problem: too many tiny files overwhelm metadata and slow readers.
- Minor compaction: merge small files within a partition or small scope.
- Major compaction: rewrite larger scopes (whole partitions/table) to uniform target sizes.
- Target file size: typically 128–1024 MB for columnar files; pick based on engine and workload (a quick sizing sketch follows this list).
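As a quick illustration of how a target file size translates into output file counts, here is a back-of-the-envelope calculation; the partition size and 256 MB target are assumed numbers for illustration, not recommendations.

```python
import math

# Assumed numbers: ~40 GB landing in one dt partition, aiming for 256 MB Parquet files.
partition_bytes = 40 * 1024**3
target_file_bytes = 256 * 1024**2

num_files = max(1, math.ceil(partition_bytes / target_file_bytes))
print(num_files)  # 160 files of roughly 256 MB each
```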
Key ideas you will reuse
- Choose partition columns with low-to-medium cardinality that match frequent filters (commonly event date, hour, region, or tenant).
- Use multi-level partitioning (e.g., dt, then hr) to balance file counts and scan reduction.
- Avoid partitioning by high-cardinality IDs (millions of folders) unless bucketing or clustering is used instead; a quick cardinality check is sketched after this list.
- Plan for late-arriving data with partition overwrite or merge patterns.
- Schedule compaction to keep average file sizes near your target and cap file counts per partition.
- Monitor: files per partition, average file size, read amplification (files touched per query), and partition skew.
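One way to sanity-check candidate partition columns before committing to a layout is a rough distinct-count pass. The sketch below assumes PySpark; the table path and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://my-bucket/tables/events/")   # illustrative table path
candidates = ["dt", "country", "app_id", "user_id"]

stats = df.agg(*[F.approx_count_distinct(c).alias(c) for c in candidates]).first()
for col in candidates:
    print(f"{col}: ~{stats[col]} distinct values")

# Rule of thumb: dt/country (tens to hundreds of values) are safe partition keys;
# user_id (millions) belongs in bucketing or clustering, not in the folder layout.
```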
Worked examples
Example 1: Daily partitioning for event logs
Table: events (ts, user_id, app_id, country, json_payload)
- Partition by: dt (derived from ts as date).
- Query pattern: most queries filter by a date range. Partition pruning is effective.
- Compaction: hourly minor compaction to merge micro-batch files; daily major compaction to standardize file sizes to ~256 MB (see the rewrite sketch after this example).
- Trade-off: simple layout, but hourly queries may still scan a full day if dt is the only partition.
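A minor compaction pass for one dt/hr partition might look like the hedged PySpark sketch below. The paths, the assumed partition size, and the staging location are illustrative; production jobs should rely on the table format's atomic rewrite/commit rather than a bare directory swap.

```python
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "s3://my-bucket/tables/events/dt=2026-01-11/hr=09/"          # illustrative partition path
tmp = "s3://my-bucket/tmp/compaction/events/dt=2026-01-11/hr=09/"  # illustrative staging path

target_file_bytes = 256 * 1024**2
partition_bytes = 12 * 1024**3   # assumed: known from table stats or an object-store listing

num_files = max(1, math.ceil(partition_bytes / target_file_bytes))  # 48 files here

# Rewrite the partition into roughly target-sized files in a staging location.
spark.read.parquet(src).coalesce(num_files).write.mode("overwrite").parquet(tmp)

# Validate row counts against src, then swap tmp into place (or let your table
# format's compaction/rewrite operation handle the commit atomically).
```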
Example 2: Over-partitioned by user_id (what goes wrong)
Partition by: user_id, which creates millions of partitions.
- Symptoms: huge metadata, slow listing, memory pressure, and tiny files per user.
- Fix: switch to dt, optionally hr, and use clustering or bucketing by user_id inside partitions. Compact to target sizes.
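If user_id still matters for joins or point lookups, Spark's bucketing can keep it out of the folder layout. The sketch below is illustrative: the source path, bucket count, and database/table names are assumptions, and bucketBy requires writing through saveAsTable.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

events = (spark.read.json("s3://my-bucket/raw/events/")   # illustrative source
               .withColumn("dt", F.to_date("ts")))

(events.write
       .partitionBy("dt")            # low-cardinality folder layout
       .bucketBy(64, "user_id")      # high-cardinality key handled inside each partition
       .sortBy("user_id")
       .mode("overwrite")
       .saveAsTable("analytics.events_bucketed"))   # bucketBy requires a table write
```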
Example 3: Transaction table with upserts
Table: orders (order_id, ts, status, amount, country)
- Partition by: dt (from ts), optionally country if it is commonly filtered and moderate cardinality.
- Compaction: minor compaction after upsert batches to coalesce small delta files; periodic major compaction for uniformity.
- Note: order_id is high cardinality and unsuitable as a partition column.
Design steps (checklist)
- List top 3 query filters and group-by fields.
- Pick 1–2 partition columns aligned with filters and with manageable cardinality.
- Choose target file size (e.g., 256–512 MB for Parquet) and max files per partition.
- Define compaction policy: minor frequency, major frequency, and rewrite thresholds (a config sketch follows this checklist).
- Plan for late data: decide whether to append then compact, or merge with idempotent writes.
- Create a monitoring plan: track files/partition, average size, skew, and query scan volumes.
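One way to make these decisions reviewable is to capture them as a small config object. The field names and default values below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class CompactionPolicy:
    partition_columns: list                 # e.g. ["dt", "hr"]
    target_file_mb: int = 256
    small_file_mb: int = 32                 # files below this are compaction candidates
    max_files_per_partition: int = 500
    minor_frequency: str = "hourly"         # merge micro-batch files
    major_frequency: str = "daily"          # rewrite partitions to uniform sizes
    late_data_window_hours: int = 48        # recent partitions stay eligible for re-compaction
    monitored_kpis: list = field(default_factory=lambda: [
        "files_per_partition", "avg_file_size_mb", "partition_skew", "scan_bytes_per_query",
    ])

events_policy = CompactionPolicy(partition_columns=["dt", "hr"])
print(events_policy)
```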
Practical parameters
- Target file size: commonly 128–1024 MB for columnar files; choose smaller sizes for low-latency queries or slower networks, larger for batch analytics.
- Small-files threshold: treat files under ~32–64 MB as candidates for compaction (tune to your environment).
- Partition depth: keep 1–3 levels (e.g., dt, hr). Too deep increases directories and job overhead.
- Skew guardrails: alert if a single partition has >10x the average file count or size.
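A simple way to watch files per partition and the 10x skew guardrail is to group on input_file_name() by the partition columns. The sketch below assumes PySpark; the table path and threshold are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://my-bucket/tables/events/")   # illustrative table path

files_per_partition = (df
    .select("dt", "hr", F.input_file_name().alias("file"))
    .groupBy("dt", "hr")
    .agg(F.countDistinct("file").alias("file_count")))

avg_count = files_per_partition.agg(F.avg("file_count")).first()[0]

# Flag partitions holding more than 10x the average file count.
skewed = files_per_partition.filter(F.col("file_count") > 10 * avg_count)
skewed.show(truncate=False)
```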
Exercises
Do these to practice. Each exercise has a short self-check list, followed by a solution.
Exercise 1: Design a partitioning scheme for clickstream
Dataset: clicks(ts, user_id, session_id, app_id, country, event_type).
Queries: 1) daily active users by app_id and country; 2) hourly traffic spikes; 3) last 7 days monitoring dashboards filtered by dt range and app_id.
Task: propose partition columns and structure (folder layout), and state your target file size and compaction schedule.
- Self-check: does it prune for dt and hr when needed?
- Self-check: reasonable cardinality and balanced partitions?
- Self-check: compaction frequency fits write rate?
Solution
Proposal:
- Partition columns: dt (date from ts), hr (hour from ts). Optional second dimension only if hourly queries are common.
- Folder layout: dt=YYYY-MM-DD/hr=HH/
- Inside partitions: cluster or bucket by app_id if needed for join patterns.
- Target file size: 256–512 MB (Parquet).
- Compaction: minor hourly to merge micro-batches within each dt/hr; major daily to rewrite previous day’s partitions to target sizes and limit file count.
- Late data: allow updates for last 48 hours; re-compact affected partitions.
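The late-data step could follow an append-then-compact pattern like the sketch below; paths and column names are illustrative. Appending keeps existing rows intact and leaves only the affected partitions to re-compact.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

late = (spark.read.json("s3://my-bucket/raw/late_clicks/")   # illustrative late batch
             .withColumn("dt", F.to_date("ts"))
             .withColumn("hr", F.date_format("ts", "HH")))

# Append keeps existing rows intact; compaction later merges the extra small files.
late.write.partitionBy("dt", "hr").mode("append").parquet("s3://my-bucket/tables/clicks/")

# Re-queue only the dt/hr partitions this batch touched for minor compaction.
affected = [(r["dt"], r["hr"]) for r in late.select("dt", "hr").distinct().collect()]
print(affected)
```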
Exercise 2: Plan compaction for a small-files surge
Scenario: a backfill created ~50,000 files under 8 MB across the last 3 days. Readers are slow.
Task: define thresholds and a two-tier compaction plan with safety controls.
- Self-check: threshold for “small file” is clear?
- Self-check: target sizes and max files per partition are stated?
- Self-check: rollback/verification steps included?
Solution
Plan:
- Thresholds: small file <= 32 MB; target 256–512 MB; cap 500 files per partition after compaction.
- Tier 1 (minor): per partition, merge adjacent small files until target size is approached; skip files already >= 70% of target.
- Tier 2 (major): for partitions still exceeding 500 files or with average size < 128 MB, rewrite partition to uniform target sizes.
- Safety: run in staging mode first on a sample partition; verify row counts and checksums; maintain previous files until validation passes; then atomically swap manifests or commit (a validation sketch follows this solution).
- Schedule: run Tier 1 immediately; Tier 2 overnight; re-check metrics next day.
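The validation step before any swap can be as simple as comparing row counts and a cheap hash-based checksum between the original partition and its compacted rewrite. The sketch below assumes PySpark and illustrative paths; the swap itself depends on your table format.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

original = spark.read.parquet("s3://my-bucket/tables/clicks/dt=2026-01-10/")
rewritten = spark.read.parquet("s3://my-bucket/tmp/compaction/clicks/dt=2026-01-10/")

checks = {
    "rows_match": original.count() == rewritten.count(),
    # Sum of per-row hashes: insensitive to row order, cheap to compute.
    "checksum_match": (
        original.agg(F.sum(F.hash(*original.columns))).first()[0]
        == rewritten.agg(F.sum(F.hash(*rewritten.columns))).first()[0]
    ),
}
print(checks)

if not all(checks.values()):
    raise RuntimeError("Compacted output does not match the source; keep the original files.")
```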
Checklist before you move on
- Your scheme prunes most queries you care about.
- No extreme cardinality in partition columns.
- Compaction policy and thresholds are documented.
- Monitoring KPIs are defined.
Common mistakes and how to self-check
- Over-partitioning by high-cardinality keys. Self-check: number of partitions does not explode with data volume.
- Too many tiny files. Self-check: average file size stays near target; alert when it drops below 50% of target.
- Ignoring late data. Self-check: recent partitions allow updates and automatic re-compaction.
- No pruning on top queries. Self-check: simulate WHERE clauses and confirm partition predicates are applied (see the plan-inspection sketch after this list).
- Compaction without validation. Self-check: validate row counts, sampling, and schema consistency before swapping.
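To self-check pruning, inspect the physical plan and confirm the date filter shows up as a partition filter; in Spark the scan node reports a PartitionFilters entry. The table path below is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

q = (spark.read.parquet("s3://my-bucket/tables/events/")   # illustrative table path
          .filter("dt BETWEEN '2026-01-05' AND '2026-01-11'"))

# In the printed physical plan, look for the scan node's PartitionFilters entry;
# an empty list means every partition will be scanned.
q.explain()
```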
Practical projects
- Project 1: Build a partitioned analytics table. Ingest a week of synthetic events, partition by dt/hr, then measure scans with and without WHERE dt filters; compact and re-measure.
- Project 2: Skew rescue. Create skewed partitions, then redesign partitioning or add a second level (hr or region) and verify skew metrics improve after compaction.
Mini challenge
You have 30 days of data, partitioned by dt only. Analysts run hourly dashboards and complain about scan costs. Propose one change to partitioning and one compaction policy tweak to cut scans and stabilize performance.
Possible direction
- Add hr as a second-level partition or implement clustering by hour inside dt partitions.
- Schedule hourly minor compaction and a daily major compaction for previous-day partitions.
Who this is for
- Data Platform Engineers and Analytics Engineers who manage lakehouse or warehouse tables.
- Anyone responsible for performance, reliability, and cost of data reads/writes.
Prerequisites
- Basic understanding of columnar storage (e.g., Parquet) and batch/stream ingestion.
- Comfort with SQL WHERE filters and file systems or object storage concepts.
Learning path
- Start: Partitioning fundamentals (this lesson) and pruning behavior.
- Next: File sizing, small-files avoidance, and compaction strategies.
- Then: Late data handling, idempotent writes, and metadata health monitoring.
Next steps
- Implement a pilot partitioning scheme on a non-critical table.
- Add metrics: average file size, files per partition, scan bytes per query.
- Automate minor and major compaction with guardrails and validation.
Quick Test
Take the quick test to check your understanding.