What you will learn
Compute and Storage Foundations are the core of a reliable data platform. As a Data Platform Engineer, you will choose table formats (Delta, Iceberg, Hudi), design object storage layouts, plan partitions and compaction, run workloads on Spark/Trino/Flink, isolate resources, set autoscaling, and control cost and retention. Mastering this unlocks faster queries, stable pipelines, and predictable bills.
Why this matters for a Data Platform Engineer
- Performance: Proper partitioning, file sizing, and compaction reduce scan costs and speed up jobs.
- Reliability: Table formats with ACID and manifests prevent broken reads and simplify maintenance.
- Isolation: Queues and autoscaling keep high-priority jobs healthy during spikes.
- Cost: Storage layout and compute policies prevent runaway spend.
- Governance: Lifecycle and retention reduce risk and keep storage tidy.
Who this is for
- Data Platform Engineers standing up or improving a lake/lakehouse.
- Data Engineers migrating to table formats on object storage.
- Infra/Platform folks supporting analytics and streaming compute.
Prerequisites
- Basic SQL and comfort with command-line tools.
- Familiarity with cloud object storage concepts (buckets/prefixes).
- High-level understanding of distributed compute (clusters, executors, containers).
Learning path
- Storage layout & naming: Design bucket/prefix and file naming conventions.
- Table formats: Create Delta/Iceberg/Hudi tables and learn their strengths.
- Partitioning & compaction: Choose partition keys, file sizes, and compaction cadence.
- Compute engines: Run queries and jobs on Spark, Trino, and Flink; pick the right engine per workload.
- Isolation & autoscaling: Configure queues, quotas, and autoscaling limits.
- Cost controls: Estimate and cap storage/compute costs.
- Data lifecycle: Set retention and delete strategies safely.
Worked examples
1) Object storage layout and naming
# Example S3-style layout (applies similarly to other object stores)
s3://company-lake/raw/orders/dt=YYYY-MM-DD/orders__v1_part-00000.parquet
# Bucket, prefix, and file names above are illustrative; adapt them to your own conventions.
- Use dt=YYYY-MM-DD partitions for time-based access.
- Include a schema version in filenames for clarity (orders__v1).
- Keep raw immutable; write curated with table formats and compaction.
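A slightly fuller sketch of one possible zone layout, assuming a single bucket with raw, curated, and marts prefixes (all names are hypothetical):
# Raw: immutable landing data, partitioned by ingestion date
s3://company-lake/raw/orders/dt=2024-05-01/orders__v1_part-00000.parquet
# Curated: managed by a table format (Delta/Iceberg/Hudi), compacted to healthy file sizes
s3://company-lake/curated/orders/
# Marts: aggregated, BI-facing tables
s3://company-lake/marts/daily_revenue/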
2) Create a Delta table in Spark SQL with partitioning and optimize
-- Create Delta table partitioned by date (bucket, prefix, and columns are illustrative)
CREATE TABLE delta.`s3://company-lake/curated/orders` (
order_id BIGINT,
customer_id BIGINT,
amount DOUBLE,
dt DATE
) USING DELTA
PARTITIONED BY (dt);
Notes: Partition by dt for pruning; use OPTIMIZE or equivalent strategies to compact small files and improve read speed.
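A minimal compaction sketch, assuming Delta Lake 2.x+ or Databricks where the OPTIMIZE command is available (table path and date are illustrative):
-- Compact small files across the table
OPTIMIZE delta.`s3://company-lake/curated/orders`;
-- Or limit the rewrite to recent partitions, for example:
-- OPTIMIZE delta.`s3://company-lake/curated/orders` WHERE dt >= '2024-05-01';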
3) Create an Iceberg table in Trino and run compaction
-- Create Iceberg table partitioned by day
CREATE TABLE iceberg.analytics.orders (
order_id BIGINT,
customer_id BIGINT,
amount DOUBLE,
ts TIMESTAMP
) WITH (
location = 's3://company-lake/curated/orders',  -- bucket/prefix illustrative
partitioning = ARRAY['day(ts)']
);
Use Iceberg procedures to rewrite small files and improve scan efficiency.
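A compaction sketch using the Trino Iceberg connector's table procedure (the file size threshold shown is illustrative):
-- Rewrite small files into larger ones for faster scans
ALTER TABLE iceberg.analytics.orders EXECUTE optimize(file_size_threshold => '128MB');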
4) Spark job with dynamic allocation and resource limits
# spark-submit example (Kubernetes or YARN semantics similar)
# On Kubernetes, dynamic allocation also needs spark.dynamicAllocation.shuffleTracking.enabled=true
# (on YARN, the external shuffle service); the job file name below is a placeholder.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=8g \
  --conf spark.sql.files.maxPartitionBytes=134217728 \
  --conf spark.sql.shuffle.partitions=400 \
  your_job.py
Set a bounded range for autoscaling, right-size partitions, and keep shuffle partitions aligned with data volume.
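As a rough worked example, if a stage shuffles about 50 GB and you target roughly 128 MB per partition, then 50 × 1024 / 128 ≈ 400 shuffle partitions, consistent with the spark.sql.shuffle.partitions=400 setting above (the 50 GB figure is only an assumed workload size).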
5) Flink streaming sink rolling policy (file sizing)
// Pseudocode: roll row-format files at ~128 MB or every 15 minutes (bucket/path illustrative).
// Bulk formats such as Parquet roll on checkpoint instead, so tune the checkpoint interval for file sizing.
StreamingFileSink<String> sink = StreamingFileSink
    .forRowFormat(new Path("s3://company-lake/raw/events"), new SimpleStringEncoder<String>("UTF-8"))
    .withRollingPolicy(DefaultRollingPolicy.builder()
        .withMaxPartSize(128 * 1024 * 1024)       // ~128 MB target file size
        .withRolloverInterval(15 * 60 * 1000)     // roll at least every 15 minutes
        .build())
    .build();
Balanced file sizes reduce small-file problems and improve downstream reads.
Drills and quick exercises
- Sketch a bucket/prefix layout for raw, curated, marts. Include time partitions.
- Pick a table format for 3 workloads: batch append-only, CDC upserts, and BI queries.
- Decide a partition key for an orders table and justify why.
- Propose compaction size targets (e.g., 128–512 MB files) and daily/weekly cadence.
- Configure Spark dynamic allocation min/max and explain your limits.
- Estimate monthly cost for 20 TB storage and 500 cluster-hours; suggest savings.
- Draft lifecycle rules to expire raw data after 90 days while keeping curated for 2 years.
Common mistakes and debugging tips
- Too many small files: Symptom: slow scans. Fix: increase writer batch sizes, enable compaction, set rolling policies.
- Over-partitioning (e.g., by hour and region): Symptom: many empty partitions. Fix: partition by day; push region to columns or secondary clustering.
- Wrong table format for upserts: Symptom: complex merges, long jobs. Fix: choose Delta/Hudi/Iceberg configuration supporting MERGE and metadata pruning.
- No workload isolation: Symptom: low-priority jobs starve critical SLAs. Fix: queues/quotas, separate pools, and preemption policies.
- Unbounded autoscaling: Symptom: surprise bills. Fix: set max executors/pods and timeouts; scale by SLO, not by raw throughput.
- Forgotten retention rules: Symptom: ballooning storage. Fix: lifecycle policies with safety delays and table-specific retention; see the retention sketch after this list.
- Compaction conflicts: Symptom: table lock/contention. Fix: schedule compaction off-peak; use table-specific maintenance jobs.
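A table-level retention sketch to pair with object-store lifecycle rules (paths and thresholds are illustrative; VACUUM assumes Delta, expire_snapshots assumes the Trino Iceberg connector):
-- Delta: remove files no longer referenced, keeping a 7-day (168-hour) safety window
VACUUM delta.`s3://company-lake/curated/orders` RETAIN 168 HOURS;
-- Iceberg (Trino): expire snapshots older than 7 days so old data and metadata files can be cleaned up
ALTER TABLE iceberg.analytics.orders EXECUTE expire_snapshots(retention_threshold => '7d');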
Mini project: Build a small lakehouse slice
- Design storage layout for a transactions dataset with dt partitioning and a versioned curated zone.
- Create the table using Delta or Iceberg; load a few days of data.
- Run a compaction/rewrite job to achieve ~256 MB files.
- Query the table with Spark or Trino; verify partition pruning with a date filter (see the pruning check after this list).
- Configure autoscaling (min/max) and a small resource queue for this workload.
- Set lifecycle: raw expires after 30 days, curated retained for 365 days, with a 7-day safety delay.
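A minimal pruning check, assuming the orders table and dt partition column from the worked examples (table name and date are illustrative):
-- Spark SQL: the physical plan should show PartitionFilters on dt; Trino's EXPLAIN output is analogous
EXPLAIN SELECT count(*) FROM orders WHERE dt = DATE '2024-05-01';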
Validation checklist
- Partition pruning reduces scanned data when filtering by date.
- Average file size falls within 128–512 MB after compaction (see the file-size check after this list).
- Jobs do not exceed autoscaling max; critical queries remain responsive.
- Lifecycle rules remove older raw data while keeping curated intact.
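One way to check average file size, assuming a Delta table queried from Spark SQL and an Iceberg table queried from Trino (paths and names are illustrative):
-- Delta: divide sizeInBytes by numFiles from the output
DESCRIBE DETAIL delta.`s3://company-lake/curated/orders`;
-- Iceberg (Trino): average data file size from the files metadata table
SELECT avg(file_size_in_bytes) FROM iceberg.analytics."orders$files";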
Subskills
- Object Storage Layout And Naming — design bucket/prefix structures and file naming that scale.
- Table Formats Delta Iceberg Hudi Basics — create, read, and maintain tables with ACID and metadata.
- Partitioning And Compaction Concepts — choose keys and file sizes; schedule compaction safely.
- Compute Engines Spark Trino Flink Basics — pick the right engine per job and run it efficiently.
- Resource Queues And Workload Isolation — protect SLAs with queues, quotas, and pools.
- Autoscaling Concepts — configure min/max and triggers to match SLOs and budgets.
- Cost Optimization For Storage And Compute — estimate, monitor, and reduce spend.
- Data Lifecycle And Retention Policies — govern deletion, archiving, and version cleanup.
Next steps
- Finish the drills and the mini project.
- Take the skill exam below to check your understanding. Everyone can take it; only logged-in learners have progress saved.
- Apply these patterns to one real dataset in your environment and measure improvements in cost and latency.
Skill exam
Take the exam to validate your knowledge. You can retry. Results are visible to everyone; progress is saved only when logged in.