Why this matters
Modern data platforms store data in files (like Parquet), but teams need database-like features: ACID transactions, time travel, schema evolution, and efficient updates. Table formats such as Delta Lake, Apache Iceberg, and Apache Hudi add these capabilities on top of data lakes.
- Reliable pipelines: Ingest CDC streams, upsert late events, and keep tables consistent.
- Self-serve analytics: Analysts query a single stable table instead of many raw files.
- Cost and performance control: Partitioning, compaction, and snapshot management keep queries fast and storage lean.
- Governance: Track schema changes, roll back bad loads, and audit history.
Concept explained simply
A table format is a metadata layer that organizes many data files into a single logical table with transactions. Think of it as “Git for data files.” When you write data, it creates a new snapshot listing which files are active. Reads use a snapshot for a consistent view, even if writes are happening concurrently.
Mental model
- Data files: Parquet/ORC files containing rows.
- Metadata: Manifests and logs that list which files belong to the table and their statistics (a simplified layout is sketched after this list).
- Snapshots: Each commit creates a new version (like Git commits). You can time-travel to old versions.
- Operations: Append, delete, update/upsert, compact, and optimize layout.
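To make this concrete, a simplified Delta-style table directory looks roughly like the sketch below (illustrative only, with file names shortened; Iceberg and Hudi use different file and folder names but the same idea of data files tracked by a metadata/transaction log):
my_table/
  part-00000-<uuid>.parquet         -- data files containing the rows
  part-00001-<uuid>.parquet
  _delta_log/
    00000000000000000000.json       -- commit 0: lists the files in snapshot 0
    00000000000000000001.json       -- commit 1: the next snapshot (files added/removed)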
Key benefits in one glance
- ACID transactions: Prevent partial writes and readers seeing inconsistent data.
- Schema evolution: Add columns safely; sometimes rename and drop too (format-dependent).
- Time travel: Query data as of a timestamp or version to debug or reproduce results.
- Efficient deletes/updates: Use delete files or rewrite strategies without breaking reads.
Delta vs Iceberg vs Hudi (strengths)
- Delta Lake: Tight Spark integration, simple MERGE for upserts, solid time travel, and wide adoption for both batch and streaming Spark workloads. Features like Z-Ordering (vendor/tool-dependent) can improve data clustering.
- Apache Iceberg: Engine-agnostic design, hidden partitioning, strong schema evolution (rename), equality and position deletes, snapshots/tags/branches for advanced data lifecycle.
- Apache Hudi: Write-optimized for frequent upserts/CDC, with two table types, COW (Copy-On-Write) and MOR (Merge-On-Read), plus incremental queries for downstream jobs.
When to prefer what
- Heavy upserts/CDC with frequent small batches: Start with Hudi.
- Multi-engine lakehouse and governance at scale: Iceberg fits well.
- Mainly Spark-based with straightforward upserts and time travel: Delta is a great choice.
Core concepts you will use
1) Partitioning
- Explicit partitioning: You choose keys like event_date or customer_id.
- Hidden partitioning (Iceberg): Queries get partition pruning without users having to filter on partition columns directly; they simply filter on the source column (see the DDL sketch after this list).
- Good practice: Partition by columns that are low-cardinality and frequently filtered on (e.g., by day). Avoid over-partitioning.
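For contrast, here is a small pseudo-SQL sketch of both approaches (the events_explicit and events_hidden table names are illustrative; exact syntax varies by engine and format):
-- Explicit partitioning: the partition column is part of the schema
CREATE TABLE events_explicit (
  event_id BIGINT,
  event_time TIMESTAMP,
  event_date DATE                    -- writers populate it; queries must filter on it
)
USING delta
PARTITIONED BY (event_date);
-- Hidden/transform partitioning (Iceberg): partition derived from event_time
CREATE TABLE events_hidden (
  event_id BIGINT,
  event_time TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_time));   -- queries simply filter on event_time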
2) Schema evolution
- Add columns: Generally safe.
- Rename and reorder: Best supported in Iceberg; other formats vary (an ALTER TABLE sketch follows this list).
- Compatibility: Prefer additive changes. Plan migrations for breaking changes.
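Two pseudo-SQL evolution statements against the prod_events table used later in the examples (the device_type column is hypothetical; rename support and syntax depend on the format and engine):
-- Additive change: generally safe in all formats
ALTER TABLE prod_events ADD COLUMN device_type STRING;
-- Rename: safe in Iceberg because columns are tracked by ID; other formats may need extra settings
ALTER TABLE prod_events RENAME COLUMN payload TO event_payload;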
3) Deletes and upserts
- Deletes: Implemented via delete files (Iceberg position/equality deletes) or file rewrites (a DELETE sketch follows this list).
- Upserts: Delta uses MERGE INTO; Hudi optimizes upserts (COW/MOR). Iceberg supports MERGE and update/delete semantics across engines.
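A minimal pseudo-SQL sketch of row-level deletes; the format decides how to apply them physically (delete files vs. file rewrites), and the customer_id value is just an illustration:
-- Delete a single row, e.g. for a GDPR erasure request
DELETE FROM customers WHERE customer_id = 42;
-- Delete a range; partition pruning limits the rewrite/delete-file work
DELETE FROM prod_events WHERE event_time < TIMESTAMP '2025-01-01 00:00:00';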
4) Snapshots, time travel, and rollback
- Every write creates a new snapshot/version.
- Time travel: Query a prior snapshot by version or timestamp.
- Rollback: Reset table pointer to a previous snapshot to recover from bad writes.
5) Maintenance
- Compaction/Optimize: Periodically rewrite many small files into fewer, larger files (example commands follow this list).
- Vacuum/Expire snapshots: Remove old files that are no longer referenced.
- Statistics and clustering: Keep metadata fresh; consider clustering if available.
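Typical maintenance commands, sketched under the assumption of a Spark-based setup (command and procedure names vary by format, engine, and version):
-- Delta: compact small files, then remove unreferenced files older than the retention window
OPTIMIZE customers;
VACUUM customers RETAIN 168 HOURS;
-- Iceberg: compact data files and expire snapshots older than a cutoff
CALL system.rewrite_data_files('prod_events');
CALL system.expire_snapshots('prod_events', TIMESTAMP '2026-01-01 00:00:00');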
Worked examples
Example 1: Create an Iceberg table with hidden partitioning by day
-- Pseudo-SQL (engine-specific syntax may vary)
CREATE TABLE prod_events (
  event_id BIGINT,
  event_time TIMESTAMP,
  user_id STRING,
  country STRING,
  payload STRUCT<...>
)
USING iceberg
PARTITIONED BY (days(event_time));
-- Query filters can prune partitions automatically
SELECT country, count(*)
FROM prod_events
WHERE event_time >= '2026-01-01' AND event_time < '2026-01-08'
GROUP BY country;
Why it works: Hidden/transform partitioning keeps queries simple while enabling efficient pruning.
Example 2: Delta Lake upsert with MERGE
MERGE INTO customers AS tgt
USING updates AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Why it works: MERGE handles both inserts and updates atomically, ideal for CDC batches.
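If the CDC feed also carries an operation flag (the src.op column below is hypothetical, with 'D' marking deletes), the same MERGE pattern can apply deletes within the same atomic commit, sketched here:
MERGE INTO customers AS tgt
USING updates AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND src.op <> 'D' THEN INSERT *;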
Example 3: Hudi COW vs MOR
-- Copy-On-Write (COW): Updates rewrite whole data files (simpler, faster reads; heavier writes)
-- Merge-On-Read (MOR): Updates append to log files; readers merge base files + logs (faster writes, some read cost)
-- Upsert into a Hudi table (pseudo config; exact option names vary by version)
-- hoodie.datasource.write.table.type = 'MERGE_ON_READ'
-- hoodie.datasource.write.operation = 'upsert'
Why it works: MOR is great for frequent updates with lower write latency; readers pay some merge cost.
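For a slightly more concrete (but still hedged) sketch, a Spark SQL table definition for a MOR Hudi table might look roughly like this; the customers_hudi name is illustrative and exact property names depend on the Hudi version:
CREATE TABLE customers_hudi (
  customer_id BIGINT,
  name STRING,
  updated_at TIMESTAMP
)
USING hudi
TBLPROPERTIES (
  type = 'mor',                    -- Merge-On-Read table type
  primaryKey = 'customer_id',      -- record key matched on upsert
  preCombineField = 'updated_at'   -- latest value wins when the same key arrives twice
);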
Example 4: Time travel and rollback (Iceberg)
-- List snapshots (engine/SQL varies)
SELECT * FROM prod_events.snapshots;
-- Query historical data
SELECT count(*) FROM prod_events VERSION AS OF 123456789;
-- Roll back to a good snapshot
CALL system.rollback_to_snapshot('prod_events', 123456789);
Why it works: Snapshots let you reproduce past results or quickly recover from bad loads.
Hands-on exercises
Do these in any local or cloud environment that supports Delta/Iceberg/Hudi. If you cannot run code, complete them as design exercises and write your answers.
Exercise 1: Choose a format and define an upsert pipeline
Scenario: You receive hourly CDC for a customers table (5–10% of rows change per day). Analysts need fresh data and simple queries.
- Pick Delta, Iceberg, or Hudi and justify.
- Write a table definition and an upsert/MERGE example.
- Explain partitioning choice and maintenance (compaction/vacuum).
Hints
- Frequent upserts may favor Hudi or Delta depending on engine and workload.
- Partition by a date derived from updated_at (e.g., updated_at::date) so daily queries prune well.
- Plan for compaction if many small files are created.
Exercise 2: Time travel and cleanup plan
Scenario: A bad load duplicated events yesterday. Design a recovery plan using snapshots/time travel and describe how to safely remove unreferenced files afterward.
- Show how to identify the correct snapshot.
- Describe the rollback/time travel query.
- Specify retention and snapshot expiry settings to reclaim storage.
Hints
- List snapshots to find the pre-incident version.
- Rollback resets the table pointer; verify with a count check.
- Run expire/clean commands only after you are sure readers no longer need the old snapshot.
Checklist before you move on
- You can explain table format vs file format.
- You know the core strengths of Delta, Iceberg, and Hudi.
- You can design partitioning and explain trade-offs.
- You understand snapshots, time travel, rollback, and cleanup.
- You can outline compaction/optimization steps.
Common mistakes and self-check
- Over-partitioning: Too many small files slow down planning and reads. Self-check: Is your average data file well under ~128 MB? If so, consider compaction and coarser partitions (a quick check query is sketched after this list).
- Ignoring delete semantics: Choosing equality vs position deletes incorrectly. Self-check: Are you deleting many single rows or ranges? Pick the right strategy.
- Unsafe schema changes: Renaming/dropping without planning. Self-check: Can all consumers handle the change? Prefer additive changes.
- No retention policy: Storage grows endlessly. Self-check: Are old snapshots expired regularly? Is there a legal/audit requirement?
- Mixed write engines without coordination: Concurrency issues. Self-check: Confirm engine support and isolation guarantees for the chosen format.
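As a hedged example of the file-size self-check, Iceberg exposes a files metadata table you can query directly (Delta offers similar information via DESCRIBE DETAIL; names vary by format and engine):
SELECT count(*) AS data_files,
       avg(file_size_in_bytes) / 1024 / 1024 AS avg_file_mb
FROM prod_events.files;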
Practical projects
- CDC Customers Lakehouse: Build a customers table with daily upserts, time travel checks, and automated compaction.
- Event Analytics with Rollback Drill: Ingest events, run a faulty load on purpose, then roll back and verify with historical queries.
- Multi-Engine Read: Write with one engine and read with another (e.g., Spark write, Trino/Presto read) to test interoperability.
Suggested steps
- Create bronze (raw) and silver (clean) tables.
- Design partitioning by date and ensure statistics.
- Implement upserts and deletes.
- Add compaction and snapshot expiry jobs.
- Document time travel and rollback procedures.
Who this is for
- Data Platform Engineers establishing lakehouse standards.
- Data Engineers building CDC, batch, and streaming pipelines.
- Analytics Engineers needing reliable tables for BI.
Prerequisites
- Comfort with SQL and basic Spark/engine usage.
- Understanding of Parquet/ORC and object storage concepts.
- Familiarity with batch vs streaming data flows.
Learning path
- Before: File formats (Parquet), partitioning, and object storage basics.
- Now: Table format fundamentals (this lesson) and practical operations.
- Next: Governance, catalog integration, performance tuning, and cost control.
Next steps
- Pick a default table format standard for your platform.
- Codify partitioning and retention policies.
- Automate compaction and snapshot expiry in orchestration.
Mini challenge
Your marketing_events table receives late-arriving updates and occasional deletes. Design the table in your chosen format, show how you will ingest late updates daily, and how you will handle GDPR deletion requests efficiently without degrading reads. Include your partitioning strategy and maintenance cadence.
Quick Test
Take the quick test below to check your understanding.