Why this matters
Modern data platforms store data in files (like Parquet), but teams need database-like features: ACID transactions, time travel, schema evolution, and efficient updates. Table formats such as Delta Lake, Apache Iceberg, and Apache Hudi add these capabilities on top of data lakes.
- Reliable pipelines: Ingest CDC streams, upsert late events, and keep tables consistent.
- Self-serve analytics: Analysts query a single stable table instead of many raw files.
- Cost and performance control: Partitioning, compaction, and snapshot management keep queries fast and storage lean.
- Governance: Track schema changes, roll back bad loads, and audit history.
Concept explained simply
A table format is a metadata layer that organizes many data files into a single logical table with transactions. Think of it as “Git for data files.” When you write data, it creates a new snapshot listing which files are active. Reads use a snapshot for a consistent view, even if writes are happening concurrently.
Mental model
- Data files: Parquet/ORC files containing rows.
- Metadata: Manifests and logs that list which files belong to the table and their statistics (a simplified layout is sketched after this list).
- Snapshots: Each commit creates a new version (like Git commits). You can time-travel to old versions.
- Operations: Append, delete, update/upsert, compact, and optimize layout.
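To make this concrete, a simplified Delta-style table directory looks roughly like the sketch below (illustrative only, with file names shortened; Iceberg and Hudi use different file and folder names but the same idea of data files tracked by a metadata/transaction log):
my_table/
  part-00000-<uuid>.parquet         -- data files containing the rows
  part-00001-<uuid>.parquet
  _delta_log/
    00000000000000000000.json       -- commit 0: lists the files in snapshot 0
    00000000000000000001.json       -- commit 1: the next snapshot (files added/removed)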
Key benefits in one glance
- ACID transactions: Prevent partial writes and readers seeing inconsistent data.
- Schema evolution: Add columns safely; sometimes rename and drop too (format-dependent).
- Time travel: Query data as of a timestamp or version to debug or reproduce results.
- Efficient deletes/updates: Use delete files or rewrite strategies without breaking reads.
Delta vs Iceberg vs Hudi (strengths)
- Delta Lake: Tight Spark integration, simple MERGE for upserts, solid time travel, and wide adoption for both batch and streaming Spark workloads. Features like Z-Ordering (vendor/tool-dependent) can improve data clustering.
- Apache Iceberg: Engine-agnostic design, hidden partitioning, strong schema evolution (rename), equality and position deletes, snapshots/tags/branches for advanced data lifecycle.
- Apache Hudi: Write-optimized for frequent upserts/CDC, with two table types, COW (Copy-On-Write) and MOR (Merge-On-Read), plus incremental queries for downstream jobs.
When to prefer what
- Heavy upserts/CDC with frequent small batches: Start with Hudi.
- Multi-engine lakehouse and governance at scale: Iceberg fits well.
- Mainly Spark-based with straightforward upserts and time travel: Delta is a great choice.
Core concepts you will use
1) Partitioning
- Explicit partitioning: You choose keys like event_date or customer_id.
- Hidden partitioning (Iceberg): Queries get partition pruning without users having to filter on partition columns directly; they simply filter on the source column (see the DDL sketch after this list).
- Good practice: Partition by columns that are low-cardinality and frequently filtered on (e.g., by day). Avoid over-partitioning.
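For contrast, here is a small pseudo-SQL sketch of both approaches (the events_explicit and events_hidden table names are illustrative; exact syntax varies by engine and format):
-- Explicit partitioning: the partition column is part of the schema
CREATE TABLE events_explicit (
  event_id BIGINT,
  event_time TIMESTAMP,
  event_date DATE                    -- writers populate it; queries must filter on it
)
USING delta
PARTITIONED BY (event_date);
-- Hidden/transform partitioning (Iceberg): partition derived from event_time
CREATE TABLE events_hidden (
  event_id BIGINT,
  event_time TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_time));   -- queries simply filter on event_time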
2) Schema evolution
- Add columns: Generally safe.
- Rename and reorder: Best supported in Iceberg; other formats vary (an ALTER TABLE sketch follows this list).
- Compatibility: Prefer additive changes. Plan migrations for breaking changes.
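Two pseudo-SQL evolution statements against the prod_events table used later in the examples (the device_type column is hypothetical; rename support and syntax depend on the format and engine):
-- Additive change: generally safe in all formats
ALTER TABLE prod_events ADD COLUMN device_type STRING;
-- Rename: safe in Iceberg because columns are tracked by ID; other formats may need extra settings
ALTER TABLE prod_events RENAME COLUMN payload TO event_payload;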
3) Deletes and upserts
- Deletes: Implemented via delete files (Iceberg position/equality deletes) or file rewrites (a DELETE sketch follows this list).
- Upserts: Delta uses MERGE INTO; Hudi optimizes upserts (COW/MOR). Iceberg supports MERGE and update/delete semantics across engines.
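A minimal pseudo-SQL sketch of row-level deletes; the format decides how to apply them physically (delete files vs. file rewrites), and the customer_id value is just an illustration:
-- Delete a single row, e.g. for a GDPR erasure request
DELETE FROM customers WHERE customer_id = 42;
-- Delete a range; partition pruning limits the rewrite/delete-file work
DELETE FROM prod_events WHERE event_time < TIMESTAMP '2025-01-01 00:00:00';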
4) Snapshots, time travel, and rollback
- Every write creates a new snapshot/version.
- Time travel: Query a prior snapshot by version or timestamp.
- Rollback: Reset table pointer to a previous snapshot to recover from bad writes.
5) Maintenance
- Compaction/Optimize: Periodically rewrite many small files into fewer, larger files (example commands follow this list).
- Vacuum/Expire snapshots: Remove old files that are no longer referenced.
- Statistics and clustering: Keep metadata fresh; consider clustering if available.
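Typical maintenance commands, sketched under the assumption of a Spark-based setup (command and procedure names vary by format, engine, and version):
-- Delta: compact small files, then remove unreferenced files older than the retention window
OPTIMIZE customers;
VACUUM customers RETAIN 168 HOURS;
-- Iceberg: compact data files and expire snapshots older than a cutoff
CALL system.rewrite_data_files('prod_events');
CALL system.expire_snapshots('prod_events', TIMESTAMP '2026-01-01 00:00:00');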
Worked examples
Example 1: Create an Iceberg table with hidden partitioning by day
-- Pseudo-SQL (engine-specific syntax may vary)
CREATE TABLE prod_events (
  event_id BIGINT,
  event_time TIMESTAMP,
  user_id STRING,
  country STRING,
  payload STRUCT<...>
)
USING iceberg
PARTITIONED BY (days(event_time));
-- Query filters can prune partitions automatically
SELECT country, count(*)
FROM prod_events
WHERE event_time >= '2026-01-01' AND event_time < '2026-01-08'
GROUP BY country;
Why it works: Hidden/transform partitioning keeps queries simple while enabling efficient pruning.
Example 2: Delta Lake upsert with MERGE
MERGE INTO customers AS tgt
USING updates AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Why it works: MERGE handles both inserts and updates atomically, ideal for CDC batches.
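If the CDC feed also carries an operation flag (the src.op column below is hypothetical, with 'D' marking deletes), the same MERGE pattern can apply deletes within the same atomic commit, sketched here:
MERGE INTO customers AS tgt
USING updates AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND src.op <> 'D' THEN INSERT *;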
Example 3: Hudi COW vs MOR
-- Copy-On-Write (COW): Updates rewrite whole data files (simpler, faster reads; heavier writes)
-- Merge-On-Read (MOR): Updates append to log files; readers merge base files + logs (faster writes, some read cost)
-- Upsert into a Hudi table (pseudo config; exact option names vary by version)
-- hoodie.datasource.write.table.type = 'MERGE_ON_READ'
-- hoodie.datasource.write.operation = 'upsert'
Why it works: MOR is great for frequent updates with lower write latency; readers pay some merge cost.
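For a slightly more concrete (but still hedged) sketch, a Spark SQL table definition for a MOR Hudi table might look roughly like this; the customers_hudi name is illustrative and exact property names depend on the Hudi version:
CREATE TABLE customers_hudi (
  customer_id BIGINT,
  name STRING,
  updated_at TIMESTAMP
)
USING hudi
TBLPROPERTIES (
  type = 'mor',                    -- Merge-On-Read table type
  primaryKey = 'customer_id',      -- record key matched on upsert
  preCombineField = 'updated_at'   -- latest value wins when the same key arrives twice
);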
Example 4: Time travel and rollback (Iceberg)
-- List snapshots (engine/SQL varies)
SELECT * FROM prod_events.snapshots;
-- Query historical data
SELECT count(*) FROM prod_events VERSION AS OF 123456789;
-- Roll back to a good snapshot
CALL system.rollback_to_snapshot('prod_events', 123456789);
Why it works: Snapshots let you reproduce past results or quickly recover from bad loads.
Hands-on exercises
Do these in any local or cloud environment that supports Delta/Iceberg/Hudi. If you cannot run code, complete them as design exercises and write your answers.
Exercise 1: Choose a format and define an upsert pipeline
Scenario: You receive hourly CDC for a customers table (5–10% of rows change per day). Analysts need fresh data and simple queries.
- Pick Delta, Iceberg, or Hudi and justify.
- Write a table definition and an upsert/MERGE example.
- Explain partitioning choice and maintenance (compaction/vacuum).
Hints
- Frequent upserts may favor Hudi or Delta depending on engine and workload.
- Partition by a date derived from updated_at (e.g., updated_at::date) so daily queries prune well.
- Plan for compaction if many small files are created.
Exercise 2: Time travel and cleanup plan
Scenario: A bad load duplicated events yesterday. Design a recovery plan using snapshots/time travel and describe how to safely remove unreferenced files afterward.
- Show how to identify the correct snapshot.
- Describe the rollback/time travel query.
- Specify retention and snapshot expiry settings to reclaim storage.
Hints
- List snapshots to find the pre-incident version.
- Rollback resets the table pointer; verify with a count check.
- Run expire/clean commands only after you are sure readers no longer need the old snapshot.
Checklist before you move on
- You can explain table format vs file format.
- You know the core strengths of Delta, Iceberg, and Hudi.
- You can design partitioning and explain trade-offs.
- You understand snapshots, time travel, rollback, and cleanup.
- You can outline compaction/optimization steps.
Common mistakes and self-check
- Over-partitioning: Too many small files slow down planning and reads. Self-check: Is your average data file well under ~128 MB? If so, consider compaction and coarser partitions (a quick check query is sketched after this list).
- Ignoring delete semantics: Choosing equality vs position deletes incorrectly. Self-check: Are you deleting many single rows or ranges? Pick the right strategy.
- Unsafe schema changes: Renaming/dropping without planning. Self-check: Can all consumers handle the change? Prefer additive changes.
- No retention policy: Storage grows endlessly. Self-check: Are old snapshots expired regularly? Is there a legal/audit requirement?
- Mixed write engines without coordination: Concurrency issues. Self-check: Confirm engine support and isolation guarantees for the chosen format.
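As a hedged example of the file-size self-check, Iceberg exposes a files metadata table you can query directly (Delta offers similar information via DESCRIBE DETAIL; names vary by format and engine):
SELECT count(*) AS data_files,
       avg(file_size_in_bytes) / 1024 / 1024 AS avg_file_mb
FROM prod_events.files;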
Practical projects
- CDC Customers Lakehouse: Build a customers table with daily upserts, time travel checks, and automated compaction.
- Event Analytics with Rollback Drill: Ingest events, run a faulty load on purpose, then roll back and verify with historical queries.
- Multi-Engine Read: Write with one engine and read with another (e.g., Spark write, Trino/Presto read) to test interoperability.
Suggested steps
- Create bronze (raw) and silver (clean) tables.
- Design partitioning by date and ensure statistics.
- Implement upserts and deletes.
- Add compaction and snapshot expiry jobs.
- Document time travel and rollback procedures.
Who this is for
- Data Platform Engineers establishing lakehouse standards.
- Data Engineers building CDC, batch, and streaming pipelines.
- Analytics Engineers needing reliable tables for BI.
Prerequisites
- Comfort with SQL and basic Spark/engine usage.
- Understanding of Parquet/ORC and object storage concepts.
- Familiarity with batch vs streaming data flows.
Learning path
- Before: File formats (Parquet), partitioning, and object storage basics.
- Now: Table format fundamentals (this lesson) and practical operations.
- Next: Governance, catalog integration, performance tuning, and cost control.
Next steps
- Pick a default table format standard for your platform.
- Codify partitioning and retention policies.
- Automate compaction and snapshot expiry in orchestration.
Mini challenge
Your marketing_events table receives late-arriving updates and occasional deletes. Design the table in your chosen format, show how you will ingest late updates daily, and how you will handle GDPR deletion requests efficiently without degrading reads. Include your partitioning strategy and maintenance cadence.
Quick Test
Take the quick test below to check your understanding.