Lakehouse Table Formats Basics

Learn Lakehouse Table Formats Basics for free with explanations, exercises, and a quick test (for Data Architects).

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

As a Data Architect, you design how analytics data is stored, updated, and queried at scale. Lakehouse table formats (such as Iceberg-like, Delta-like, and Hudi-like approaches) give data lakes ACID transactions, schema evolution, and time travel. Choosing the right layout, partitioning, and maintenance policies directly impacts query speed, cost, and data reliability.

  • Real tasks you will face: define table layouts for petabyte-scale datasets; plan partitioning and compaction; enable upserts and deletes for privacy; keep historical snapshots manageable; make schema changes without downtime.
  • Success looks like: consistent reads during heavy writes, predictable latencies, minimal small-file proliferation, and auditable change history.

Concept explained simply

A lakehouse table format is a set of rules and metadata that turns plain files (often Parquet) into a reliable table with transactions. Think of it as a layer that tracks which files make up the current table, what the schema is, and how data changes over time—so readers always see a consistent view.

Mental model

  • Data files: columnar files on object storage (e.g., Parquet). Immutable once written.
  • Table metadata: schema, partition spec, properties, and a list of referenced data files.
  • Transaction/snapshot log: an ordered history of commits that add/remove files and change metadata. Readers pick a snapshot, writers atomically add a new snapshot.

Result: ACID semantics on a data lake. Readers never see partial writes; writers don’t corrupt each other’s work.
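
To make this concrete, here is a minimal, format-agnostic sketch in Python (all names are illustrative, not any real format's API): the table is an append-only log of snapshots, each snapshot lists the data files that make up the table, and a commit atomically appends the next snapshot while readers stay pinned to the one they started with.

from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # immutable list of data-file paths that make up the table at this point

@dataclass
class TableLog:
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Snapshot:
        # Readers pin a complete snapshot; they never observe a half-written commit.
        return self.snapshots[-1]

    def commit(self, added, removed) -> Snapshot:
        # A writer derives the next file list from the current snapshot and appends it
        # in one step; in a real format this append is an atomic metadata swap.
        base = set(self.current().files) if self.snapshots else set()
        files = tuple(sorted((base - set(removed)) | set(added)))
        snap = Snapshot(snapshot_id=len(self.snapshots) + 1, files=files)
        self.snapshots.append(snap)
        return snap

log = TableLog()
log.commit(added=["part-000.parquet"], removed=[])
log.commit(added=["part-001.parquet"], removed=[])
reader_view = log.current()  # reader pins this snapshot
log.commit(added=["part-002.parquet"], removed=["part-000.parquet"])  # compaction-style rewrite
print(reader_view.files)     # unchanged: the pinned snapshot's files
print(log.current().files)   # latest snapshot after the rewrite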

Core building blocks

  • ACID transactions: atomic commits that create new snapshots.
  • Schemas with evolution: add columns, sometimes rename or reorder (capabilities vary by format). Prefer additive, backward-compatible changes.
  • Partitioning: splits data files by low-cardinality columns (e.g., event_date). Avoid high-cardinality partitions (e.g., user_id).
  • Hidden partitioning/partition transforms: derive partition values from a column (day/month/year of a timestamp, or a bucket/hash of a key) so data stays evenly grouped and skew is avoided without adding an extra physical column.
  • Deletes/Updates:
    • File-level rewrites (copy-on-write): rewrite files with changes applied.
    • Row-level with delete/position files or logs (merge-on-read): track row deletions/updates separately, compact later.
  • Time travel & snapshots: query as of a timestamp/snapshot id.
  • Maintenance: optimize/compact small files into larger ones; expire old snapshots; vacuum orphaned files.

Tip: Picking file sizes

Aim for roughly 128–1024 MB per file, depending on the query engine and workload. Too small means per-file and planning overhead; too large means poor split parallelism and skewed tasks. Keep the target consistent within a table.
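
For instance, compacting toward a ~512 MB target, together with the snapshot expiry and orphan-file cleanup from the maintenance bullet above, might look like the following sketch. It assumes PySpark with an Iceberg-style catalog named demo and its SQL extensions enabled; the procedure names, the table demo.db.user_events, and the timestamp are illustrative, and other formats expose equivalent but differently named operations.

# Sketch only: maintenance commands differ by table format and engine version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-maintenance").getOrCreate()

# Compact small files toward the table's target file size (~512 MB here).
spark.sql("""
  CALL demo.system.rewrite_data_files(
    table => 'demo.db.user_events',
    options => map('target-file-size-bytes', '536870912')
  )
""")

# Expire snapshots older than the retention window to cap metadata and storage.
spark.sql("""
  CALL demo.system.expire_snapshots(
    table => 'demo.db.user_events',
    older_than => TIMESTAMP '2026-01-11 00:00:00'
  )
""")

# Remove files no longer referenced by any snapshot.
spark.sql("CALL demo.system.remove_orphan_files(table => 'demo.db.user_events')")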

Worked examples

Example 1: Clickstream table layout

Scenario: user_events with ~5B daily rows, queries mostly by date ranges and occasionally by country or app_version.

  • Partition: event_date (day). Optional secondary organization: sort within files by country, then user_id to improve locality.
  • Schema: start minimal (event_time, user_id, country, app_version, event_type, payload). Make payload a struct with additive evolution in mind.
  • Compaction: target ~512 MB files; compact hourly to prevent small files from streaming writes.
  • Snapshot retention: keep 7–14 days of snapshots for time travel; longer retention for compliance if required.
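
A sketch of this layout in Spark SQL, assuming an Iceberg-style catalog named demo with its SQL extensions enabled; the catalog and table names, the days() transform standing in for the event_date (day) partition, and the WRITE ORDERED BY clause are assumptions to adapt to your format.

# Sketch only: DDL and sort-order syntax differ across Iceberg/Delta/Hudi-style formats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("user-events-ddl").getOrCreate()

# Day-level hidden partition derived from event_time; ~512 MB target file size.
spark.sql("""
  CREATE TABLE demo.db.user_events (
    event_time   TIMESTAMP,
    user_id      BIGINT,
    country      STRING,
    app_version  STRING,
    event_type   STRING,
    payload      STRUCT<name: STRING, value: STRING>
  )
  USING iceberg
  PARTITIONED BY (days(event_time))
  TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')
""")

# Optional secondary organization: cluster files by country, then user_id.
spark.sql("ALTER TABLE demo.db.user_events WRITE ORDERED BY country, user_id")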

Example 2: Safe schema evolution

Scenario: Add marketing_channel (nullable) and rename column country to country_code.

  • Add column: additive and nullable is usually safe; backfill later if needed.
  • Rename: support varies by format. If safe renames are not guaranteed, deprecate the old column and add the new one (country_code), write to both during the transition, then drop country once consumers have migrated.
  • Enforce invariants with constraints where available (e.g., not null after backfill).
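
A sketch of those steps as Spark SQL statements; the table name demo.db.user_events is illustrative, and whether RENAME COLUMN is safe depends on your format tracking columns by ID rather than by name, so verify before relying on it.

# Sketch only: confirm your format's evolution semantics before applying changes like these.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# 1) Additive, nullable column: usually a metadata-only, backward-compatible change.
spark.sql("ALTER TABLE demo.db.user_events ADD COLUMN marketing_channel STRING")

# 2) Rename: only if the format keeps old data files readable after the rename.
#    Otherwise add country_code, dual-write during the transition, then drop country.
spark.sql("ALTER TABLE demo.db.user_events RENAME COLUMN country TO country_code")

# 3) Tighten invariants after backfill, where the format supports constraints.
# spark.sql("ALTER TABLE demo.db.user_events ALTER COLUMN marketing_channel SET NOT NULL")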

Example 3: Upserts for GDPR deletions

Scenario: You receive delete requests for specific user_ids.

  • Choose a row-level deletion capability: either maintain delete files (merge-on-read) or rewrite affected data files (copy-on-write).
  • For high-volume deletes, prefer strategies that avoid rewriting huge partitions; schedule compaction to reconcile delete files into base data files during low-traffic windows.
  • Validate with snapshot-aware queries: confirm the latest snapshot excludes the deleted rows, and remember that older snapshots still reference them until they expire, so align snapshot retention and vacuum with your deletion SLA.
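
A sketch of the delete-and-validate flow in Spark SQL; the tables demo.db.user_events and demo.db.gdpr_delete_requests are assumptions, and whether the DELETE executes as copy-on-write or merge-on-read is governed by the format and the table's write properties.

# Sketch only: the same DELETE may rewrite whole files (copy-on-write) or emit
# delete files (merge-on-read) depending on table properties; measure both.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-deletes").getOrCreate()

# Row-level delete driven by a table of pending requests.
spark.sql("""
  DELETE FROM demo.db.user_events
  WHERE user_id IN (SELECT user_id FROM demo.db.gdpr_delete_requests)
""")

# Validation: the latest snapshot should no longer contain the requested users.
leftover = spark.sql("""
  SELECT COUNT(*) AS leftover
  FROM demo.db.user_events
  WHERE user_id IN (SELECT user_id FROM demo.db.gdpr_delete_requests)
""").collect()[0]["leftover"]
assert leftover == 0, f"{leftover} rows still visible after delete"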

Example 4: Reproducible reporting via time travel

Scenario: Finance needs last month’s report rerun exactly as of close.

  • Record the snapshot id or cutoff timestamp at report time.
  • Query engines typically support syntax like “AS OF TIMESTAMP/SNAPSHOT”. Use the stored reference to reproduce results.
  • Retain snapshots long enough to cover audit windows; expire older ones to control storage cost.
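
A sketch of pinning and replaying a report cutoff, shown in the Spark SQL TIMESTAMP AS OF style; the exact keywords (VERSION AS OF, FOR SYSTEM_TIME AS OF, snapshot ids vs. version numbers) and the table demo.db.transactions are assumptions.

# Sketch only: record whichever snapshot/version reference your format exposes
# and keep it next to the report artifact.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reproducible-report").getOrCreate()

# At report time: store the cutoff (or snapshot id) alongside the report.
cutoff_ts = "2026-01-01 00:00:00"

# Later rerun: query the table exactly as of the recorded cutoff.
report = spark.sql(f"""
  SELECT country_code, SUM(total_amount) AS revenue
  FROM demo.db.transactions TIMESTAMP AS OF '{cutoff_ts}'
  GROUP BY country_code
""")
report.show()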

Design decisions checklist

  • Define primary query dimensions (time, geo, product) and align partitioning with them.
  • Keep partition columns low-cardinality; use transforms (day/month, bucket) when needed.
  • Target consistent file sizes; compact on a schedule.
  • Choose update strategy: copy-on-write for simplicity; merge-on-read for heavy upserts/deletes.
  • Plan snapshot retention and vacuum policies aligned with audit/compliance.
  • Prefer additive schema changes; handle renames carefully.
  • Document conventions (naming, partition spec, sort order, retention) in a table contract.
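
For the last item, one lightweight option is a machine-readable table contract kept in version control next to the pipeline code; the fields below are an illustrative minimum, not a standard.

# Illustrative table contract; field names and values are suggestions only.
USER_EVENTS_CONTRACT = {
    "table": "demo.db.user_events",
    "partition_spec": "days(event_time)",
    "sort_order": ["country", "user_id"],
    "target_file_size_mb": 512,
    "update_strategy": "merge-on-read",   # copy-on-write is the simpler alternative
    "snapshot_retention_days": 14,
    "compaction_schedule": "hourly",
    "owner": "data-platform@example.com",
}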

Exercises

Do these in a doc or comments. Focus on decisions and trade-offs, not tool-specific commands.

  1. Exercise 1 — Partition and file layout for transactions

    Dataset: ecommerce.transactions (100M/day). Queries are daily/weekly sales totals by country and product_category; streaming inserts 24/7.

    • Propose: partition columns, optional sort/bucket strategy, and target file size.
    • Explain how your choices reduce small files and speed up queries.
  2. Exercise 2 — Schema evolution playbook

    Current schema: order_id, order_ts, user_id, country_code, total_amount. You must add currency_code (default "USD"), and later rename total_amount to amount.

    • Write a step-by-step plan that keeps reads consistent and minimizes rewrites.
    • Include validation steps and rollback ideas.
  3. Exercise 3 — Compaction and retention policy

    Streaming writes produce many small files in a page_views table. Analytics peak is 08:00–20:00 UTC.

    • Define: compaction schedule, thresholds (min file size/count), and target file size.
    • Propose snapshot retention and vacuum settings that balance cost and auditability.

Need a hint?
  • Avoid partitioning by high-cardinality columns like user_id.
  • Compact during off-peak hours; pick a consistent file size target.
  • Prefer additive schema changes; treat renames cautiously.

Common mistakes and self-check

  • Mistake: Partitioning by high-cardinality keys causing millions of tiny partitions. Self-check: Count partitions per day; if it’s in the tens of thousands, rethink.
  • Mistake: Skipping compaction for streaming jobs. Self-check: Is the median file size well below the target, or are many files under 64 MB? If so, schedule compaction (see the sketch after this list).
  • Mistake: Overly long snapshot retention causing runaway storage. Self-check: Track snapshot count and referenced file growth weekly.
  • Mistake: Unsafe renames. Self-check: Confirm your format’s rename semantics; if uncertain, add-new + backfill instead.
  • Mistake: Ignoring delete performance. Self-check: Estimate the blast radius of deletes; test both copy-on-write and merge-on-read patterns.
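
For the compaction self-check, formats that expose file-level metadata make the check easy to automate. A sketch assuming an Iceberg-style files metadata view (the table name and columns are assumptions).

# Sketch only: metadata table names and columns differ by format;
# an Iceberg-style "<table>.files" view exposes file_size_in_bytes per data file.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("small-file-check").getOrCreate()

files = spark.table("demo.db.user_events.files")
stats = files.agg(
    F.expr("percentile_approx(file_size_in_bytes, 0.5)").alias("median_bytes"),
    F.sum(F.when(F.col("file_size_in_bytes") < 64 * 1024 * 1024, 1).otherwise(0)).alias("small_files"),
    F.count("*").alias("total_files"),
).collect()[0]

print(f"median file size: {stats['median_bytes'] / 1024 / 1024:.0f} MB")
print(f"files under 64 MB: {stats['small_files']} of {stats['total_files']}")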

Mini challenge

Design a table for mobile_app_events (~2B/day) queried mostly by 7–30 day windows and occasionally filtered by app_version and region. You must support user deletions within 7 days and reproducible weekly reports.

  • Write your partition spec, file sizing target, update/delete strategy, compaction schedule, and snapshot retention plan.
  • Explain trade-offs in 5 bullet points.

Who this is for

  • Data Architects and Senior Data Engineers designing storage layers.
  • Analytics Engineers who need reliable tables on data lakes.

Prerequisites

  • Comfort with columnar file formats (e.g., Parquet) and object storage concepts.
  • Basic SQL and data partitioning fundamentals.
  • Familiarity with batch and streaming ingestion patterns.

Learning path

  1. Foundations: files, partitions, and ACID snapshots.
  2. Design: partition transforms, sorting/bucketing, file sizing.
  3. Mutations: upserts, deletes, copy-on-write vs merge-on-read.
  4. Governance: schema evolution, constraints, time travel.
  5. Operations: compaction, snapshot expiry, vacuum.

Practical projects

  • Build a clickstream lakehouse table with streaming ingestion; measure query latency before/after compaction.
  • Implement a GDPR deletion pipeline; validate with time-travel checks and monthly audits.
  • Create a schema evolution demo: add/rename columns safely and document the migration.

Next steps

  • Take the Quick Test to confirm understanding.
  • Apply the checklist to a real table in your environment and review results with your team.

Expected output (Exercise 1)

A short design with: partition by order_date (day); an optional bucket/sort by country, then product_category; a ~512 MB target file size; and an explanation of how these choices reduce small files and improve predicate pushdown.

Lakehouse Table Formats Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

