Why this matters
As a Data Architect, you design how analytics data is stored, updated, and queried at scale. Lakehouse table formats (Apache Iceberg, Delta Lake, Apache Hudi, and similar) give data lakes ACID transactions, schema evolution, and time travel. Choosing the right layout, partitioning, and maintenance policies directly impacts query speed, cost, and data reliability.
- Real tasks you will face: define table layouts for petabyte-scale datasets; plan partitioning and compaction; enable upserts and deletes for privacy; keep historical snapshots manageable; make schema changes without downtime.
- Success looks like: consistent reads during heavy writes, predictable latencies, minimal small-file degeneration, and auditable change history.
Concept explained simply
A lakehouse table format is a set of rules and metadata that turns plain files (often Parquet) into a reliable table with transactions. Think of it as a layer that tracks which files make up the current table, what the schema is, and how data changes over time—so readers always see a consistent view.
Mental model
- Data files: columnar files on object storage (e.g., Parquet). Immutable once written.
- Table metadata: schema, partition spec, properties, and a list of referenced data files.
- Transaction/snapshot log: an ordered history of commits that add/remove files and change metadata. Readers pick a snapshot, writers atomically add a new snapshot.
Result: ACID semantics on a data lake. Readers never see partial writes; writers don’t corrupt each other’s work.
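To make the snapshot idea concrete, here is a toy Python sketch (not any real format's metadata layout) of why readers keep a consistent view while a writer commits: each snapshot is an immutable list of file references, and a commit only appends a new snapshot.

```python
# Toy sketch of the snapshot idea (not any real format's metadata layout).
# Each snapshot is an immutable list of data-file paths; a commit publishes
# a new snapshot, and readers keep using whichever snapshot they picked.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple[str, ...]  # immutable Parquet files on object storage


@dataclass
class TableMetadata:
    snapshots: list[Snapshot] = field(default_factory=list)

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, added: list[str], removed: set[str]) -> Snapshot:
        """Atomically publish a new snapshot that adds/removes files."""
        base = self.current().data_files if self.snapshots else ()
        files = tuple(f for f in base if f not in removed) + tuple(added)
        snap = Snapshot(snapshot_id=len(self.snapshots), data_files=files)
        self.snapshots.append(snap)  # readers on older snapshots are unaffected
        return snap


table = TableMetadata()
table.commit(added=["s3://lake/events/file-0.parquet"], removed=set())
reader_view = table.current()                 # a reader pins snapshot 0
table.commit(added=["s3://lake/events/file-1.parquet"], removed=set())
assert reader_view.data_files == ("s3://lake/events/file-0.parquet",)
```

Real formats layer manifests, column statistics, and optimistic-concurrency checks on top of this, but the reader/writer isolation works the same way.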
Core building blocks
- ACID transactions: atomic commits that create new snapshots.
- Schemas with evolution: add columns, sometimes rename or reorder (capabilities vary by format). Prefer additive, backward-compatible changes.
- Partitioning: groups data files by low-cardinality columns (e.g., event_date) so queries can prune files they don't need. Avoid high-cardinality partition keys (e.g., user_id).
- Hidden partitioning/partition transforms: derive partition values from a column (day/month/year of a timestamp, bucket/hash of an id) so readers filter on the original column and high-cardinality keys don't create skew.
- Deletes/Updates:
- File-level rewrites (copy-on-write): rewrite files with changes applied.
- Row-level with delete/position files or logs (merge-on-read): track row deletions/updates separately, compact later.
- Time travel & snapshots: query as of a timestamp/snapshot id.
- Maintenance: optimize/compact small files into larger ones; expire old snapshots; vacuum orphaned files.
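These maintenance operations map to engine commands. A hedged sketch in Spark SQL run from Python, assuming a SparkSession already configured for your table format and a table named events; the Delta-style OPTIMIZE/VACUUM statements are one possible spelling, and Iceberg-style procedure calls are noted in comments, so check your format's documentation for the exact syntax.

```python
# Hedged maintenance sketch in Spark SQL. Assumes a SparkSession configured
# for your table format and a table named `events`; exact commands differ
# per format (Delta-style shown, Iceberg-style noted below).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files into larger ones (Delta-style syntax).
spark.sql("OPTIMIZE events")

# Drop data files no longer referenced by any retained snapshot (Delta-style);
# 168 hours keeps roughly 7 days of time travel.
spark.sql("VACUUM events RETAIN 168 HOURS")

# Iceberg-style equivalents are exposed as stored procedures, e.g.:
# spark.sql("CALL catalog.system.rewrite_data_files(table => 'db.events')")
# spark.sql("CALL catalog.system.expire_snapshots(table => 'db.events')")
```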
Tip: Picking file sizes
Aim for 128–1024 MB per file, depending on the query engine and workload. Files that are too small add metadata and task-scheduling overhead; files that are too large reduce parallelism and skew task runtimes. Keep the target consistent within a table; a quick sizing calculation is sketched below.
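For instance, a back-of-the-envelope check of how many files one daily partition produces at a given target size; the row count and compressed row size here are assumptions, not measurements.

```python
# Back-of-the-envelope file-count check for one daily partition. The row
# count and compressed row size below are assumptions, not measurements.
rows_per_day = 5_000_000_000            # e.g. the clickstream table in Example 1
bytes_per_row_compressed = 60           # assumed average after Parquet encoding
target_file_mb = 512

partition_bytes = rows_per_day * bytes_per_row_compressed
files_per_partition = partition_bytes / (target_file_mb * 1024**2)
print(f"~{partition_bytes / 1024**3:.0f} GiB/day -> ~{files_per_partition:.0f} files")
# ~279 GiB/day -> ~559 files of ~512 MB: a manageable count. Thousands of
# tiny files per day instead would signal missing compaction.
```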
Worked examples
Example 1: Clickstream table layout
Scenario: user_events with ~5B daily rows, queries mostly by date ranges and occasionally by country or app_version.
- Partition: event_date (day). Optional secondary organization: sort within files by country, then user_id to improve locality.
- Schema: start minimal (event_time, user_id, country, app_version, event_type, payload). Make payload a struct with additive evolution in mind.
- Compaction: target ~512 MB files; compact hourly so the small files produced by streaming writes don't accumulate.
- Snapshot retention: keep 7–14 days of snapshots for time travel; longer retention for compliance if required.
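A hedged sketch of this layout as Spark SQL DDL run from Python. The Iceberg-style spelling is shown (days() transform, the write.target-file-size-bytes property, and WRITE ORDERED BY, which needs the format's SQL extensions); a Delta-like table would instead partition on an explicit event_date column. The payload struct fields are placeholders.

```python
# Hedged DDL sketch for the clickstream layout (Iceberg-style spelling;
# adjust for your format). Payload struct fields are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE analytics.user_events (
        event_time   TIMESTAMP,
        user_id      BIGINT,
        country      STRING,
        app_version  STRING,
        event_type   STRING,
        payload      STRUCT<name: STRING, value: STRING>
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
    TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')  -- ~512 MB
""")

# Cluster rows within files by country, then user_id, to improve locality
# (requires the format's SQL extensions).
spark.sql("ALTER TABLE analytics.user_events WRITE ORDERED BY country, user_id")
```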
Example 2: Safe schema evolution
Scenario: Add marketing_channel (nullable) and rename column country to country_code.
- Add column: additive and nullable is usually safe; backfill later if needed.
- Rename: support varies by format. If safe renames are not guaranteed, deprecate the old column and add a new one (country_code), write to both during the transition, then clean up.
- Enforce invariants with constraints where available (e.g., not null after backfill).
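A hedged sketch of that playbook in Spark SQL; exact ALTER TABLE spelling varies by format (some engines use ADD COLUMNS (...)), and the analytics.user_events table name is carried over from Example 1.

```python
# Hedged schema-evolution sketch; exact ALTER TABLE syntax varies by format.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: additive, nullable column -- generally safe.
spark.sql("ALTER TABLE analytics.user_events ADD COLUMN marketing_channel STRING")

# Step 2: instead of renaming in place, add the target column, dual-write
# from producers during the transition, then backfill existing rows.
spark.sql("ALTER TABLE analytics.user_events ADD COLUMN country_code STRING")
spark.sql("""
    UPDATE analytics.user_events
    SET country_code = country
    WHERE country_code IS NULL
""")  # the backfill rewrites data, so schedule it deliberately

# Step 3 (later, once all readers use country_code):
# spark.sql("ALTER TABLE analytics.user_events DROP COLUMN country")
```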
Example 3: Upserts for GDPR deletions
Scenario: You receive delete requests for specific user_ids.
- Choose a row-level deletion capability: either maintain delete files (merge-on-read) or rewrite affected data files (copy-on-write).
- For high-volume deletes, prefer strategies that avoid rewriting huge partitions; schedule compaction to reconcile delete files into base data files during low-traffic windows.
- Validate by time-travel queries: confirm the latest snapshot excludes deleted rows.
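A hedged sketch of the delete-and-validate flow. MERGE ... WHEN MATCHED THEN DELETE is widely supported through Spark SQL; compliance.deletion_requests is an assumed staging table holding the user_ids to erase.

```python
# Hedged sketch: apply row-level deletes via MERGE, then validate that the
# latest snapshot no longer exposes the requested users.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Apply the deletions (the format decides copy-on-write vs merge-on-read,
# often via table properties).
spark.sql("""
    MERGE INTO analytics.user_events t
    USING compliance.deletion_requests d
    ON t.user_id = d.user_id
    WHEN MATCHED THEN DELETE
""")

# Validate against the current snapshot.
remaining = spark.sql("""
    SELECT count(*) AS n
    FROM analytics.user_events
    WHERE user_id IN (SELECT user_id FROM compliance.deletion_requests)
""").first()["n"]
assert remaining == 0, f"{remaining} rows still visible after the delete"
```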
Example 4: Reproducible reporting via time travel
Scenario: Finance needs last month’s report rerun exactly as of close.
- Record the snapshot id or cutoff timestamp at report time.
- Query engines typically support syntax like “AS OF TIMESTAMP/SNAPSHOT”. Use the stored reference to reproduce results.
- Retain snapshots long enough to cover audit windows; expire older ones to control storage cost.
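A hedged sketch of rerunning a report from a pinned snapshot. The finance.orders table, its columns, and the snapshot id are placeholders, and the time-travel clause differs by format (Delta-style VERSION AS OF / TIMESTAMP AS OF shown; Iceberg-style is FOR VERSION AS OF / FOR TIMESTAMP AS OF).

```python
# Hedged time-travel sketch: rerun last month's report against the snapshot
# recorded at close. Table, columns, and snapshot id are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

close_snapshot_id = 1742  # recorded when the original report was produced

report = spark.sql(f"""
    SELECT country_code, sum(total_amount) AS revenue
    FROM finance.orders VERSION AS OF {close_snapshot_id}
    GROUP BY country_code
""")
report.show()
```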
Design decisions checklist
- Define primary query dimensions (time, geo, product) and align partitioning with them.
- Keep partition columns low-cardinality; use transforms (day/month, bucket) when needed.
- Target consistent file sizes; compact on a schedule.
- Choose update strategy: copy-on-write for simplicity; merge-on-read for heavy upserts/deletes.
- Plan snapshot retention and vacuum policies aligned with audit/compliance.
- Prefer additive schema changes; handle renames carefully.
- Document conventions (naming, partition spec, sort order, retention) in a table contract.
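One lightweight way to capture such a table contract is as plain data kept in version control next to the pipeline code. The field names and values below are illustrative, not a standard.

```python
# Illustrative "table contract" captured as plain data. Field names and
# values are examples, not a standard schema.
USER_EVENTS_CONTRACT = {
    "table": "analytics.user_events",
    "owner": "data-platform@example.com",
    "partition_spec": ["days(event_time)"],
    "sort_order": ["country", "user_id"],
    "target_file_size_mb": 512,
    "update_strategy": "merge-on-read",            # heavy streaming upserts
    "compaction": {"schedule": "hourly", "small_file_threshold_mb": 64},
    "snapshot_retention_days": 14,                 # covers report reruns + audits
    "schema_policy": "additive only; renames via add-new column + backfill",
}
```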
Exercises
Do these in a doc or comments. Focus on decisions and trade-offs, not tool-specific commands.
Exercise 1 — Partition and file layout for transactions
Dataset: ecommerce.transactions (~100M rows/day). Queries compute daily/weekly sales totals by country and product_category; streaming inserts arrive 24/7.
- Propose: partition columns, optional sort/bucket strategy, and target file size.
- Explain how your choices reduce small files and speed up queries.
Exercise 2 — Schema evolution playbook
Current schema: order_id, order_ts, user_id, country_code, total_amount. You must add currency_code (default "USD"), and later rename total_amount to amount.
- Write a step-by-step plan that keeps reads consistent and minimizes rewrites.
- Include validation steps and rollback ideas.
Exercise 3 — Compaction and retention policy
Streaming writes produce many small files in a page_views table. Analytics peak is 08:00–20:00 UTC.
- Define: compaction schedule, thresholds (min file size/count), and target file size.
- Propose snapshot retention and vacuum settings that balance cost and auditability.
Need a hint?
- Avoid partitioning by high-cardinality columns like user_id.
- Compact during off-peak hours; pick a consistent file size target.
- Prefer additive schema changes; treat renames cautiously.
Common mistakes and self-check
- Mistake: Partitioning by high-cardinality keys causing millions of tiny partitions. Self-check: Count partitions per day; if it’s in the tens of thousands, rethink.
- Mistake: Skipping compaction for streaming jobs. Self-check: Is the median file size well below target? Are many files under 64 MB? If so, schedule compaction.
- Mistake: Overly long snapshot retention causing runaway storage growth. Self-check: Track snapshot count and referenced file growth weekly.
- Mistake: Unsafe renames. Self-check: Confirm your format’s rename semantics; if uncertain, add-new + backfill instead.
- Mistake: Ignoring delete performance. Self-check: Estimate the blast radius of deletes; test both copy-on-write and merge-on-read patterns.
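Several of these self-checks reduce to looking at the file-size distribution under a partition prefix. A hedged sketch assuming AWS S3 and boto3, with placeholder bucket and prefix values.

```python
# Hedged small-files self-check: inspect the size distribution of data files
# under one partition prefix. Assumes AWS S3 and boto3; bucket and prefix
# are placeholders.
import statistics

import boto3

s3 = boto3.client("s3")
sizes_mb = []
pages = s3.get_paginator("list_objects_v2").paginate(
    Bucket="my-lake",
    Prefix="warehouse/user_events/event_date=2024-06-01/",
)
for page in pages:
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            sizes_mb.append(obj["Size"] / 1024**2)

if sizes_mb:
    small = sum(size < 64 for size in sizes_mb)
    print(f"files={len(sizes_mb)} "
          f"median={statistics.median(sizes_mb):.0f} MB "
          f"under_64MB={small}")
    # A median far below the target size, or many files under 64 MB,
    # means compaction is overdue for this partition.
```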
Mini challenge
Design a table for mobile_app_events (~2B/day) queried mostly by 7–30 day windows and occasionally filtered by app_version and region. You must support user deletions within 7 days and reproducible weekly reports.
- Write your partition spec, file sizing target, update/delete strategy, compaction schedule, and snapshot retention plan.
- Explain trade-offs in 5 bullet points.
Who this is for
- Data Architects and Senior Data Engineers designing storage layers.
- Analytics Engineers who need reliable tables on data lakes.
Prerequisites
- Comfort with columnar file formats (e.g., Parquet) and object storage concepts.
- Basic SQL and data partitioning fundamentals.
- Familiarity with batch and streaming ingestion patterns.
Learning path
- Foundations: files, partitions, and ACID snapshots.
- Design: partition transforms, sorting/bucketing, file sizing.
- Mutations: upserts, deletes, copy-on-write vs merge-on-read.
- Governance: schema evolution, constraints, time travel.
- Operations: compaction, snapshot expiry, vacuum.
Practical projects
- Build a clickstream lakehouse table with streaming ingestion; measure query latency before/after compaction.
- Implement a GDPR deletion pipeline; validate with time-travel checks and monthly audits.
- Create a schema evolution demo: add/rename columns safely and document the migration.
Next steps
- Take the Quick Test to confirm understanding. The test is available to everyone; log in to save your progress.
- Apply the checklist to a real table in your environment and review results with your team.