Why this matters
As a Data Architect, you design how analytics data is stored, updated, and queried at scale. Lakehouse table formats (Apache Iceberg, Delta Lake, Apache Hudi, and similar) give data lakes ACID transactions, schema evolution, and time travel. Choosing the right layout, partitioning, and maintenance policies directly impacts query speed, cost, and data reliability.
- Real tasks you will face: define table layouts for petabyte-scale datasets; plan partitioning and compaction; enable upserts and deletes for privacy; keep historical snapshots manageable; make schema changes without downtime.
- Success looks like: consistent reads during heavy writes, predictable latencies, minimal small-file degeneration, and auditable change history.
Concept explained simply
A lakehouse table format is a set of rules and metadata that turns plain files (often Parquet) into a reliable table with transactions. Think of it as a layer that tracks which files make up the current table, what the schema is, and how data changes over time—so readers always see a consistent view.
Mental model
- Data files: columnar files on object storage (e.g., Parquet). Immutable once written.
- Table metadata: schema, partition spec, properties, and a list of referenced data files.
- Transaction/snapshot log: an ordered history of commits that add/remove files and change metadata. Readers pick a snapshot, writers atomically add a new snapshot.
Result: ACID semantics on a data lake. Readers never see partial writes; writers don’t corrupt each other’s work.
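To make the snapshot idea concrete, here is a toy Python sketch (not any real format's metadata layout) of why readers keep a consistent view while a writer commits: each snapshot is an immutable list of file references, and a commit only appends a new snapshot.

```python
# Toy sketch of the snapshot idea (not any real format's metadata layout).
# Each snapshot is an immutable list of data-file paths; a commit publishes
# a new snapshot, and readers keep using whichever snapshot they picked.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple[str, ...]  # immutable Parquet files on object storage


@dataclass
class TableMetadata:
    snapshots: list[Snapshot] = field(default_factory=list)

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, added: list[str], removed: set[str]) -> Snapshot:
        """Atomically publish a new snapshot that adds/removes files."""
        base = self.current().data_files if self.snapshots else ()
        files = tuple(f for f in base if f not in removed) + tuple(added)
        snap = Snapshot(snapshot_id=len(self.snapshots), data_files=files)
        self.snapshots.append(snap)  # readers on older snapshots are unaffected
        return snap


table = TableMetadata()
table.commit(added=["s3://lake/events/file-0.parquet"], removed=set())
reader_view = table.current()                 # a reader pins snapshot 0
table.commit(added=["s3://lake/events/file-1.parquet"], removed=set())
assert reader_view.data_files == ("s3://lake/events/file-0.parquet",)
```

Real formats layer manifests, column statistics, and optimistic-concurrency checks on top of this, but the reader/writer isolation works the same way.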
Core building blocks
- ACID transactions: atomic commits that create new snapshots.
- Schemas with evolution: add columns, sometimes rename or reorder (capabilities vary by format). Prefer additive, backward-compatible changes.
- Partitioning: groups data files by low-cardinality columns (e.g., event_date) so queries can prune files they don't need. Avoid high-cardinality partition keys (e.g., user_id).
- Hidden partitioning/partition transforms: derive partition values from a column (day/month/year of a timestamp, bucket/hash of an id) so readers filter on the original column and high-cardinality keys don't create skew.
- Deletes/Updates:
- File-level rewrites (copy-on-write): rewrite files with changes applied.
- Row-level with delete/position files or logs (merge-on-read): track row deletions/updates separately, compact later.
- Time travel & snapshots: query as of a timestamp/snapshot id.
- Maintenance: optimize/compact small files into larger ones; expire old snapshots; vacuum orphaned files.
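These maintenance operations map to engine commands. A hedged sketch in Spark SQL run from Python, assuming a SparkSession already configured for your table format and a table named events; the Delta-style OPTIMIZE/VACUUM statements are one possible spelling, and Iceberg-style procedure calls are noted in comments, so check your format's documentation for the exact syntax.

```python
# Hedged maintenance sketch in Spark SQL. Assumes a SparkSession configured
# for your table format and a table named `events`; exact commands differ
# per format (Delta-style shown, Iceberg-style noted below).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files into larger ones (Delta-style syntax).
spark.sql("OPTIMIZE events")

# Drop data files no longer referenced by any retained snapshot (Delta-style);
# 168 hours keeps roughly 7 days of time travel.
spark.sql("VACUUM events RETAIN 168 HOURS")

# Iceberg-style equivalents are exposed as stored procedures, e.g.:
# spark.sql("CALL catalog.system.rewrite_data_files(table => 'db.events')")
# spark.sql("CALL catalog.system.expire_snapshots(table => 'db.events')")
```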
Tip: Picking file sizes
Aim for 128–1024 MB per file, depending on the query engine and workload. Files that are too small add metadata and task-scheduling overhead; files that are too large reduce parallelism and skew task runtimes. Keep the target consistent within a table; a quick sizing calculation is sketched below.
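For instance, a back-of-the-envelope check of how many files one daily partition produces at a given target size; the row count and compressed row size here are assumptions, not measurements.

```python
# Back-of-the-envelope file-count check for one daily partition. The row
# count and compressed row size below are assumptions, not measurements.
rows_per_day = 5_000_000_000            # e.g. the clickstream table in Example 1
bytes_per_row_compressed = 60           # assumed average after Parquet encoding
target_file_mb = 512

partition_bytes = rows_per_day * bytes_per_row_compressed
files_per_partition = partition_bytes / (target_file_mb * 1024**2)
print(f"~{partition_bytes / 1024**3:.0f} GiB/day -> ~{files_per_partition:.0f} files")
# ~279 GiB/day -> ~559 files of ~512 MB: a manageable count. Thousands of
# tiny files per day instead would signal missing compaction.
```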
Worked examples
Example 1: Clickstream table layout
Scenario: user_events with ~5B daily rows, queries mostly by date ranges and occasionally by country or app_version.
- Partition: event_date (day). Optional secondary organization: sort within files by country, then user_id to improve locality.
- Schema: start minimal (event_time, user_id, country, app_version, event_type, payload). Make payload a struct with additive evolution in mind.
- Compaction: target ~512 MB files; compact hourly so the small files produced by streaming writes don't accumulate.
- Snapshot retention: keep 7–14 days of snapshots for time travel; longer retention for compliance if required.
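A hedged sketch of this layout as Spark SQL DDL run from Python. The Iceberg-style spelling is shown (days() transform, the write.target-file-size-bytes property, and WRITE ORDERED BY, which needs the format's SQL extensions); a Delta-like table would instead partition on an explicit event_date column. The payload struct fields are placeholders.

```python
# Hedged DDL sketch for the clickstream layout (Iceberg-style spelling;
# adjust for your format). Payload struct fields are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE analytics.user_events (
        event_time   TIMESTAMP,
        user_id      BIGINT,
        country      STRING,
        app_version  STRING,
        event_type   STRING,
        payload      STRUCT<name: STRING, value: STRING>
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
    TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')  -- ~512 MB
""")

# Cluster rows within files by country, then user_id, to improve locality
# (requires the format's SQL extensions).
spark.sql("ALTER TABLE analytics.user_events WRITE ORDERED BY country, user_id")
```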
Example 2: Safe schema evolution
Scenario: Add marketing_channel (nullable) and rename column country to country_code.
- Add column: additive and nullable is usually safe; backfill later if needed.
- Rename: support varies by format. If safe renames are not guaranteed, deprecate the old column and add a new one (country_code), write to both during the transition, then clean up.
- Enforce invariants with constraints where available (e.g., not null after backfill).
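A hedged sketch of that playbook in Spark SQL; exact ALTER TABLE spelling varies by format (some engines use ADD COLUMNS (...)), and the analytics.user_events table name is carried over from Example 1.

```python
# Hedged schema-evolution sketch; exact ALTER TABLE syntax varies by format.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: additive, nullable column -- generally safe.
spark.sql("ALTER TABLE analytics.user_events ADD COLUMN marketing_channel STRING")

# Step 2: instead of renaming in place, add the target column, dual-write
# from producers during the transition, then backfill existing rows.
spark.sql("ALTER TABLE analytics.user_events ADD COLUMN country_code STRING")
spark.sql("""
    UPDATE analytics.user_events
    SET country_code = country
    WHERE country_code IS NULL
""")  # the backfill rewrites data, so schedule it deliberately

# Step 3 (later, once all readers use country_code):
# spark.sql("ALTER TABLE analytics.user_events DROP COLUMN country")
```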
Example 3: Upserts for GDPR deletions
Scenario: You receive delete requests for specific user_ids.
- Choose a row-level deletion capability: either maintain delete files (merge-on-read) or rewrite affected data files (copy-on-write).
- For high-volume deletes, prefer strategies that avoid rewriting huge partitions; schedule compaction to reconcile delete files into base data files during low-traffic windows.
- Validate by time-travel queries: confirm the latest snapshot excludes deleted rows.
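A hedged sketch of the delete-and-validate flow. MERGE ... WHEN MATCHED THEN DELETE is widely supported through Spark SQL; compliance.deletion_requests is an assumed staging table holding the user_ids to erase.

```python
# Hedged sketch: apply row-level deletes via MERGE, then validate that the
# latest snapshot no longer exposes the requested users.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Apply the deletions (the format decides copy-on-write vs merge-on-read,
# often via table properties).
spark.sql("""
    MERGE INTO analytics.user_events t
    USING compliance.deletion_requests d
    ON t.user_id = d.user_id
    WHEN MATCHED THEN DELETE
""")

# Validate against the current snapshot.
remaining = spark.sql("""
    SELECT count(*) AS n
    FROM analytics.user_events
    WHERE user_id IN (SELECT user_id FROM compliance.deletion_requests)
""").first()["n"]
assert remaining == 0, f"{remaining} rows still visible after the delete"
```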
Example 4: Reproducible reporting via time travel
Scenario: Finance needs last month’s report rerun exactly as of close.
- Record the snapshot id or cutoff timestamp at report time.
- Query engines typically support syntax like “AS OF TIMESTAMP/SNAPSHOT”. Use the stored reference to reproduce results.
- Retain snapshots long enough to cover audit windows; expire older ones to control storage cost.
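A hedged sketch of rerunning a report from a pinned snapshot. The finance.orders table, its columns, and the snapshot id are placeholders, and the time-travel clause differs by format (Delta-style VERSION AS OF / TIMESTAMP AS OF shown; Iceberg-style is FOR VERSION AS OF / FOR TIMESTAMP AS OF).

```python
# Hedged time-travel sketch: rerun last month's report against the snapshot
# recorded at close. Table, columns, and snapshot id are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

close_snapshot_id = 1742  # recorded when the original report was produced

report = spark.sql(f"""
    SELECT country_code, sum(total_amount) AS revenue
    FROM finance.orders VERSION AS OF {close_snapshot_id}
    GROUP BY country_code
""")
report.show()
```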
Design decisions checklist
- Define primary query dimensions (time, geo, product) and align partitioning with them.
- Keep partition columns low-cardinality; use transforms (day/month, bucket) when needed.
- Target consistent file sizes; compact on a schedule.
- Choose update strategy: copy-on-write for simplicity; merge-on-read for heavy upserts/deletes.
- Plan snapshot retention and vacuum policies aligned with audit/compliance.
- Prefer additive schema changes; handle renames carefully.
- Document conventions (naming, partition spec, sort order, retention) in a table contract.
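One lightweight way to capture such a table contract is as plain data kept in version control next to the pipeline code. The field names and values below are illustrative, not a standard.

```python
# Illustrative "table contract" captured as plain data. Field names and
# values are examples, not a standard schema.
USER_EVENTS_CONTRACT = {
    "table": "analytics.user_events",
    "owner": "data-platform@example.com",
    "partition_spec": ["days(event_time)"],
    "sort_order": ["country", "user_id"],
    "target_file_size_mb": 512,
    "update_strategy": "merge-on-read",            # heavy streaming upserts
    "compaction": {"schedule": "hourly", "small_file_threshold_mb": 64},
    "snapshot_retention_days": 14,                 # covers report reruns + audits
    "schema_policy": "additive only; renames via add-new column + backfill",
}
```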
Exercises
Do these in a doc or comments. Focus on decisions and trade-offs, not tool-specific commands.
Exercise 1 — Partition and file layout for transactions
Dataset: ecommerce.transactions (~100M rows/day). Queries compute daily/weekly sales totals by country and product_category; streaming inserts arrive 24/7.
- Propose: partition columns, optional sort/bucket strategy, and target file size.
- Explain how your choices reduce small files and speed up queries.
Exercise 2 — Schema evolution playbook
Current schema: order_id, order_ts, user_id, country_code, total_amount. You must add currency_code (default "USD"), and later rename total_amount to amount.
- Write a step-by-step plan that keeps reads consistent and minimizes rewrites.
- Include validation steps and rollback ideas.
Exercise 3 — Compaction and retention policy
Streaming writes produce many small files in a page_views table. Analytics peak is 08:00–20:00 UTC.
- Define: compaction schedule, thresholds (min file size/count), and target file size.
- Propose snapshot retention and vacuum settings that balance cost and auditability.
Need a hint?
- Avoid partitioning by high-cardinality columns like user_id.
- Compact during off-peak hours; pick a consistent file size target.
- Prefer additive schema changes; treat renames cautiously.
Common mistakes and self-check
- Mistake: Partitioning by high-cardinality keys causing millions of tiny partitions. Self-check: Count partitions per day; if it’s in the tens of thousands, rethink.
- Mistake: Skipping compaction for streaming jobs. Self-check: Is the median file size well below target? Are many files under 64 MB? If so, schedule compaction.
- Mistake: Overly long snapshot retention causing runaway storage growth. Self-check: Track snapshot count and referenced file growth weekly.
- Mistake: Unsafe renames. Self-check: Confirm your format’s rename semantics; if uncertain, add-new + backfill instead.
- Mistake: Ignoring delete performance. Self-check: Estimate the blast radius of deletes; test both copy-on-write and merge-on-read patterns.
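Several of these self-checks reduce to looking at the file-size distribution under a partition prefix. A hedged sketch assuming AWS S3 and boto3, with placeholder bucket and prefix values.

```python
# Hedged small-files self-check: inspect the size distribution of data files
# under one partition prefix. Assumes AWS S3 and boto3; bucket and prefix
# are placeholders.
import statistics

import boto3

s3 = boto3.client("s3")
sizes_mb = []
pages = s3.get_paginator("list_objects_v2").paginate(
    Bucket="my-lake",
    Prefix="warehouse/user_events/event_date=2024-06-01/",
)
for page in pages:
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            sizes_mb.append(obj["Size"] / 1024**2)

if sizes_mb:
    small = sum(size < 64 for size in sizes_mb)
    print(f"files={len(sizes_mb)} "
          f"median={statistics.median(sizes_mb):.0f} MB "
          f"under_64MB={small}")
    # A median far below the target size, or many files under 64 MB,
    # means compaction is overdue for this partition.
```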
Mini challenge
Design a table for mobile_app_events (~2B/day) queried mostly by 7–30 day windows and occasionally filtered by app_version and region. You must support user deletions within 7 days and reproducible weekly reports.
- Write your partition spec, file sizing target, update/delete strategy, compaction schedule, and snapshot retention plan.
- Explain trade-offs in 5 bullet points.
Who this is for
- Data Architects and Senior Data Engineers designing storage layers.
- Analytics Engineers who need reliable tables on data lakes.
Prerequisites
- Comfort with columnar file formats (e.g., Parquet) and object storage concepts.
- Basic SQL and data partitioning fundamentals.
- Familiarity with batch and streaming ingestion patterns.
Learning path
- Foundations: files, partitions, and ACID snapshots.
- Design: partition transforms, sorting/bucketing, file sizing.
- Mutations: upserts, deletes, copy-on-write vs merge-on-read.
- Governance: schema evolution, constraints, time travel.
- Operations: compaction, snapshot expiry, vacuum.
Practical projects
- Build a clickstream lakehouse table with streaming ingestion; measure query latency before/after compaction.
- Implement a GDPR deletion pipeline; validate with time-travel checks and monthly audits.
- Create a schema evolution demo: add/rename columns safely and document the migration.
Next steps
- Take the Quick Test to confirm understanding. The test is available to everyone; log in to save your progress.
- Apply the checklist to a real table in your environment and review results with your team.