Learn when to use a data lakehouse versus a data warehouse, how they complement each other, and how to decide under real-world constraints of cost, latency, governance, and team skill sets.
Who this is for
- Data Platform Engineers and Data Engineers deciding platform architecture.
- Analytics Engineers and BI Leads evaluating storage and serving patterns.
- Architects building unified platforms for BI, ML, and streaming.
Prerequisites
- Basic SQL and data modeling (facts/dimensions, star schema).
- Familiarity with batch vs streaming ingestion.
- Understanding of object storage, files (e.g., Parquet), and table partitions.
Why this matters
- Platform choices affect latency, reliability, and cost for years.
- Teams must serve diverse workloads: BI dashboards, ad-hoc analytics, ML features, real-time serving.
- Compliance and governance depend on how data is stored, versioned, and audited.
Real tasks you might face:
- Consolidate raw logs, event streams, and CRM data into one platform.
- Enable BI and ML from the same source of truth without duplicating pipelines.
- Reduce costs while keeping performance and governance strong.
Concept explained simply
Data warehouse: Optimized system for structured, governed analytics. You decide the schema upfront (schema-on-write), transform data, and load into curated tables for fast SQL and dashboards.
Data lakehouse: A data lake with a transactional table layer that brings warehouse-like reliability and performance to files in object storage. It supports both BI and ML, batch and streaming, using open file formats plus a table format that adds ACID, time travel, and governance.
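To make the table-format idea concrete, here is a minimal sketch, assuming PySpark with the delta-spark package installed; the paths and column names are illustrative. It shows ordinary Parquet files gaining ACID commits and schema enforcement once Delta Lake (one of several table formats) manages them.

```python
from pyspark.sql import SparkSession

# Delta-enabled Spark session (configuration keys from the Delta Lake docs).
spark = (
    SparkSession.builder.appName("lakehouse-concept")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.createDataFrame(
    [(1, "2024-05-01", 120.0), (2, "2024-05-01", 75.5)],
    ["order_id", "order_date", "amount"],
)

# The table is stored as ordinary Parquet files plus a _delta_log transaction
# log; the log is what provides ACID commits and time travel.
orders.write.format("delta").mode("overwrite").save("/tmp/lakehouse/orders")

# Schema enforcement: an append with an unexpected column is rejected, which
# is the "schema-on-write for curated tables" behavior described above.
bad_rows = spark.createDataFrame(
    [(3, "2024-05-02", 10.0, "EUR")],
    ["order_id", "order_date", "amount", "currency"],
)
try:
    bad_rows.write.format("delta").mode("append").save("/tmp/lakehouse/orders")
except Exception as err:
    print("Rejected by schema enforcement:", type(err).__name__)
```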
Mental model
- Warehouse = well-organized restaurant dining room: fixed menu (schema), fast service (optimized queries), strict operations.
- Lakehouse = restaurant plus pantry and test kitchen: open ingredients (raw files), a rulebook (table format) to cook safely (ACID), and service for diners (BI) and chefs experimenting (ML).
Key differences at a glance
- Storage: Warehouse uses proprietary or managed columnar storage. Lakehouse uses object storage with open file formats (e.g., Parquet) plus a table format (e.g., Delta/Iceberg/Hudi) for ACID.
- Schema: Warehouse is schema-on-write (strict upfront). Lakehouse leans schema-on-read for raw zones, with schema-on-write applied to curated tables.
- Workloads: Warehouse excels at BI/SQL. Lakehouse covers BI + ML/DS + streaming on one platform.
- Performance: Warehouse is tuned for SQL concurrency and low-latency analytics. Lakehouse performance depends on layout, file sizes, partitioning, caching, and table maintenance.
- Governance: Warehouse typically central and rigid. Lakehouse uses catalogs, table constraints, and table history; governance maturity depends on implementation.
- Cost: Lakehouse storage is usually cheaper and compute is elastic; a warehouse may cost more per TB but can be simpler to operate for pure BI. Actual costs vary by vendor and workload.
Architecture components
Lakehouse components
- Ingestion: batch and streaming (change data capture, events).
- Storage layers (Medallion): Bronze (raw), Silver (clean/conformed), Gold (business-ready/serving); see the sketch after this list.
- Table format: adds ACID, time travel, schema evolution, compaction.
- Catalog + governance: data dictionary, ownership, access policies, quality checks.
- Compute engines: SQL, batch processing, streaming processors, ML runtimes.
- Serving: BI connectors, feature store patterns, data products/APIs.
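As a concrete illustration of the Medallion layers listed above, here is a hedged Bronze to Silver to Gold flow in PySpark on Delta tables. The paths, column names, and cleaning rules are assumptions made up for the example; the session setup mirrors the earlier sketch.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("medallion-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

BASE = "/tmp/lakehouse"

# Bronze: land raw events as-is (append-only), stamped with ingestion time.
raw = spark.read.json(f"{BASE}/landing/events/")
(raw.withColumn("_ingested_at", F.current_timestamp())
    .write.format("delta").mode("append").save(f"{BASE}/bronze/events"))

# Silver: deduplicate, enforce basic quality, and conform types.
bronze = spark.read.format("delta").load(f"{BASE}/bronze/events")
silver = (
    bronze.dropDuplicates(["event_id"])
          .filter(F.col("user_id").isNotNull())
          .withColumn("event_ts", F.to_timestamp("event_ts"))
)
silver.write.format("delta").mode("overwrite").save(f"{BASE}/silver/events")

# Gold: a business-ready aggregate that BI dashboards and ML features can share.
gold = (
    silver.groupBy(F.to_date("event_ts").alias("event_date"), "user_id")
          .agg(F.count("*").alias("events"), F.sum("revenue").alias("revenue"))
)
gold.write.format("delta").mode("overwrite").save(f"{BASE}/gold/daily_user_activity")
```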
Warehouse components
- ETL/ELT to staging and curated layers.
- Dimensional models (facts/dimensions), semantic layer/metrics (see the sketch after this list).
- Strong query optimizer, concurrency controls, workload isolation.
- Integrated governance, lineage, and role-based access control.
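To show the dimensional-model and semantic-layer idea, here is a tiny, hypothetical star-schema slice expressed through Spark SQL; in a pure warehouse the same shape would live in the warehouse's own SQL dialect, but the modeling is identical. All table and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

fact_sales = spark.createDataFrame(
    [(1, 101, "2024-05-01", 2, 40.0), (2, 102, "2024-05-01", 1, 25.0)],
    ["sale_id", "product_key", "date_key", "quantity", "net_amount"],
)
dim_product = spark.createDataFrame(
    [(101, "Espresso Machine", "Kitchen"), (102, "Desk Lamp", "Home Office")],
    ["product_key", "product_name", "category"],
)
fact_sales.createOrReplaceTempView("fact_sales")
dim_product.createOrReplaceTempView("dim_product")

# One agreed definition of "net revenue" per category: the kind of metric a
# semantic layer pins down so every dashboard reports the same number.
spark.sql("""
    SELECT d.category,
           SUM(f.net_amount) AS net_revenue,
           SUM(f.quantity)   AS units_sold
    FROM fact_sales f
    JOIN dim_product d ON f.product_key = d.product_key
    GROUP BY d.category
""").show()
```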
Worked examples
Example 1: Sales BI dashboards
Need: Reliable daily dashboards with star schemas; strict KPIs; moderate data variety.
Choice: Warehouse-only, or lakehouse Gold tables with a semantic layer. If your organization is BI-centric and wants minimal moving parts, a warehouse is a strong fit. If you also need ML later, a lakehouse with well-modeled Gold tables works too.
Example 2: ML on clickstream + CRM
Need: Join semi-structured event logs with reference data; train and serve models; time-travel for experiments.
Choice: Lakehouse. Store logs in Bronze, clean to Silver, aggregate features to Gold; use table format for ACID and reproducibility.
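A minimal sketch of the reproducibility piece, assuming the features live in a Delta table and the session is Delta-enabled as in the earlier sketches; the path and version number are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("time-travel-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

features_path = "/tmp/lakehouse/gold/propensity_features"

# Today's view of the feature table.
current = spark.read.format("delta").load(features_path)

# The exact snapshot used for an earlier experiment, pinned by version number
# (a timestamp via .option("timestampAsOf", ...) works the same way).
training_snapshot = (
    spark.read.format("delta")
         .option("versionAsOf", 42)
         .load(features_path)
)

print(current.count(), training_snapshot.count())
```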
Example 3: Real-time IoT monitoring
Need: Stream ingestion, near-real-time transforms, anomaly detection, and BI rollups.
Choice: Lakehouse with streaming into Bronze, incremental upserts into Silver, and materialized metrics in Gold. Serve BI from Gold and ML features from Silver/Gold.
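A hedged sketch of the incremental-upsert step, assuming Delta tables, a Delta-enabled session, and that the Silver table already exists; the device schema, merge keys, and trigger interval are illustrative choices.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("iot-upsert-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

BASE = "/tmp/lakehouse"

# Read the Bronze Delta table as a stream (fed continuously by the ingest job).
bronze_stream = spark.readStream.format("delta").load(f"{BASE}/bronze/iot_readings")

def upsert_to_silver(batch_df, batch_id):
    """Merge each micro-batch into Silver so late or corrected readings update in place."""
    deduped = batch_df.dropDuplicates(["device_id", "reading_ts"])
    silver = DeltaTable.forPath(spark, f"{BASE}/silver/iot_readings")  # pre-created table
    (silver.alias("t")
           .merge(deduped.alias("s"),
                  "t.device_id = s.device_id AND t.reading_ts = s.reading_ts")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(bronze_stream.writeStream
              .foreachBatch(upsert_to_silver)
              .option("checkpointLocation", f"{BASE}/_checkpoints/silver_iot")
              .trigger(processingTime="1 minute")
              .start())
```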
Example 4: Finance regulatory reporting
Need: Strict schemas, audit trails, stable SLAs, complex SQL, slowly changing dimensions.
Choice: Warehouse, or a lakehouse with strong governance and curated Gold. If the team is warehouse-native and the workloads are SQL-only, a warehouse may be simpler. If you also require ML and data science, a lakehouse with tight controls is suitable.
Design decision framework
- Workloads: BI-only vs BI + ML + streaming?
- Data types: structured only vs mix of structured, semi-structured, unstructured?
- Latency: batch, micro-batch, or near-real-time?
- Governance: how strict are access, lineage, retention, and audit needs?
- Team skills: SQL-centric vs also fluent in data engineering and ML tooling?
- Cost posture: storage-heavy vs compute-heavy, elasticity, and scale expectations.
Quick checklist before deciding:
- I know the main workloads and their SLAs.
- I mapped data sources and their structure/volume/velocity.
- I documented governance (PII, retention, audit) requirements.
- I estimated storage and compute costs for 12–24 months.
- I planned for schema evolution and backfills.
Exercises
Draft your answer first, then compare it with the reference approach.
Exercise 1: Pick an architecture
Scenario: Marketing wants a 360° customer view combining CRM (structured), web events (JSON), and support tickets (text). They need weekly BI dashboards and quarterly ML propensity models. Choose warehouse, lakehouse, or hybrid and outline ingestion, storage layers, and serving.
Exercise 2: Medallion plan
Design Bronze/Silver/Gold for a streaming clickstream + daily product catalog feeds use case. Include partitioning strategy, compaction approach, and quality checks.
Common mistakes
- One-size-fits-all: forcing warehouse for ML-heavy needs, or lakehouse without governance for regulated BI.
- Skipping table maintenance in a lakehouse: small files, no compaction, poor partitioning (see the maintenance sketch after this list).
- Serving directly from raw/bronze: unstable schemas, hard-to-debug metrics.
- Ignoring a semantic layer: inconsistent KPIs across teams.
- Underestimating costs: excessive materializations or unbounded history.
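A minimal maintenance sketch for the small-files point above, assuming delta-spark 2.x or later and a Delta-enabled session; the table path, clustering column, and retention window are illustrative.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("table-maintenance-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = DeltaTable.forPath(spark, "/tmp/lakehouse/silver/events")

# Compact many small files into fewer, larger ones.
events.optimize().executeCompaction()

# Alternatively, compact and co-locate data by a frequently filtered column.
events.optimize().executeZOrderBy("user_id")

# Remove files no longer referenced by the table. Keep the window long enough
# for your time-travel and audit requirements before lowering it.
events.vacuum(retentionHours=7 * 24)
```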
Self-check:
- Do my serving tables have clear owners, SLAs, and documented metrics?
- Can I reproduce last month’s results with time travel/versioning?
- Are file sizes and partitions tuned to query patterns?
Practical projects
- Build a mini lakehouse: ingest JSON events and CSV reference data, implement Bronze/Silver/Gold, run a few BI queries, and train a simple model from the same Gold tables.
- Warehouse-style star schema inside a lakehouse: design a Sales star and compare query latency before and after optimizing file sizes and partitions.
- Governance drill: add tags for PII columns, set row/column-level rules, and validate that BI and ML both respect policies.
Mini challenge
Your leadership asks for: near-real-time product analytics (under 5 minutes), monthly churn model, and a CFO-ready weekly margin dashboard. Propose a minimal architecture that satisfies all three with the least duplication.
Sample approach:
- Ingest streams to Bronze; daily batch to Bronze for finance data.
- Incremental upserts to Silver with quality checks (sketched after this list); compact small files daily.
- Gold: materialize dashboard marts and model-ready features. Serve BI from Gold; train from Silver/Gold; use time travel for reproducibility.
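One possible shape for the quality-check gate mentioned above, assuming PySpark and a Delta Bronze table; the rules and the fail-fast behavior are illustrative choices rather than a prescription.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("quality-gate-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

batch = spark.read.format("delta").load("/tmp/lakehouse/bronze/events")

# Count violations of a few simple rules before promoting the batch to Silver.
checks = {
    "null_event_id": batch.filter(F.col("event_id").isNull()).count(),
    "negative_revenue": batch.filter(F.col("revenue") < 0).count(),
    "future_timestamp": batch.filter(F.col("event_ts") > F.current_timestamp()).count(),
}

failed = {name: count for name, count in checks.items() if count > 0}
if failed:
    # Block the Silver write (or route offending rows to a quarantine table).
    raise ValueError(f"Quality checks failed, blocking Silver write: {failed}")
```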
Learning path
- Review data modeling basics (facts, dimensions, slowly changing dimensions).
- Learn medallion architecture and table formats (ACID, schema evolution, time travel).
- Practice performance tuning: partitioning, Z-ordering/sorting, compaction.
- Add governance: catalog, lineage, access control, data quality checks.
Next steps
- Pick one of the practical projects and implement end-to-end.
- Introduce a semantic/metrics layer for consistent KPIs.
- Add monitoring for freshness, volume, and quality SLAs.