What you'll learn
- Design platform boundaries, service catalogs, and golden paths for data teams.
- Compare lakehouse and warehouse patterns and choose appropriately.
- Plan batch and streaming topologies that are reliable, cost-aware, and scalable.
- Apply multi-tenant isolation for security, performance, and cost controls.
- Estimate capacity, set SLAs/SLOs, and design for reliability.
- Make pragmatic build-versus-buy decisions and plan a platform roadmap.
What does a Data Platform Engineer own?
- Platform scope and service portfolio (ingest, storage, governance, compute, CI/CD, observability).
- Guardrails: standards, templates, and paved paths that accelerate teams safely.
- Non-functional goals: cost efficiency, reliability, security, compliance, and operability.
Who this is for
- Data Platform Engineers, Analytics Engineers, and Data Engineers moving into platform roles.
- Solutions/Cloud Architects supporting data initiatives.
Prerequisites
- Comfort with SQL and basic data modeling (dimensions, facts, partitions).
- Familiarity with cloud storage/compute concepts.
- Basic understanding of pipelines (e.g., Airflow, dbt, Spark, Kafka) is helpful but not required.
Why this skill matters for Data Platform Engineers
Platform architecture decisions determine developer speed, data quality, costs, and reliability. A solid architecture gives teams self-serve capabilities with clear guardrails, supports both batch and streaming, and keeps compliance manageable.
Learning path
1) Define platform scope and service catalog
Document the core services your platform will offer and where responsibility lines are drawn.
- Ingest: batch file loads, CDC, streaming connectors.
- Storage: raw/bronze, refined/silver, curated/gold zones.
- Compute: ETL/ELT engines, SQL, ML features.
- Governance: catalog, lineage, access control, PII handling.
- Observability: logs, metrics, traces, data quality.
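A lightweight way to make the catalog concrete is to keep it as structured data that can be rendered into docs and checked in CI. A minimal sketch follows; the service names, owners, and SLAs are illustrative assumptions, not a prescribed list.
# Minimal service catalog sketch (names, owners, and SLAs are illustrative)
from dataclasses import dataclass

@dataclass
class PlatformService:
    name: str        # e.g., "batch-ingest"
    area: str        # ingest | storage | compute | governance | observability
    owner: str       # owning team
    interface: str   # how teams consume it (template, API, UI)
    sla: str         # headline promise

CATALOG = [
    PlatformService("batch-ingest", "ingest", "data-platform", "Airflow template", "daily loads by 06:00 UTC"),
    PlatformService("cdc-streams", "ingest", "data-platform", "Kafka connectors", "lag < 5 min"),
    PlatformService("lake-storage", "storage", "data-platform", "bronze/silver/gold buckets", "lifecycle rules enforced"),
    PlatformService("sql-serving", "compute", "analytics-platform", "warehouse endpoint", "p95 query < 10 s"),
]

for svc in CATALOG:
    print(f"{svc.area:>13} | {svc.name:<14} | {svc.sla}")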
2) Choose core architecture patterns
Decide between a lakehouse, a warehouse, or a hybrid. Map the choice to your team's skills, latency needs, cost profile, and governance requirements.
3) Design batch and streaming flows
Set SLAs/SLOs, partitioning, and quality checks. Define contracts between producers and consumers.
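One way to pin down a producer/consumer contract is a small, versioned definition that both sides validate against. The sketch below is a hypothetical contract for the orders feed used in the worked examples; the fields, thresholds, and change policy are assumptions.
# Hypothetical data contract for the orders feed: schema, freshness, and change policy
ORDERS_CONTRACT_V1 = {
    "name": "ecommerce.orders",
    "version": 1,
    "fields": {
        "order_id": {"type": "string", "required": True, "unique": True},
        "customer_id": {"type": "string", "required": True},
        "order_ts": {"type": "timestamp", "required": True},
        "amount": {"type": "decimal(10,2)", "required": True, "min": 0},
        "country": {"type": "string", "required": False},
    },
    "slo": {"freshness_minutes": 5, "daily_batch_deadline_utc": "07:00"},
    "breaking_change_policy": "new major version + 30-day dual-publish",
}

def validate_record(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for one record."""
    errors = []
    for field, rules in contract["fields"].items():
        if rules.get("required") and record.get(field) is None:
            errors.append(f"missing required field: {field}")
    if record.get("amount") is not None and record["amount"] < 0:
        errors.append("amount must be >= 0")
    return errors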
4) Plan multi-tenant isolation
Separate data, compute, and budgets by domain/team. Enforce via IAM, namespaces, and quotas.
5) Estimate capacity and cost
Forecast storage and compute. Size partitions, topics, and autoscaling limits. Put guardrails on spend.
6) Reliability & SLAs
Define SLOs and error budgets. Build fallbacks, retries, idempotency, and runbooks.
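A concrete way to combine retries with idempotency is to wrap each task in bounded, exponentially backed-off retries and make the task itself safe to re-run. This is a minimal sketch; the function names and delays are hypothetical.
# Sketch: bounded retries with exponential backoff around an idempotent task (names are hypothetical)
import time

def run_with_retries(task, max_attempts=3, base_delay_s=2.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the on-call runbook
            sleep_s = base_delay_s * (2 ** (attempt - 1))
            print(f"attempt {attempt} failed ({exc}); retrying in {sleep_s:.0f}s")
            time.sleep(sleep_s)

def load_partition(run_date: str):
    # Idempotent by design: overwrite (or MERGE) the run_date partition,
    # so a retry after a partial failure cannot double-count rows.
    print(f"overwriting partition dt={run_date}")

run_with_retries(lambda: load_partition("2024-01-15"))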
7) Build vs buy & roadmap
Pick managed services where they add leverage. Sequence milestones into a realistic roadmap.
Core concepts
Lakehouse vs Warehouse (when to choose which)
- Lakehouse: open formats on object storage, decoupled compute, strong for diverse workloads and cost control.
- Warehouse: tightly integrated engine, strong for BI/SQL performance and governance simplicity.
- Hybrid: warehouse for BI; lakehouse for data science, streaming, and raw archive.
Batch vs Streaming
- Batch: predictable, simpler, cheap for large reprocessing; higher latency.
- Streaming: low-latency, reacts to events; higher operational complexity.
- Often both: stream for freshness, batch for backfills/corrections.
Multi-tenant isolation
- Data isolation: separate buckets/schemas; enforce policies.
- Compute isolation: per-tenant queues/clusters/pools.
- Cost isolation: budgets, quotas, and cost attribution (tags/labels).
Worked examples
Example 1: Lakehouse zones and external tables
Lay out object storage by zone and domain to support governance and lifecycle policies.
# Object storage layout
s3://data-lake/bronze/ecommerce/orders/...
s3://data-lake/silver/ecommerce/orders_clean/...
s3://data-lake/gold/ecommerce/orders_mart/...
-- External table on Parquet (warehouse/lakehouse engine)
CREATE EXTERNAL TABLE silver.orders_clean (
  order_id STRING,
  customer_id STRING,
  order_ts TIMESTAMP,
  amount DECIMAL(10,2),
  country STRING
)
STORED AS PARQUET
LOCATION 's3://data-lake/silver/ecommerce/orders_clean/';
-- Simple quality checks
SELECT COUNT(*) AS total, COUNT(DISTINCT order_id) AS distinct_orders FROM silver.orders_clean;
SELECT * FROM silver.orders_clean WHERE amount < 0 LIMIT 10; -- investigate negatives
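To make checks like these act as gates rather than ad-hoc queries, they can be wrapped in a small assertion step that fails the run before data is promoted. A minimal sketch, assuming an active SparkSession named spark; the table name and thresholds mirror the example above.
# Sketch: run the checks above as hard gates before promoting to gold (assumes a SparkSession `spark`)
def assert_quality(table: str = "silver.orders_clean") -> None:
    stats = spark.sql(
        f"SELECT COUNT(*) AS total, COUNT(DISTINCT order_id) AS distinct_orders FROM {table}"
    ).first()
    negatives = spark.sql(f"SELECT COUNT(*) AS n FROM {table} WHERE amount < 0").first()["n"]

    if stats["total"] == 0:
        raise ValueError(f"{table}: table is empty")
    if stats["total"] != stats["distinct_orders"]:
        raise ValueError(f"{table}: duplicate order_id detected")
    if negatives > 0:
        raise ValueError(f"{table}: {negatives} rows with negative amount")

assert_quality()  # call at the end of the silver load; a raised error stops promotion to gold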
Example 2: Streaming ingestion and partitioning
Design topics and partitions to match throughput and ordering needs. Keep keys stable.
# Kafka topic plan
Topic: ecommerce.orders.v1
Key: order_id
Partitions: 24 # based on peak throughput target
Retention: 7 days (raw), 30 days (DLQ)
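The partition count in a plan like this usually falls out of a simple throughput calculation plus the consumer parallelism you want to allow. The sketch below illustrates the arithmetic; the peak rate, event size, and per-partition throughput are assumptions.
# Rough partition sizing for the orders topic (throughput figures are assumptions)
import math

peak_events_per_sec = 5_000        # peak producer rate
avg_event_bytes = 300              # average serialized event size
per_partition_mb_per_sec = 5       # conservative per-partition write throughput
consumer_parallelism_target = 24   # max useful consumers = partition count

peak_mb_per_sec = peak_events_per_sec * avg_event_bytes / 1_000_000
partitions_for_throughput = math.ceil(peak_mb_per_sec / per_partition_mb_per_sec)
partitions = max(partitions_for_throughput, consumer_parallelism_target)
print(f"peak ≈ {peak_mb_per_sec:.1f} MB/s → at least {partitions_for_throughput} partition(s) for throughput; "
      f"plan {partitions} for consumer parallelism")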
# PySpark Structured Streaming (conceptual)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
spark = SparkSession.builder.getOrCreate()
schema = "order_id STRING, order_ts TIMESTAMP, amount DOUBLE, customer_id STRING, country STRING"
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers; required by the Kafka source
    .option("subscribe", "ecommerce.orders.v1")
    .load())
parsed = (raw.selectExpr("CAST(value AS STRING) AS v")
    .select(from_json(col("v"), schema).alias("data"))
    .select("data.*"))
# Append to silver with checkpointing for recovery; for key-based idempotent upserts,
# use foreachBatch with a MERGE by order_id (see the sketch after the tip below)
(parsed.writeStream
    .outputMode("append")
    .format("delta")
    .option("checkpointLocation", "s3://chk/orders/")
    .start("s3://data-lake/silver/ecommerce/orders_clean/"))
Tip: Ordering and exactly-once
Use a deterministic key, enable idempotent writes (MERGE by key), and store checkpoints for recovery.
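If the silver table is a Delta table (as the example above assumes), the key-based MERGE is typically done inside foreachBatch. A minimal sketch, assuming Delta Lake is installed and the paths match the layout above; the checkpoint path is an assumption.
# Sketch: idempotent upsert per micro-batch via MERGE (assumes Delta Lake is available)
from delta.tables import DeltaTable

SILVER_PATH = "s3://data-lake/silver/ecommerce/orders_clean/"

def upsert_orders(micro_batch_df, batch_id):
    # Deduplicate within the batch so each order_id appears once, then MERGE by key
    deduped = micro_batch_df.dropDuplicates(["order_id"])
    silver = DeltaTable.forPath(spark, SILVER_PATH)
    (silver.alias("t")
        .merge(deduped.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(parsed.writeStream
    .foreachBatch(upsert_orders)
    .option("checkpointLocation", "s3://chk/orders-merge/")
    .start())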
Example 3: Multi-tenant isolation policy
{
  "tenants": [
    {
      "name": "marketing",
      "data": {
        "bucket": "s3://dl-marketing",
        "policy": "deny cross-tenant read/write by default"
      },
      "compute": {
        "pool": "spark-pool-marketing",
        "max_concurrent_jobs": 20
      },
      "cost": {
        "budget_monthly_usd": 5000,
        "alerts": [0.5, 0.8, 1.0]
      }
    }
  ],
  "global": {
    "tagging_required": ["tenant", "env", "dataset"],
    "quotas": {"default_cpu": 64, "default_mem_gb": 256}
  }
}
Enforce via IAM roles, per-tenant resource groups/namespaces, and policy-as-code.
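Policy-as-code can start very small: a check, run in CI or by a scheduled job, that resources carry the required tags and stay within the tenant's quota. The sketch below mirrors the policy structure above; the resource shape and example values are assumptions.
# Sketch: validate resources against the tenant policy above (resource shape is an assumption)
REQUIRED_TAGS = ["tenant", "env", "dataset"]

def check_resource(resource: dict, tenant_policy: dict) -> list[str]:
    violations = []
    missing = [t for t in REQUIRED_TAGS if t not in resource.get("tags", {})]
    if missing:
        violations.append(f"missing tags: {missing}")
    max_jobs = tenant_policy["compute"]["max_concurrent_jobs"]
    if resource.get("concurrent_jobs", 0) > max_jobs:
        violations.append(f"concurrent jobs {resource['concurrent_jobs']} > quota {max_jobs}")
    return violations

marketing_policy = {"compute": {"max_concurrent_jobs": 20}}
job = {"tags": {"tenant": "marketing", "env": "prod"}, "concurrent_jobs": 25}
print(check_resource(job, marketing_policy))  # two violations: missing 'dataset' tag, over job quota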
Example 4: Quick capacity and cost estimate
Storage
Forecast: rows_per_day × avg_row_size. Add growth and retention.
# Example
orders: 20M rows/day × 300B ≈ 6 GB/day raw
Parquet 3x compression ≈ 2 GB/day silver
Retain 365 days → ~730 GB silver
Compute
Estimate daily processing as data volume × transform cost per GB, and size backfills the same way over the backfill window. Add headroom for spikes (20–30%).
# Example
Daily ETL: 2 GB × 4 CPU-hours/GB ≈ 8 CPU-hours/day
With 30% headroom → ~10.4 CPU-hours/day
Set budgets and autoscaling ceilings that reflect these estimates.
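The arithmetic above is easy to keep in a small script so estimates can be re-run when assumptions change. This sketch uses the same assumed rates, compression ratio, and headroom as the example.
# Back-of-envelope capacity estimate (same assumptions as the example above)
rows_per_day = 20_000_000
avg_row_bytes = 300
parquet_compression = 3.0          # raw-to-silver compression ratio (assumption)
retention_days = 365
cpu_hours_per_gb = 4.0             # transform cost (assumption)
headroom = 0.30

raw_gb_per_day = rows_per_day * avg_row_bytes / 1e9
silver_gb_per_day = raw_gb_per_day / parquet_compression
silver_gb_retained = silver_gb_per_day * retention_days
daily_cpu_hours = silver_gb_per_day * cpu_hours_per_gb * (1 + headroom)

print(f"raw ≈ {raw_gb_per_day:.0f} GB/day, silver ≈ {silver_gb_per_day:.0f} GB/day")
print(f"silver at {retention_days}d retention ≈ {silver_gb_retained:.0f} GB")
print(f"daily ETL ≈ {daily_cpu_hours:.1f} CPU-hours (incl. {headroom:.0%} headroom)")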
Example 5: SLAs, SLOs, and error budgets
# SLO (YAML-style)
service: orders_daily_pipeline
objectives:
  - name: delivery_timeliness
    target: 0.98   # 98% of daily runs finish by 07:00 UTC
    window: 28d
  - name: data_freshness
    target: 0.99   # 99% events <= 5 min behind
    window: 28d
error_budget_policy:
  burn_alerts:
    - 2h fast-burn > 10%
    - 24h slow-burn > 30%
  actions:
    - freeze feature rollout when budget < 20%
    - prioritize reliability fixes
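One way to operationalize the policy is a small calculation of how much error budget the rolling window has consumed. The run counts below are illustrative; note that with daily runs and a 98% target, less than one miss is allowed per 28-day window.
# Sketch: error budget consumed for the timeliness SLO over a 28-day window (counts are illustrative)
slo_target = 0.98            # 98% of daily runs finish by 07:00 UTC
window_runs = 28             # one run per day over the 28d window
bad_runs = 1                 # runs that missed the deadline (illustrative)

allowed_bad_runs = (1 - slo_target) * window_runs   # 0.56: less than one miss allowed per window
budget_consumed = bad_runs / allowed_bad_runs

print(f"allowed misses: {allowed_bad_runs:.2f}, observed: {bad_runs}")
print(f"error budget consumed: {budget_consumed:.0%}")
if budget_consumed > 0.80:   # i.e., less than 20% of budget remaining
    print("action: freeze feature rollout, prioritize reliability fixes")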
How to use
- Track SLO burn with metrics.
- Pause risky changes if budget is nearly exhausted.
- Update runbooks after incidents to prevent repeats.
Build vs Buy quick guide
- Buy when a capability is commodity, operationally heavy, or not a core differentiator.
- Build when tight customization, cost at scale, or IP/lock-in concerns dominate.
- Hybrid often wins: managed core + custom glue and guardrails.
Drills and exercises
- Draft a one-page service catalog for ingest, storage, compute, governance, and observability.
- Sketch a lakehouse zone layout for two domains (marketing, product).
- Define a topic plan (name, partitions, retention) for 5k events/sec peak.
- Create a tenant isolation policy with IAM roles and cost budgets.
- Estimate 6-month storage growth for two datasets; include compression and retention.
- Write two SLOs and an error budget policy for a daily pipeline.
- List three must-have data quality checks at bronze and silver.
- Decide build vs buy for catalog, orchestration, and streaming; justify each in two sentences.
- Specify backfill procedures for a 30-day correction in orders.
- Write a rollback plan for a schema change that fails in production.
Mini project: Minimal analytics platform
Goal: Stand up a small but realistic platform slice for one domain.
- Define scope: ingest orders and customers; batch daily + streaming for near-real-time metrics.
- Choose architecture: lakehouse on object storage; warehouse for BI serving.
- Lay out storage zones and create external tables.
- Ingest streaming orders with a topic plan; backfill last 7 days by batch.
- Implement idempotent upserts into silver with a key-based merge.
- Add three data quality checks and a quarantine path (DLQ).
- Define two SLOs and set alerts.
- Apply multi-tenant isolation (one more domain stub) with budgets and tags.
- Estimate monthly cost and set autoscaling ceilings.
- Write a short runbook: on-call steps, dashboards to check, common recovery actions.
Deliverables checklist
- Architecture diagram (boxes/flows) and assumptions.
- Storage layout + table DDLs.
- Topic/partition plan + retention docs.
- SLO/error budget and runbook.
- Cost estimate and limits.
Common mistakes and debugging tips
- Mistake: One-size-fits-all compute. Tip: isolate pools by workload class; apply concurrency limits.
- Mistake: No contract for schemas. Tip: use schema registry or versioned contracts; fail fast on breaking changes.
- Mistake: Over-partitioned data. Tip: target 128–1024 MB files; compact small files routinely.
- Mistake: Ignoring backfills. Tip: design idempotent jobs; store checkpoints; separate backfill queues.
- Mistake: Unbounded costs. Tip: budgets, tags, and automated alerts; enforce autoscaling ceilings.
- Mistake: SLAs without SLOs. Tip: measure what you promise; manage error budgets.
- Mistake: Shared secrets across tenants. Tip: per-tenant secrets and roles; rotate automatically.
- Mistake: No DLQ. Tip: route bad records; reprocess after fixes.
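A minimal DLQ pattern is to split records at parse/validation time and write failures to a quarantine path with the error attached. The sketch below uses checks consistent with the orders example; the validation rules and the quarantine path mentioned in the comment are assumptions.
# Sketch: route bad records to a DLQ during validation (rules and paths are assumptions)
import json

def split_good_and_bad(raw_records):
    good, dlq = [], []
    for raw in raw_records:
        try:
            rec = json.loads(raw)
            if rec.get("order_id") and rec.get("amount", 0) >= 0:
                good.append(rec)
            else:
                dlq.append({"record": raw, "error": "failed validation"})
        except json.JSONDecodeError as exc:
            dlq.append({"record": raw, "error": f"bad json: {exc}"})
    return good, dlq

good, dlq = split_good_and_bad(
    ['{"order_id": "1", "amount": 10}', '{"order_id": "2", "amount": -5}', 'not json'])
# good → 1 record; dlq → 2 records to write under a quarantine prefix and reprocess after fixes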
Practical projects
- Data mart in a day: build bronze→silver→gold for one dataset with tests and SLOs.
- Streaming KPI: compute near-real-time conversion rate with DLQ and late-event handling.
- Cost guardrails: implement tagging, budgets, and automated alerts; measure savings over 30 days.
Subskills
- Platform Scope And Service Catalog — Define platform boundaries and list the services you provide, with clear SLAs.
- Lakehouse Versus Warehouse Concepts — Compare trade-offs and pick the right serving layer for each workload.
- Batch And Streaming Platform Design — Combine daily batch with low-latency streaming reliably.
- Multi Tenant Isolation Concepts — Separate data, compute, and cost for teams/domains safely.
- Cost And Capacity Planning — Forecast storage/compute and set budgets/limits.
- Platform SLAs And Reliability Goals — Define SLOs, error budgets, and runbooks that actually guide work.
- Build Versus Buy Decisions — Choose managed, build in-house, or hybrid based on leverage and risk.
- Platform Roadmap Planning — Sequence milestones, de-risk dependencies, and communicate timelines.
Next steps
- Pick one practical project and complete it end-to-end.
- Review the subskills above to deepen specific areas.
- When you’re ready, take the skill exam to validate your understanding. Everyone can take it; only logged-in users have their progress saved.