What you'll learn
- Design platform boundaries, service catalogs, and golden paths for data teams.
- Compare lakehouse and warehouse patterns and choose appropriately.
- Plan batch and streaming topologies that are reliable, cost-aware, and scalable.
- Apply multi-tenant isolation for security, performance, and cost controls.
- Estimate capacity, set SLAs/SLOs, and design for reliability.
- Make pragmatic build-versus-buy decisions and plan a platform roadmap.
What does a Data Platform Engineer own?
- Platform scope and service portfolio (ingest, storage, governance, compute, CI/CD, observability).
- Guardrails: standards, templates, and paved paths that accelerate teams safely.
- Non-functional goals: cost efficiency, reliability, security, compliance, and operability.
Who this is for
- Data Platform Engineers, Analytics Engineers, and Data Engineers moving into platform roles.
- Solutions/Cloud Architects supporting data initiatives.
Prerequisites
- Comfort with SQL and basic data modeling (dimensions, facts, partitions).
- Familiarity with cloud storage/compute concepts.
- Basic understanding of pipelines (e.g., Airflow, dbt, Spark, Kafka) is helpful but not required.
Why this skill matters for Data Platform Engineers
Platform architecture decisions determine developer speed, data quality, costs, and reliability. A solid architecture gives teams self-serve capabilities with clear guardrails, supports both batch and streaming, and keeps compliance manageable.
Learning path
1) Define platform scope and service catalog
Document the core services your platform will offer and where responsibility lines are drawn.
- Ingest: batch file loads, CDC, streaming connectors.
- Storage: raw/bronze, refined/silver, curated/gold zones.
- Compute: ETL/ELT engines, SQL, ML features.
- Governance: catalog, lineage, access control, PII handling.
- Observability: logs, metrics, traces, data quality.
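A lightweight way to make the catalog concrete is to keep it as structured data that can be rendered into docs and checked in CI. A minimal sketch follows; the service names, owners, and SLAs are illustrative assumptions, not a prescribed list.
# Minimal service catalog sketch (names, owners, and SLAs are illustrative)
from dataclasses import dataclass

@dataclass
class PlatformService:
    name: str        # e.g., "batch-ingest"
    area: str        # ingest | storage | compute | governance | observability
    owner: str       # owning team
    interface: str   # how teams consume it (template, API, UI)
    sla: str         # headline promise

CATALOG = [
    PlatformService("batch-ingest", "ingest", "data-platform", "Airflow template", "daily loads by 06:00 UTC"),
    PlatformService("cdc-streams", "ingest", "data-platform", "Kafka connectors", "lag < 5 min"),
    PlatformService("lake-storage", "storage", "data-platform", "bronze/silver/gold buckets", "lifecycle rules enforced"),
    PlatformService("sql-serving", "compute", "analytics-platform", "warehouse endpoint", "p95 query < 10 s"),
]

for svc in CATALOG:
    print(f"{svc.area:>13} | {svc.name:<14} | {svc.sla}")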
2) Choose core architecture patterns
Decide between a lakehouse, a warehouse, or a hybrid. Map the choice to your team's skills, latency needs, cost profile, and governance requirements.
3) Design batch and streaming flows
Set SLAs/SLOs, partitioning, and quality checks. Define contracts between producers and consumers.
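One way to pin down a producer/consumer contract is a small, versioned definition that both sides validate against. The sketch below is a hypothetical contract for the orders feed used in the worked examples; the fields, thresholds, and change policy are assumptions.
# Hypothetical data contract for the orders feed: schema, freshness, and change policy
ORDERS_CONTRACT_V1 = {
    "name": "ecommerce.orders",
    "version": 1,
    "fields": {
        "order_id": {"type": "string", "required": True, "unique": True},
        "customer_id": {"type": "string", "required": True},
        "order_ts": {"type": "timestamp", "required": True},
        "amount": {"type": "decimal(10,2)", "required": True, "min": 0},
        "country": {"type": "string", "required": False},
    },
    "slo": {"freshness_minutes": 5, "daily_batch_deadline_utc": "07:00"},
    "breaking_change_policy": "new major version + 30-day dual-publish",
}

def validate_record(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for one record."""
    errors = []
    for field, rules in contract["fields"].items():
        if rules.get("required") and record.get(field) is None:
            errors.append(f"missing required field: {field}")
    if record.get("amount") is not None and record["amount"] < 0:
        errors.append("amount must be >= 0")
    return errors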
4) Plan multi-tenant isolation
Separate data, compute, and budgets by domain/team. Enforce via IAM, namespaces, and quotas.
5) Estimate capacity and cost
Forecast storage and compute. Size partitions, topics, and autoscaling limits. Put guardrails on spend.
6) Reliability & SLAs
Define SLOs and error budgets. Build fallbacks, retries, idempotency, and runbooks.
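A concrete way to combine retries with idempotency is to wrap each task in bounded, exponentially backed-off retries and make the task itself safe to re-run. This is a minimal sketch; the function names and delays are hypothetical.
# Sketch: bounded retries with exponential backoff around an idempotent task (names are hypothetical)
import time

def run_with_retries(task, max_attempts=3, base_delay_s=2.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the on-call runbook
            sleep_s = base_delay_s * (2 ** (attempt - 1))
            print(f"attempt {attempt} failed ({exc}); retrying in {sleep_s:.0f}s")
            time.sleep(sleep_s)

def load_partition(run_date: str):
    # Idempotent by design: overwrite (or MERGE) the run_date partition,
    # so a retry after a partial failure cannot double-count rows.
    print(f"overwriting partition dt={run_date}")

run_with_retries(lambda: load_partition("2024-01-15"))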
7) Build vs buy & roadmap
Pick managed services where they add leverage. Sequence milestones into a realistic roadmap.
Core concepts
Lakehouse vs Warehouse (when to choose which)
- Lakehouse: open formats on object storage, decoupled compute, strong for diverse workloads and cost control.
- Warehouse: tightly integrated engine, strong for BI/SQL performance and governance simplicity.
- Hybrid: warehouse for BI; lakehouse for data science, streaming, and raw archive.
Batch vs Streaming
- Batch: predictable, simpler, cheap for large reprocessing; higher latency.
- Streaming: low-latency, reacts to events; higher operational complexity.
- Often both: stream for freshness, batch for backfills/corrections.
Multi-tenant isolation
- Data isolation: separate buckets/schemas; enforce policies.
- Compute isolation: per-tenant queues/clusters/pools.
- Cost isolation: budgets, quotas, and cost attribution (tags/labels).
Worked examples
Example 1: Lakehouse zones and external tables
Lay out object storage by zone and domain to support governance and lifecycle policies.
# Object storage layout
s3://data-lake/bronze/ecommerce/orders/...
s3://data-lake/silver/ecommerce/orders_clean/...
s3://data-lake/gold/ecommerce/orders_mart/...
-- External table on Parquet (warehouse/lakehouse engine)
CREATE EXTERNAL TABLE silver.orders_clean (
  order_id STRING,
  customer_id STRING,
  order_ts TIMESTAMP,
  amount DECIMAL(10,2),
  country STRING
)
STORED AS PARQUET
LOCATION 's3://data-lake/silver/ecommerce/orders_clean/';
-- Simple quality checks
SELECT COUNT(*) AS total, COUNT(DISTINCT order_id) AS distinct_orders FROM silver.orders_clean;
SELECT * FROM silver.orders_clean WHERE amount < 0 LIMIT 10; -- investigate negatives
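To make checks like these act as gates rather than ad-hoc queries, they can be wrapped in a small assertion step that fails the run before data is promoted. A minimal sketch, assuming an active SparkSession named spark; the table name and thresholds mirror the example above.
# Sketch: run the checks above as hard gates before promoting to gold (assumes a SparkSession `spark`)
def assert_quality(table: str = "silver.orders_clean") -> None:
    stats = spark.sql(
        f"SELECT COUNT(*) AS total, COUNT(DISTINCT order_id) AS distinct_orders FROM {table}"
    ).first()
    negatives = spark.sql(f"SELECT COUNT(*) AS n FROM {table} WHERE amount < 0").first()["n"]

    if stats["total"] == 0:
        raise ValueError(f"{table}: table is empty")
    if stats["total"] != stats["distinct_orders"]:
        raise ValueError(f"{table}: duplicate order_id detected")
    if negatives > 0:
        raise ValueError(f"{table}: {negatives} rows with negative amount")

assert_quality()  # call at the end of the silver load; a raised error stops promotion to gold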
Example 2: Streaming ingestion and partitioning
Design topics and partitions to match throughput and ordering needs. Keep keys stable.
# Kafka topic plan
Topic: ecommerce.orders.v1
Key: order_id
Partitions: 24 # based on peak throughput target
Retention: 7 days (raw), 30 days (DLQ)
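The partition count in a plan like this usually falls out of a simple throughput calculation plus the consumer parallelism you want to allow. The sketch below illustrates the arithmetic; the peak rate, event size, and per-partition throughput are assumptions.
# Rough partition sizing for the orders topic (throughput figures are assumptions)
import math

peak_events_per_sec = 5_000        # peak producer rate
avg_event_bytes = 300              # average serialized event size
per_partition_mb_per_sec = 5       # conservative per-partition write throughput
consumer_parallelism_target = 24   # max useful consumers = partition count

peak_mb_per_sec = peak_events_per_sec * avg_event_bytes / 1_000_000
partitions_for_throughput = math.ceil(peak_mb_per_sec / per_partition_mb_per_sec)
partitions = max(partitions_for_throughput, consumer_parallelism_target)
print(f"peak ≈ {peak_mb_per_sec:.1f} MB/s → at least {partitions_for_throughput} partition(s) for throughput; "
      f"plan {partitions} for consumer parallelism")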
# PySpark Structured Streaming (conceptual)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
spark = SparkSession.builder.getOrCreate()
schema = "order_id STRING, order_ts TIMESTAMP, amount DOUBLE, customer_id STRING, country STRING"
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers; required by the Kafka source
    .option("subscribe", "ecommerce.orders.v1")
    .load())
parsed = (raw.selectExpr("CAST(value AS STRING) AS v")
    .select(from_json(col("v"), schema).alias("data"))
    .select("data.*"))
# Append to silver with checkpointing for recovery; for key-based idempotent upserts,
# use foreachBatch with a MERGE by order_id (see the sketch after the tip below)
(parsed.writeStream
    .outputMode("append")
    .format("delta")
    .option("checkpointLocation", "s3://chk/orders/")
    .start("s3://data-lake/silver/ecommerce/orders_clean/"))
Tip: Ordering and exactly-once
Use a deterministic key, enable idempotent writes (MERGE by key), and store checkpoints for recovery.
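If the silver table is a Delta table (as the example above assumes), the key-based MERGE is typically done inside foreachBatch. A minimal sketch, assuming Delta Lake is installed and the paths match the layout above; the checkpoint path is an assumption.
# Sketch: idempotent upsert per micro-batch via MERGE (assumes Delta Lake is available)
from delta.tables import DeltaTable

SILVER_PATH = "s3://data-lake/silver/ecommerce/orders_clean/"

def upsert_orders(micro_batch_df, batch_id):
    # Deduplicate within the batch so each order_id appears once, then MERGE by key
    deduped = micro_batch_df.dropDuplicates(["order_id"])
    silver = DeltaTable.forPath(spark, SILVER_PATH)
    (silver.alias("t")
        .merge(deduped.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(parsed.writeStream
    .foreachBatch(upsert_orders)
    .option("checkpointLocation", "s3://chk/orders-merge/")
    .start())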
Example 3: Multi-tenant isolation policy
{
  "tenants": [
    {
      "name": "marketing",
      "data": {
        "bucket": "s3://dl-marketing",
        "policy": "deny cross-tenant read/write by default"
      },
      "compute": {
        "pool": "spark-pool-marketing",
        "max_concurrent_jobs": 20
      },
      "cost": {
        "budget_monthly_usd": 5000,
        "alerts": [0.5, 0.8, 1.0]
      }
    }
  ],
  "global": {
    "tagging_required": ["tenant", "env", "dataset"],
    "quotas": {"default_cpu": 64, "default_mem_gb": 256}
  }
}
Enforce via IAM roles, per-tenant resource groups/namespaces, and policy-as-code.
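Policy-as-code can start very small: a check, run in CI or by a scheduled job, that resources carry the required tags and stay within the tenant's quota. The sketch below mirrors the policy structure above; the resource shape and example values are assumptions.
# Sketch: validate resources against the tenant policy above (resource shape is an assumption)
REQUIRED_TAGS = ["tenant", "env", "dataset"]

def check_resource(resource: dict, tenant_policy: dict) -> list[str]:
    violations = []
    missing = [t for t in REQUIRED_TAGS if t not in resource.get("tags", {})]
    if missing:
        violations.append(f"missing tags: {missing}")
    max_jobs = tenant_policy["compute"]["max_concurrent_jobs"]
    if resource.get("concurrent_jobs", 0) > max_jobs:
        violations.append(f"concurrent jobs {resource['concurrent_jobs']} > quota {max_jobs}")
    return violations

marketing_policy = {"compute": {"max_concurrent_jobs": 20}}
job = {"tags": {"tenant": "marketing", "env": "prod"}, "concurrent_jobs": 25}
print(check_resource(job, marketing_policy))  # two violations: missing 'dataset' tag, over job quota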
Example 4: Quick capacity and cost estimate
Storage
Forecast: rows_per_day × avg_row_size. Add growth and retention.
# Example
orders: 20M rows/day × 300B ≈ 6 GB/day raw
Parquet 3x compression ≈ 2 GB/day silver
Retain 365 days → ~730 GB silver
Compute
Estimate daily processing as data volume × transform cost per GB, and size backfills the same way over the backfill window. Add headroom for spikes (20–30%).
# Example
Daily ETL: 2 GB × 4 CPU-hours/GB ≈ 8 CPU-hours/day
With 30% headroom → ~10.4 CPU-hours/day
Set budgets and autoscaling ceilings that reflect these estimates.
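The arithmetic above is easy to keep in a small script so estimates can be re-run when assumptions change. This sketch uses the same assumed rates, compression ratio, and headroom as the example.
# Back-of-envelope capacity estimate (same assumptions as the example above)
rows_per_day = 20_000_000
avg_row_bytes = 300
parquet_compression = 3.0          # raw-to-silver compression ratio (assumption)
retention_days = 365
cpu_hours_per_gb = 4.0             # transform cost (assumption)
headroom = 0.30

raw_gb_per_day = rows_per_day * avg_row_bytes / 1e9
silver_gb_per_day = raw_gb_per_day / parquet_compression
silver_gb_retained = silver_gb_per_day * retention_days
daily_cpu_hours = silver_gb_per_day * cpu_hours_per_gb * (1 + headroom)

print(f"raw ≈ {raw_gb_per_day:.0f} GB/day, silver ≈ {silver_gb_per_day:.0f} GB/day")
print(f"silver at {retention_days}d retention ≈ {silver_gb_retained:.0f} GB")
print(f"daily ETL ≈ {daily_cpu_hours:.1f} CPU-hours (incl. {headroom:.0%} headroom)")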
Example 5: SLAs, SLOs, and error budgets
# SLO (YAML-style)
service: orders_daily_pipeline
objectives:
  - name: delivery_timeliness
    target: 0.98   # 98% of daily runs finish by 07:00 UTC
    window: 28d
  - name: data_freshness
    target: 0.99   # 99% events <= 5 min behind
    window: 28d
error_budget_policy:
  burn_alerts:
    - 2h fast-burn > 10%
    - 24h slow-burn > 30%
  actions:
    - freeze feature rollout when budget < 20%
    - prioritize reliability fixes
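One way to operationalize the policy is a small calculation of how much error budget the rolling window has consumed. The run counts below are illustrative; note that with daily runs and a 98% target, less than one miss is allowed per 28-day window.
# Sketch: error budget consumed for the timeliness SLO over a 28-day window (counts are illustrative)
slo_target = 0.98            # 98% of daily runs finish by 07:00 UTC
window_runs = 28             # one run per day over the 28d window
bad_runs = 1                 # runs that missed the deadline (illustrative)

allowed_bad_runs = (1 - slo_target) * window_runs   # 0.56: less than one miss allowed per window
budget_consumed = bad_runs / allowed_bad_runs

print(f"allowed misses: {allowed_bad_runs:.2f}, observed: {bad_runs}")
print(f"error budget consumed: {budget_consumed:.0%}")
if budget_consumed > 0.80:   # i.e., less than 20% of budget remaining
    print("action: freeze feature rollout, prioritize reliability fixes")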
How to use
- Track SLO burn with metrics.
- Pause risky changes if budget is nearly exhausted.
- Update runbooks after incidents to prevent repeats.
Build vs Buy quick guide
- Buy when a capability is commodity, operationally heavy, or not a core differentiator.
- Build when tight customization, cost at scale, or IP/lock-in concerns dominate.
- Hybrid often wins: managed core + custom glue and guardrails.
Drills and exercises
- Draft a one-page service catalog for ingest, storage, compute, governance, and observability.
- Sketch a lakehouse zone layout for two domains (marketing, product).
- Define a topic plan (name, partitions, retention) for 5k events/sec peak.
- Create a tenant isolation policy with IAM roles and cost budgets.
- Estimate 6-month storage growth for two datasets; include compression and retention.
- Write two SLOs and an error budget policy for a daily pipeline.
- List three must-have data quality checks at bronze and silver.
- Decide build vs buy for catalog, orchestration, and streaming; justify each in two sentences.
- Specify backfill procedures for a 30-day correction in orders.
- Write a rollback plan for a schema change that fails in production.
Mini project: Minimal analytics platform
Goal: Stand up a small but realistic platform slice for one domain.
- Define scope: ingest orders and customers; batch daily + streaming for near-real-time metrics.
- Choose architecture: lakehouse on object storage; warehouse for BI serving.
- Lay out storage zones and create external tables.
- Ingest streaming orders with a topic plan; backfill last 7 days by batch.
- Implement idempotent upserts into silver with a key-based merge.
- Add three data quality checks and a quarantine path (DLQ).
- Define two SLOs and set alerts.
- Apply multi-tenant isolation (one more domain stub) with budgets and tags.
- Estimate monthly cost and set autoscaling ceilings.
- Write a short runbook: on-call steps, dashboards to check, common recovery actions.
Deliverables checklist
- Architecture diagram (boxes/flows) and assumptions.
- Storage layout + table DDLs.
- Topic/partition plan + retention docs.
- SLO/error budget and runbook.
- Cost estimate and limits.
Common mistakes and debugging tips
- Mistake: One-size-fits-all compute. Tip: isolate pools by workload class; apply concurrency limits.
- Mistake: No contract for schemas. Tip: use schema registry or versioned contracts; fail fast on breaking changes.
- Mistake: Over-partitioned data. Tip: target 128–1024 MB files; compact small files routinely.
- Mistake: Ignoring backfills. Tip: design idempotent jobs; store checkpoints; separate backfill queues.
- Mistake: Unbounded costs. Tip: budgets, tags, and automated alerts; enforce autoscaling ceilings.
- Mistake: SLAs without SLOs. Tip: measure what you promise; manage error budgets.
- Mistake: Shared secrets across tenants. Tip: per-tenant secrets and roles; rotate automatically.
- Mistake: No DLQ. Tip: route bad records; reprocess after fixes.
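A minimal DLQ pattern is to split records at parse/validation time and write failures to a quarantine path with the error attached. The sketch below uses checks consistent with the orders example; the validation rules and the quarantine path mentioned in the comment are assumptions.
# Sketch: route bad records to a DLQ during validation (rules and paths are assumptions)
import json

def split_good_and_bad(raw_records):
    good, dlq = [], []
    for raw in raw_records:
        try:
            rec = json.loads(raw)
            if rec.get("order_id") and rec.get("amount", 0) >= 0:
                good.append(rec)
            else:
                dlq.append({"record": raw, "error": "failed validation"})
        except json.JSONDecodeError as exc:
            dlq.append({"record": raw, "error": f"bad json: {exc}"})
    return good, dlq

good, dlq = split_good_and_bad(
    ['{"order_id": "1", "amount": 10}', '{"order_id": "2", "amount": -5}', 'not json'])
# good → 1 record; dlq → 2 records to write under a quarantine prefix and reprocess after fixes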
Practical projects
- Data mart in a day: build bronze→silver→gold for one dataset with tests and SLOs.
- Streaming KPI: compute near-real-time conversion rate with DLQ and late-event handling.
- Cost guardrails: implement tagging, budgets, and automated alerts; measure savings over 30 days.
Subskills
- Platform Scope And Service Catalog — Define platform boundaries and list the services you provide, with clear SLAs.
- Lakehouse Versus Warehouse Concepts — Compare trade-offs and pick the right serving layer for each workload.
- Batch And Streaming Platform Design — Combine daily batch with low-latency streaming reliably.
- Multi Tenant Isolation Concepts — Separate data, compute, and cost for teams/domains safely.
- Cost And Capacity Planning — Forecast storage/compute and set budgets/limits.
- Platform SLAs And Reliability Goals — Define SLOs, error budgets, and runbooks that actually guide work.
- Build Versus Buy Decisions — Choose managed, build in-house, or hybrid based on leverage and risk.
- Platform Roadmap Planning — Sequence milestones, de-risk dependencies, and communicate timelines.
Next steps
- Pick one practical project and complete it end-to-end.
- Review the subskills above to deepen specific areas.
- When you’re ready, take the skill exam to validate your understanding. Everyone can take it; only logged-in users have their progress saved.