Why this matters
Data Platform Engineers are often asked to deliver both real-time insights and reliable historical analytics. You will decide how data flows, how fresh it should be, and how to keep costs and complexity under control. Typical tasks:
- Design a pipeline that powers a live dashboard (seconds) and daily finance reports (hours).
- Choose between micro-batch and true streaming for clickstream, IoT, or payments.
- Define SLAs for latency, completeness, and correctness; plan for late and duplicate events.
- Set up orchestration, retries, backfills, and schema evolution without breaking consumers.
Who this is for
- Data Platform Engineers and Data Engineers designing ingestion and processing layers.
- Analytics Engineers who need reliable upstream data contracts and freshness.
- Backend Engineers integrating event streams with operational services.
Prerequisites
- Comfort with SQL and at least one data processing engine (e.g., Spark or Flink).
- Basic knowledge of message queues/logs (e.g., Kafka-like systems) and object storage.
- Familiarity with orchestration concepts (e.g., DAGs, retries, backfills).
Concept explained simply
Batch means you collect data over a period and process it together (minutes to hours). Streaming means you process events continuously as they arrive (milliseconds to seconds). Many platforms use both: streaming for fresh signals and batch for full accuracy and cost efficiency.
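To make the contrast concrete, here is a minimal sketch of the same aggregation run both ways, assuming PySpark with the Kafka connector available; the bucket, broker, topic, and field names are placeholders, not part of the lesson's setup.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch: read everything that has already landed and process it together.
batch_counts = (
    spark.read.json("s3://my-bucket/events/date=2024-01-01/")  # placeholder path
    .groupBy("channel")
    .count()
)

# Streaming: the same aggregation, computed continuously as events arrive.
stream_counts = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                      # placeholder topic
    .load()
    .select(F.get_json_object(F.col("value").cast("string"), "$.channel").alias("channel"))
    .groupBy("channel")
    .count()
)

# Console sink as a stand-in; a real fast path would write to a serving store.
query = stream_counts.writeStream.outputMode("complete").format("console").start()
```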
Mental model
Imagine water flowing through pipes into reservoirs:
- Stream: the pipe delivers water continuously to a small tank for quick use.
- Batch: overnight, big reservoirs refill and are used to produce high-quality bottled water.
- Your job: design the pipes, tanks, and schedule so no one runs dry and quality stays high.
Core building blocks
- Sources: apps, databases, devices, third-party APIs.
- Transport: durable event log or queue with partitions and retention policies.
- Storage: object store (data lake), warehouse tables, key-value/state stores.
- Compute: batch engines (e.g., Spark), stream processors (e.g., Flink/Beam), SQL engines.
- Serving: dashboards, feature stores, APIs, reverse ETL.
- Control: orchestration, schema registry, catalog/lineage, observability.
- Governance: access control, PII handling, cost and quota guards.
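One lightweight way to keep these blocks visible in a design doc is to describe each pipeline's choices as plain data. This is only an illustration; the keys and values below are placeholders, not a standard.

```python
# Illustrative mapping of one pipeline onto the building blocks above; every value is a placeholder.
clickstream_pipeline = {
    "sources": ["web-app", "mobile-app"],
    "transport": {"topic": "events.clickstream", "partitions": 12, "retention": "3d"},
    "storage": {"lake": "s3://my-bucket/raw/clickstream/", "warehouse": "analytics.clicks"},
    "compute": {"stream": "per-minute aggregation", "batch": "nightly attribution"},
    "serving": {"dashboard": "low-latency store", "reports": "warehouse tables"},
    "control": {"orchestrator": "daily DAG", "schema_registry": True, "lineage": True},
    "governance": {"pii_fields": ["user_id"], "cost_alerts": True},
}
```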
Key design choices
- Freshness vs cost: lower latency usually increases complexity and spend.
- Delivery semantics: at-most-once, at-least-once, or effectively-once via idempotency and upserts.
- Windowing: tumbling, hopping, session windows; late data handling with watermarks (the window shapes are sketched after this list).
- State management: where to keep state (operator state, external store), and how to checkpoint.
- Scalability: partitions, consumer groups, autoscaling, and backpressure controls.
- Schema evolution: backward/forward compatibility and contracts to avoid breaking changes.
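As referenced in the windowing bullet, here is how the three window shapes look in PySpark Structured Streaming. It assumes a streaming DataFrame named events with event_time and user_id columns; the durations are illustrative.

```python
from pyspark.sql import functions as F

# Assumes `events` is a streaming DataFrame with event_time and user_id columns.
# The watermark tells the engine how long to wait for late data before finalizing windows.
events_wm = events.withWatermark("event_time", "5 minutes")

# Tumbling: fixed, non-overlapping 10-minute buckets.
tumbling = events_wm.groupBy(F.window("event_time", "10 minutes")).count()

# Hopping (sliding): 10-minute windows that start every 5 minutes, so they overlap.
hopping = events_wm.groupBy(F.window("event_time", "10 minutes", "5 minutes")).count()

# Session: a window per user that closes after 5 minutes of inactivity.
sessions = events_wm.groupBy(
    F.session_window("event_time", "5 minutes"), "user_id"
).count()
```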
Worked examples
Example 1: Clickstream → real-time KPIs + daily attribution
- Goal: 10-second dashboard for active users; accurate daily marketing attribution next morning.
- Design:
- Ingest events to a durable log with 3-day retention.
- Streaming job aggregates per minute and writes to a low-latency store for dashboards (sketched in code after this list).
- Nightly batch reads raw events from object storage, performs full attribution joins, and publishes curated tables.
- Late data: set watermark of 5 minutes for streaming; batch reprocesses the full day for accuracy.
- Contracts: define event schemas and allowed-lateness; add dedup keys.
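A minimal sketch of the streaming leg of this example (per-minute active users with the 5-minute watermark). It assumes PySpark with the Kafka connector; the broker, topic, schema, and the console sink standing in for the low-latency store are all placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-fast-path").getOrCreate()

schema = StructType().add("user_id", StringType()).add("event_time", TimestampType())

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "clickstream")                 # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Per-minute active users; the 5-minute watermark matches the late-data policy above.
active_users = (
    clicks.withWatermark("event_time", "5 minutes")
    .groupBy(F.window("event_time", "1 minute"))
    .agg(F.approx_count_distinct("user_id").alias("active_users"))
)

# Console sink as a stand-in for the low-latency dashboard store.
query = active_users.writeStream.outputMode("update").format("console").start()
```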
Example 2: Orders CDC → warehouse + fraud scoring
- Goal: Near-real-time order status for fraud and hourly finance snapshots.
- Design:
- Log-based CDC from OLTP emits inserts/updates/deletes to the event log.
- Stream processor maintains an upserted orders topic/table for fraud features.
- Hourly batch compacts changes into warehouse fact tables via MERGE/UPSERT (sketched in code after this list).
- Idempotency: use primary keys and exactly-once sinks or transactional upserts.
- Backfills: re-snapshot by table partitions; replay CDC with checkpoints.
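A hedged sketch of the hourly compaction step, assuming a table format that supports MERGE INTO from Spark SQL (for example Delta Lake or Apache Iceberg). The table names, column names, op codes, and hard-coded hour are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-hourly-compaction").getOrCreate()

# Keep only the latest change per order for the hour being processed.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW latest_changes AS
    SELECT order_id, status, amount, change_ts, op
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY change_ts DESC) AS rn
        FROM staging.orders_cdc
        WHERE change_ts >= '2024-01-01 10:00:00' AND change_ts < '2024-01-01 11:00:00'
    ) ranked
    WHERE rn = 1
""")

# Idempotent upsert: re-running the same hour converges to the same fact-table state.
spark.sql("""
    MERGE INTO warehouse.fact_orders AS t
    USING latest_changes AS s
    ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET
        status = s.status, amount = s.amount, updated_at = s.change_ts
    WHEN NOT MATCHED AND s.op != 'delete' THEN INSERT
        (order_id, status, amount, updated_at)
        VALUES (s.order_id, s.status, s.amount, s.change_ts)
""")
```

Because the MERGE is keyed on order_id and each hour's changes are deduplicated first, replays and backfills of the same hour do not create duplicates.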
Example 3: IoT telemetry → out-of-order and spikes
- Goal: 30-second health score per device; daily reliability report.
- Design:
- Event log with 7-day retention and sufficient partitions for bursty traffic.
- Stream job with event-time processing, a 2-minute watermark, 10 minutes of allowed lateness, and a dead-letter path for malformed events (see the sketch after this list).
- Daily batch recomputes reliability across the full day, correcting late arrivals.
- Scaling: consumer autoscaling based on lag; backpressure propagates to producers when needed.
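A sketch of the dead-letter split, assuming JSON telemetry on a Kafka topic; the broker, topic, schema, and storage paths are placeholders. Records that fail to parse (or lack a device_id) are written aside for inspection and replay, while well-formed events flow into the event-time aggregation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-health").getOrCreate()

schema = (
    StructType()
    .add("device_id", StringType())
    .add("event_time", TimestampType())
    .add("temperature", DoubleType())
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "telemetry")                   # placeholder topic
    .load()
    .select(F.col("value").cast("string").alias("raw"))
    .withColumn("parsed", F.from_json("raw", schema))
)

# Malformed or incomplete payloads (null after parsing) go to a dead-letter location.
dead_letter = raw.filter(F.col("parsed.device_id").isNull()).select("raw")
dlq_query = (
    dead_letter.writeStream.format("json")
    .option("path", "s3://my-bucket/dead-letter/telemetry/")          # placeholder path
    .option("checkpointLocation", "s3://my-bucket/checkpoints/dlq/")  # placeholder path
    .start()
)

# Well-formed events continue to the event-time aggregation described above.
good = raw.filter(F.col("parsed.device_id").isNotNull()).select("parsed.*")
health = (
    good.withWatermark("event_time", "2 minutes")
    .groupBy(F.window("event_time", "30 seconds"), "device_id")
    .agg(F.avg("temperature").alias("avg_temperature"))
)
health_query = health.writeStream.outputMode("update").format("console").start()
```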
Capacity quick math
A simple sizing approach (worked through in code after this list):
- Ingress: events_per_sec × avg_event_bytes = ingest_bandwidth. Example: 50,000 × 1 KB ≈ 50 MB/s.
- Storage for log: ingest_bandwidth × retention_seconds. Example: 50 MB/s × 7 days ≈ 30 TB (plus replication).
- Partitions: target 5–20 MB/s per partition; 50 MB/s → 3–10 partitions (choose higher for headroom).
- Consumers: total_capacity = threads × per_thread_rate; ensure partitions ≥ threads for parallelism.
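The same arithmetic as a runnable back-of-envelope script (decimal MB/TB for simplicity; the inputs mirror the example above):

```python
import math

# Back-of-envelope sizing; inputs mirror the example above.
events_per_sec = 50_000
avg_event_bytes = 1_000                 # ~1 KB per event
retention_days = 7

ingest_mb_per_sec = events_per_sec * avg_event_bytes / 1_000_000   # ~50 MB/s
retention_seconds = retention_days * 24 * 3600                     # 604,800 s
raw_log_tb = ingest_mb_per_sec * retention_seconds / 1_000_000     # ~30 TB before replication

# Partition count: target 5-20 MB/s per partition, then round up and add headroom.
min_partitions = math.ceil(ingest_mb_per_sec / 20)
max_partitions = math.ceil(ingest_mb_per_sec / 5)

print(f"ingest ~{ingest_mb_per_sec:.0f} MB/s, log ~{raw_log_tb:.1f} TB raw, "
      f"partitions {min_partitions}-{max_partitions} (choose higher for headroom)")
```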
How to design a batch + streaming platform
- Define SLAs and contracts. Capture max latency, completeness targets, and error budgets. Specify event schemas and dedup keys (a contract sketch follows this list).
- Segment use cases. Classify into real-time (seconds), near-real-time (minutes), and analytical (hours/days).
- Choose data paths. For each segment, pick streaming or batch (or both) and the serving store suited to query patterns.
- Plan for bad days. Retries, dead-letter queues, replays, and backfills must be first-class.
- Observe and control. Logging, metrics (lag, watermark, row counts), data quality checks, and cost guards.
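As mentioned in step 1, one lightweight way to make SLAs and contracts concrete is to keep them as data that checks and dashboards can read. The fields below are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetContract:
    """Illustrative contract/SLA record for one dataset; field names are assumptions."""
    name: str
    schema_version: str
    dedup_keys: list[str]
    max_latency_seconds: int    # freshness SLA for the serving table
    completeness_target: float  # e.g., 0.999 means 99.9% of expected rows by the deadline
    allowed_lateness: str       # how long late events are still merged
    owners: list[str] = field(default_factory=list)

orders_contract = DatasetContract(
    name="warehouse.fact_orders",
    schema_version="2.1",
    dedup_keys=["order_id"],
    max_latency_seconds=3600,
    completeness_target=0.999,
    allowed_lateness="10 minutes",
    owners=["data-platform@example.com"],
)

def freshness_ok(last_load_age_seconds: int, contract: DatasetContract) -> bool:
    """True if the serving table is within its freshness SLA."""
    return last_load_age_seconds <= contract.max_latency_seconds
```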
Exercises you can do now
Work through these here, then compare your results with the sample solutions.
Exercise 1 – Dual-path design
Design a platform for a ride-sharing app that needs driver ETA updates under 5 seconds and daily revenue reports at 06:00.
- List components for ingestion, processing, storage, and serving.
- Explain deduplication, late data handling, and backfills.
- Define SLAs and monitoring signals.
Sample solution:
- Streaming path: mobile events → event log (24–72h retention) → stream processor with event-time windows (watermark 1–2 min) → low-latency store/API for ETAs; dedup by trip_id + sequence.
- Batch path: raw events → object storage (partitioned by date/hour) → 04:00 DAG builds revenue facts with MERGE/UPSERT to handle late/duplicate events; reprocess the last 2 days daily.
- SLAs and monitoring: ETA P95 under 5 s; finance tables ready by 06:00 with ≥99.9% completeness; monitor consumer lag, watermark delay, record counts, and null checks.
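The "dedup by trip_id + sequence" step in that fast path could look like this in PySpark Structured Streaming; driver_events is an assumed streaming DataFrame with trip_id, sequence, and event_time columns, and the watermark bounds how much dedup state is kept.

```python
# Assumes `driver_events` is a streaming DataFrame with trip_id, sequence, and event_time columns.
deduped = (
    driver_events
    .withWatermark("event_time", "2 minutes")   # matches the 1-2 minute watermark in the design
    .dropDuplicates(["trip_id", "sequence"])    # retried/replayed events with the same keys are ignored
)
```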
Exercise 2 – Sizing and policy
Given 50k events/s at 1 KB each and 7 days log retention:
- Estimate storage needed (no replication) and propose a partition count.
- Define a late data policy for a 1-minute tumbling window metric.
Sample solution:
- Throughput ≈ 50 MB/s; storage ≈ 50 MB/s × 604,800 s ≈ 30 TB (raw, before replication).
- Partitions: target ≤5 MB/s per partition → about 10–12 partitions; round up to 16–24 for headroom.
- Late data: watermark at 2 minutes; allowed lateness 10 minutes; updates emit upserts to the serving table; events older than the allowed lateness are routed to dead-letter and corrected by batch backfill.
Checklist – A good design includes
- Clear SLAs (latency, completeness) and data contracts (schemas, dedup keys).
- Both fast path and accurate path when needed, with reconciliation.
- Windowing and lateness strategy with watermarks and upserts.
- Replay/backfill plan and dead-letter handling.
- Right stores for query patterns (OLAP for analytics, low-latency store for APIs).
- Observability: lag, watermark delay, row counts, DQ tests, cost tracking.
Common mistakes and self-check
- Only streaming, no reconciliation. Self-check: Do you have a periodic batch that corrects metrics? If not, add it.
- No dedup keys. Self-check: Can you idempotently upsert by a natural/business key? If not, add one.
- Ignoring late/out-of-order events. Self-check: Did you set watermarks and allowed lateness? Where do late events go?
- Too few partitions. Self-check: Can consumers scale horizontally? Partitions ≥ consumer threads.
- Unmanaged schema evolution. Self-check: Are new fields optional with defaults? Do you validate producers?
Practical projects
- Build a tiny pipeline that reads events, produces a 1-minute rolling metric in a streaming job, and runs a nightly batch that recomputes the day and reconciles differences (a reconciliation sketch follows this list).
- Add a dead-letter path and a replay command that reprocesses the last N hours.
- Introduce a schema change (add an optional field) and show that consumers keep working.
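For the first project's reconciliation step, a small batch job can compare the fast-path minute metrics against the nightly recompute and surface the differences; the table and column names below are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-reconciliation").getOrCreate()

# Fast-path output written by the streaming job vs. the nightly batch recompute.
fast = spark.table("metrics.minute_counts_stream").filter(F.col("date") == "2024-01-01")
accurate = spark.table("metrics.minute_counts_batch").filter(F.col("date") == "2024-01-01")

diff = (
    fast.alias("f")
    .join(accurate.alias("b"), ["date", "minute"], "full_outer")
    .select(
        "date",
        "minute",
        F.coalesce("f.events", F.lit(0)).alias("stream_events"),
        F.coalesce("b.events", F.lit(0)).alias("batch_events"),
    )
    .withColumn("delta", F.col("batch_events") - F.col("stream_events"))
    .filter(F.col("delta") != 0)
)

# Any nonzero delta is a minute where the batch path corrected the real-time number.
diff.show(truncate=False)
```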
Learning path
- Master event-time, watermarks, and windows.
- Practice idempotent upserts and deduplication patterns.
- Design dual-path (fast + accurate) pipelines.
- Add observability: lag, freshness, DQ tests; automate backfills.
- Optimize costs: compaction, tiered storage, right-sized partitions.
Mini challenge (15 minutes)
A campaign starts at 09:00. Marketing needs a real-time signups-per-channel chart (≤10s) and a final per-channel CPA by next day 08:00.
- Sketch your fast path and accurate path.
- Define watermarks and allowed lateness.
- Describe reconciliation: how will batch correct the real-time numbers?
Next steps
- Write runbooks: replay/backfill procedures, incident response for lag spikes.
- Add data quality checks at ingress and before serving.
- Extend the platform with feature serving for ML and cost alerts for hot partitions.