Why this matters
As a Data Architect, you choose integration patterns that balance freshness, cost, reliability, and complexity. Many business outcomes depend on this choice: real-time alerts, hourly dashboards, daily financials, or backfills after schema changes. Picking batch vs streaming (or micro-batch) determines tooling, SLAs, failure handling, and data contracts across teams.
Concept explained simply
Batch: collect a set of records and process them together on a schedule (e.g., hourly, nightly). Great for high throughput and lower cost. Freshness is the trade-off.
Streaming: process events continuously as they happen (record-by-record or tiny time slices). Great for low latency and reactive apps. Complexity and cost are the trade-off.
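To make the contrast concrete, here is a minimal plain-Python sketch (no framework assumed): the batch function sees the whole dataset at once on a schedule, while the streaming class updates state one event at a time. The event shape and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical events; in practice these come from files (batch) or a topic (streaming).
events = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

def batch_totals(all_events):
    """Batch: the complete dataset is available, so totals are computed in one scheduled pass."""
    totals = defaultdict(int)
    for e in all_events:
        totals[e["user"]] += e["amount"]
    return dict(totals)

class StreamingTotals:
    """Streaming: state is updated incrementally as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, e):
        self.totals[e["user"]] += e["amount"]
        return dict(self.totals)  # the current, always-provisional view

print(batch_totals(events))   # {'a': 17, 'b': 5}

stream = StreamingTotals()
for e in events:
    print(stream.on_event(e))  # the answer evolves with every event
```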
Mental model
- Batch is a truck: fill it up, drive it on a schedule. Efficient, not instant.
- Streaming is a conveyor belt: items move continuously. Fast, but you must handle jams and variability.
Architect’s triangle: of low latency, low cost, and low complexity, you can strongly optimize only two at a time. Choose deliberately based on business SLAs.
Key terms at a glance
- Event time vs processing time: when the event occurred vs when you processed it.
- Windows: tumbling (non-overlapping), sliding (overlapping), session (activity-based).
- Watermarks: a heuristic boundary for handling late data (see the sketch after this list).
- Delivery semantics: at-most-once, at-least-once, exactly-once (usually via idempotency + checkpoints).
- Micro-batch: tiny batches (e.g., 1–60 seconds) approximating streaming with simpler operational overhead.
- Backpressure: when downstream is slower than upstream; you must throttle or buffer.
- CDC (Change Data Capture): stream inserts, updates, and deletes from databases as events.
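The following pure-Python sketch ties several of these terms together: tumbling windows keyed on event time, a watermark derived from the highest event time seen so far, and a branch for data beyond allowed lateness. The window size, the 10-minute lateness policy, and the event timestamps are assumptions for illustration, not tied to any particular engine.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=1)             # tumbling window size (assumption)
ALLOWED_LATENESS = timedelta(minutes=10)  # lateness policy (assumption)

windows = defaultdict(int)  # window start -> event count
max_event_time = None       # highest event time seen so far

def window_start(ts):
    """Align an event to its tumbling window using event time, not processing time."""
    return ts - (ts - datetime(1970, 1, 1)) % WINDOW

def on_event(event_time):
    global max_event_time
    max_event_time = event_time if max_event_time is None else max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS  # heuristic completeness boundary
    if event_time < watermark:
        # Beyond allowed lateness: route to a correction path instead of the live window.
        print(f"late event {event_time}, watermark {watermark}")
        return
    windows[window_start(event_time)] += 1

# Events arrive out of order; the 10:00:30 event shows up after 10:02:10,
# and the 09:45:00 event is older than the watermark allows.
for h, m, s in [(10, 0, 5), (10, 1, 20), (10, 2, 10), (10, 0, 30), (9, 45, 0)]:
    on_event(datetime(2024, 1, 1, h, m, s))

print({k.strftime("%H:%M"): v for k, v in sorted(windows.items())})
# {'10:00': 2, '10:01': 1, '10:02': 1}
```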
Worked examples
Example 1 — Daily financial consolidation
Need: Accurate ledger by 7 AM next day. Source: ERP, payments, invoices.
- Pattern: Batch (nightly).
- Why: Accuracy over immediacy; controlled reprocessing; predictable cost.
- Notes: Include a backfill path and data quality checks before publish (sketched below).
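A minimal sketch of the "check before publish" gate, using only the standard library; the extract format, the specific checks, and the publish step are placeholder assumptions.

```python
import csv
import io

# Hypothetical nightly export (normally a file drop or warehouse extract, not a string).
RAW = """ledger_id,account,amount
1,4000,120.50
2,4010,
3,4000,-75.00
"""

def load(raw):
    return list(csv.DictReader(io.StringIO(raw)))

def quality_checks(rows):
    """Return a list of human-readable failures; an empty list means the batch may publish."""
    failures = []
    if not rows:
        failures.append("empty extract")
    missing = [r["ledger_id"] for r in rows if not r["amount"]]
    if missing:
        failures.append(f"missing amount for ledger_id(s): {missing}")
    return failures

def publish(rows):
    # In a real pipeline this writes the curated table; here we just print.
    print(f"published {len(rows)} rows")

rows = load(RAW)
problems = quality_checks(rows)
if problems:
    # Fail the nightly run loudly instead of publishing bad financials.
    print("blocked publish:", problems)
else:
    publish(rows)
```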
Example 2 — Fraud flag within seconds
Need: Flag suspicious transactions within 5 seconds.
- Pattern: Streaming (low-latency, stateful windows).
- Why: Real-time requirement; event-time rules; enrichment from reference data.
- Notes: At-least-once delivery + idempotent sink; dedup on transaction_id (see the sketch below).
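A sketch of the idempotent-sink idea under at-least-once delivery: writes are keyed by transaction_id, so a redelivered event overwrites itself instead of double-counting. The record shape is assumed for illustration.

```python
class IdempotentSink:
    """Under at-least-once delivery duplicates will arrive; keyed upserts keep replays safe."""

    def __init__(self):
        self.rows = {}  # transaction_id -> latest record (stand-in for a keyed table)

    def upsert(self, record):
        # A redelivered event overwrites itself, so downstream counts never double.
        self.rows[record["transaction_id"]] = record

sink = IdempotentSink()
for event in [
    {"transaction_id": "t1", "amount": 40, "flagged": True},
    {"transaction_id": "t2", "amount": 9, "flagged": False},
    {"transaction_id": "t1", "amount": 40, "flagged": True},  # retry delivers a duplicate
]:
    sink.upsert(event)

print(len(sink.rows))  # 2, not 3: the duplicate did not double-count
```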
Example 3 — Marketing attribution with both speed and accuracy
Need: Near-real-time attribution for monitoring and daily corrected numbers for reporting.
- Pattern: Hybrid (stream for provisional metrics; batch for corrected results).
- Why: Streaming gives quick visibility; batch corrects late/out-of-order events.
- Notes: Mark provisional vs final; reconcile using event time and watermarks (see the sketch below).
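One way to sketch the provisional-vs-final split: streaming output is labeled provisional, and the nightly batch overwrites whatever it has recomputed from the full event log. Keys, field names, and numbers are assumptions.

```python
# Provisional (streaming) attribution counts, keyed by (campaign, event_date).
provisional = {
    ("summer_sale", "2024-06-01"): {"conversions": 118, "status": "provisional"},
    ("summer_sale", "2024-06-02"): {"conversions": 95,  "status": "provisional"},
}

# Next-day batch recomputes from the complete event log, including late arrivals.
final = {
    ("summer_sale", "2024-06-01"): {"conversions": 124, "status": "final"},
}

def reconcile(provisional, final):
    """Final numbers replace provisional ones; anything not yet recomputed stays provisional."""
    merged = dict(provisional)
    merged.update(final)
    return merged

for key, row in sorted(reconcile(provisional, final).items()):
    print(key, row)
```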
Example 4 — IoT telemetry at scale
Need: Aggregate device metrics every minute; alert on spikes within 30 seconds.
- Pattern: Micro-batch (15–30s windows) plus alerting on sliding windows.
- Why: Balances cost and latency; windowing simplifies state management.
- Notes: Apply backpressure; buffer bursts; define a late data policy (alerting sketched below).
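A sliding-window spike detector in plain Python, as one possible shape for the alerting path; the 30-second window and the threshold are assumptions taken from the scenario.

```python
from collections import deque

class SpikeAlert:
    """Track readings inside a sliding time window and alert when the count spikes."""

    def __init__(self, window_seconds=30, threshold=5):
        self.window = window_seconds
        self.threshold = threshold
        self.readings = deque()  # (timestamp_seconds, value) pairs currently in the window

    def on_reading(self, ts, value):
        self.readings.append((ts, value))
        # Evict readings that have slid out of the window.
        while self.readings and self.readings[0][0] <= ts - self.window:
            self.readings.popleft()
        if len(self.readings) >= self.threshold:
            print(f"ALERT at t={ts}s: {len(self.readings)} readings in the last {self.window}s")

detector = SpikeAlert()
for t in [0, 5, 31, 33, 34, 35, 36]:  # a burst arrives between t=31 and t=36
    detector.on_reading(t, value=1.0)
```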
Architecture choices (when to use what)
Choose Batch when…
- SLAs are minutes to days and accuracy matters more than immediacy.
- Sources are bulk-friendly (files, snapshots, nightly exports).
- You need heavy transformations, dimensional modeling, and reprocessing simplicity.
Choose Streaming when…
- SLAs are seconds to a couple of minutes.
- Event-driven products: alerts, personalization, monitoring.
- CDC or clickstream where freshness is critical.
Choose Micro-batch when…
- You want near-real-time (e.g., 15–60s) with simpler operations than record-by-record processing.
- Downstream sinks prefer batches (warehouse loads) but with small delay.
- Cost/complexity of full streaming is not justified (a rough decision helper is sketched after these lists).
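The rules of thumb above can be condensed into a rough decision helper; the cut-offs below are illustrative assumptions, not hard limits.

```python
def choose_pattern(freshness_sla_seconds, event_driven=False):
    """Rough mapping from a freshness SLA to a pattern; cut-offs are illustrative, not hard rules."""
    if event_driven or freshness_sla_seconds < 15:
        return "streaming"    # seconds-level SLAs or reactive products (alerts, personalization)
    if freshness_sla_seconds <= 120:
        return "micro-batch"  # near-real-time with simpler operations
    return "batch"            # minutes to days: optimize for cost, accuracy, and reprocessing

print(choose_pattern(5, event_driven=True))  # streaming   (fraud flags)
print(choose_pattern(60))                    # micro-batch (near-real-time dashboards)
print(choose_pattern(24 * 3600))             # batch       (nightly consolidation)
```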
Late and out-of-order data
- Define allowable lateness (e.g., 10 minutes) and use watermarks.
- Provide a correction path (retractions, upserts, or batch reconciliation).
- Use event time in windows to avoid skew from processing delays.
Design steps (quick blueprint)
- Clarify SLA: freshness target (P95), accuracy rules, and allowable lateness.
- Profile data: peak/avg throughput, event size, skew, ordering guarantees.
- Contract schema: keys, event time, idempotency key, versioning, nullable fields (an example contract is sketched after these steps).
- Pick pattern: batch vs streaming vs micro-batch; justify with SLA and cost.
- Plan resilience: checkpoints, retries, dead-letter queue, replay strategy.
- Define outputs: sinks, partitioning, compaction/upserts, governance and lineage.
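A sketch of what a contracted event schema might look like as a typed record; the event type, field names, and versioning rule are assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class OrderEvent:
    """Illustrative data contract for one event type; names and fields are assumptions."""
    event_id: str            # idempotency key: unique per logical event, stable across retries
    order_id: str            # business key used for joins and upserts downstream
    event_time: datetime     # when it happened (used for windowing), not when it was processed
    schema_version: int      # bump on breaking changes; additive fields stay backward compatible
    coupon_code: Optional[str] = None  # nullable fields are explicit, never implied

evt = OrderEvent(
    event_id="e-123",
    order_id="o-987",
    event_time=datetime(2024, 6, 1, 12, 30, 0),
    schema_version=2,
)
print(evt)
```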
Who this is for
- Aspiring and practicing Data Architects defining integration patterns.
- Engineers deciding between batch and streaming for new pipelines.
- Analysts and product partners clarifying data freshness needs.
Prerequisites
- Basic data modeling (keys, partitions, schemas).
- ETL/ELT fundamentals and warehouse/lake concepts.
- Comfort with SLAs, metrics (P95/P99), and simple capacity math.
Learning path
- Start with batch fundamentals and scheduling.
- Learn streaming concepts: events, windows, watermarks, delivery semantics.
- Practice hybrid designs and late data handling.
- Add reliability: checkpoints, DLQs, replay, idempotent sinks.
Exercises
These mirror the graded exercises below. Work through them here, then record your final answers in the exercise section.
Exercise 1 — Map needs to Batch, Streaming, or Micro-batch
For each scenario, choose Batch, Streaming, or Micro-batch and write one-sentence justification.
- A. CFO monthly financial close by next morning.
- B. Payment fraud rule within 5 seconds.
- C. Product search suggestions updated every 15 minutes.
- D. Daily sales report by 7 AM with corrected late transactions.
- E. Ad-click counter for a live dashboard with sub-10-second freshness.
Sample solution
- A: Batch — accuracy over immediacy.
- B: Streaming — low-latency alerting; stateful rules.
- C: Micro-batch — small windows balance cost and freshness.
- D: Batch — reprocess for corrections and completeness.
- E: Streaming — near-real-time aggregation.
Exercise 2 — Throughput, partitions, and storage
Assume: average 5,000 events/sec; peak 20,000 events/sec; 1 KB/event; P95 end-to-end latency of 5 seconds; allowed lateness of 10 minutes.
- 1) Compute peak ingress MB/s.
- 2) Estimate raw storage per day and 30 days.
- 3) Propose topic partitions and consumer concurrency to meet SLA.
- 4) Choose micro-batch interval or true streaming and justify.
- 5) State dedup/idempotency strategy and checkpointing.
Sample solution
- 1) Peak ingress: ~20 MB/s (20,000 × 1 KB).
- 2) Daily: 5,000 × 86,400 × 1 KB ≈ 432 GB; 30 days ≈ 12.96 TB (raw, before compression).
- 3) Partitions: 24–48; start with 32 partitions and 8–16 consumers for headroom.
- 4) Micro-batch with ~1 s intervals or true streaming; either meets the 5-second P95 with checkpointing; micro-batch simplifies loads into the sink.
- 5) Idempotent upserts keyed by event_id; at-least-once delivery; enable checkpoints and DLQ for poison messages.
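A quick script to double-check the arithmetic in the sample solution above, assuming decimal units (1 KB = 1,000 bytes), which is what the sample figures use.

```python
EVENT_KB = 1
AVG_EPS, PEAK_EPS = 5_000, 20_000
SECONDS_PER_DAY = 86_400

peak_mb_per_s = PEAK_EPS * EVENT_KB / 1_000                   # KB/s  -> MB/s
daily_gb = AVG_EPS * SECONDS_PER_DAY * EVENT_KB / 1_000_000   # KB/day -> GB/day
monthly_tb = daily_gb * 30 / 1_000                            # GB -> TB over 30 days

print(f"peak ingress ~ {peak_mb_per_s:.0f} MB/s")   # ~20 MB/s
print(f"raw per day  ~ {daily_gb:.0f} GB")          # ~432 GB
print(f"raw per 30 d ~ {monthly_tb:.2f} TB")        # ~12.96 TB
```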
Self-check checklist
- I matched SLAs to the simplest pattern that meets them.
- I accounted for late/out-of-order events with watermarks and corrections.
- I planned idempotency and dedup at sinks.
- I sized partitions/concurrency for peak, not average.
- I included replay/DLQ and backfill paths.
Common mistakes (and how to catch them)
- Choosing streaming when batch is enough: ask for a measurable SLA in seconds/minutes.
- Confusing event time and processing time: always embed event_time; window on event time.
- No strategy for late data: define lateness, watermark, and correction process.
- Ignoring idempotency: require unique keys and upsert/merge semantics.
- Under-sizing for peak: design for P95/P99 bursts; test with load.
- Lack of replay: ensure checkpoints and a reliable replay from storage.
Practical projects
- Build a nightly batch load that publishes a curated table with data quality checks.
- Build a streaming aggregator that emits 10-second rolling counts and handles late data with a 5-minute watermark.
- Add a dead-letter queue and a replay script that reprocesses failed events safely (see the sketch after this list).
- Demonstrate schema evolution: add a nullable field, roll forward; then reprocess a day.
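For the dead-letter-queue project, a minimal in-memory sketch of the control flow; the handler logic, message format, and in-memory DLQ are placeholders for a real topic or table.

```python
import json

def handle(raw_message):
    """Placeholder business logic: parse and transform one message; raise on bad input."""
    event = json.loads(raw_message)
    return event["event_id"], event["value"] * 2

def consume(messages, handler, dlq):
    """Process a stream; failed messages land in the dead-letter queue with their error."""
    for raw in messages:
        try:
            handler(raw)
        except Exception as exc:
            dlq.append({"raw": raw, "error": repr(exc)})

def replay(dlq, handler):
    """Reprocess dead-lettered messages after a fix; return whatever still fails."""
    still_failing = []
    for entry in dlq:
        try:
            handler(entry["raw"])
        except Exception as exc:
            still_failing.append({"raw": entry["raw"], "error": repr(exc)})
    return still_failing

dlq = []
consume(['{"event_id": "e1", "value": 3}', 'not-json', '{"event_id": "e2"}'], handle, dlq)
print(len(dlq), "dead-lettered")                                # 2: bad JSON and a missing field
print(len(replay(dlq, handle)), "still failing after replay")   # 2 until the handler is fixed
```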
Mini challenge
You have a product metrics dashboard needing P95 freshness under 2 minutes, plus an audited monthly report. Propose a design: which parts are streaming, which are batch, how you’ll handle late data, and how you’ll reconcile provisional vs final numbers. Keep your answer to 5–7 sentences.
Next steps
- Study CDC patterns and how they integrate with streaming pipelines.
- Practice cost-aware design: batch where possible, streaming where necessary.
- Add observability: end-to-end latency, lag, watermark delay, and error rates.