Batch And Streaming Integration Basics

Learn Batch and Streaming Integration Basics for free, with explanations, exercises, and a quick test (for Data Architects).

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

As a Data Architect, you choose integration patterns that balance freshness, cost, reliability, and complexity. Many business outcomes depend on this choice: real-time alerts, hourly dashboards, daily financials, or backfills after schema changes. Picking batch vs streaming (or micro-batch) determines tooling, SLAs, failure handling, and data contracts across teams.

Concept explained simply

Batch: collect a set of records and process them together on a schedule (e.g., hourly, nightly). Great for high throughput and lower cost. Freshness is the trade-off.

Streaming: process events continuously as they happen (record-by-record or tiny time slices). Great for low latency and reactive apps. Complexity and cost are the trade-off.

Mental model

  • Batch is a truck: fill it up, drive it on a schedule. Efficient, not instant.
  • Streaming is a conveyor belt: items move continuously. Fast, but you must handle jams and variability.

Architect’s triangle: of low latency, low cost, and low complexity, you can usually optimize only two at a time. Choose deliberately based on business SLAs.

Key terms at a glance

  • Event time vs processing time: when the event occurred vs when you processed it.
  • Windows: tumbling (non-overlapping), sliding (overlapping), session (activity-based).
  • Watermarks: a heuristic boundary for handling late data.
  • Delivery semantics: at-most-once, at-least-once, exactly-once (usually via idempotency + checkpoints).
  • Micro-batch: tiny batches (e.g., 1–60 seconds) approximating streaming with simpler operations.
  • Backpressure: when downstream is slower than upstream; you must throttle or buffer.
  • CDC (Change Data Capture): stream changes from databases.
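
To make event time, tumbling windows, and watermarks concrete, here is a minimal plain-Python sketch. The field names and the 10-minute lateness are illustrative assumptions, not any particular framework's API.

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=10)   # assumed policy; tune to the SLA

windows = {}                 # window start -> event count (1-minute tumbling windows)
max_event_time = datetime.min

def window_start(ts: datetime) -> datetime:
    """Align an event-time timestamp to the start of its 1-minute window."""
    return ts.replace(second=0, microsecond=0)

def process(event: dict) -> None:
    """Processing time is 'now'; all decisions below use the event's own time."""
    global max_event_time
    max_event_time = max(max_event_time, event["event_time"])
    watermark = max_event_time - ALLOWED_LATENESS   # heuristic completeness boundary
    if event["event_time"] < watermark:
        return  # too late for the live aggregate; route to a correction path instead
    ws = window_start(event["event_time"])
    windows[ws] = windows.get(ws, 0) + 1

process({"event_time": datetime(2026, 1, 18, 12, 0, 30)})
process({"event_time": datetime(2026, 1, 18, 12, 0, 45)})
print(windows)   # one 12:00 window with a count of 2
```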

Worked examples

Example 1 — Daily financial consolidation

Need: An accurate ledger by 7 AM the next day. Sources: ERP, payments, invoices.

  • Pattern: Batch (nightly).
  • Why: Accuracy over immediacy; controlled reprocessing; predictable cost.
  • Notes: Include backfill path and data quality checks before publish.
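
A minimal sketch of the "quality checks before publish" note, with hypothetical row contents and check rules; a real job would read the day's extracts from storage and write a curated table.

```python
from datetime import date

def load_ledger_rows(run_date: date) -> list[dict]:
    """Stand-in for reading the day's extracts (ERP, payments, invoices)."""
    return [
        {"entry_id": "e1", "amount": 120.0, "currency": "USD"},
        {"entry_id": "e2", "amount": -120.0, "currency": "USD"},
    ]

def quality_checks(rows: list[dict]) -> list[str]:
    """Return a list of failed checks; an empty list means 'safe to publish'."""
    failures = []
    if not rows:
        failures.append("no rows loaded")
    if any(r["amount"] is None for r in rows):
        failures.append("null amounts present")
    if len({r["entry_id"] for r in rows}) != len(rows):
        failures.append("duplicate entry_id values")
    return failures

def publish(rows: list[dict], run_date: date) -> None:
    print(f"published {len(rows)} rows for {run_date}")  # stand-in for a warehouse load

def nightly_job(run_date: date) -> None:
    rows = load_ledger_rows(run_date)
    failures = quality_checks(rows)
    if failures:
        raise RuntimeError(f"blocking publish for {run_date}: {failures}")
    publish(rows, run_date)

nightly_job(date(2026, 1, 18))
```
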

Example 2 — Fraud flag within seconds

Need: Flag suspicious transactions within 5 seconds.

  • Pattern: Streaming (low-latency, stateful windows).
  • Why: Real-time requirement; event-time rules; enrichment from reference data.
  • Notes: At-least-once + idempotent sink; dedup on transaction_id.
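
A minimal sketch of the idempotent-sink note: the handler can safely run more than once per event under at-least-once delivery because the write is an upsert keyed by transaction_id (the rule and field names are illustrative).

```python
flags = {}  # transaction_id -> flag record; stand-in for an upsert-capable store

def is_suspicious(txn: dict) -> bool:
    """Placeholder rule; real systems use stateful, event-time windows."""
    return txn["amount"] > 10_000

def handle(txn: dict) -> None:
    """Safe to call repeatedly for the same transaction (at-least-once delivery)."""
    if not is_suspicious(txn):
        return
    # Upsert keyed by transaction_id: a redelivery overwrites the same row
    # instead of creating a duplicate alert.
    flags[txn["transaction_id"]] = {"amount": txn["amount"], "status": "flagged"}

handle({"transaction_id": "t-42", "amount": 25_000})
handle({"transaction_id": "t-42", "amount": 25_000})  # redelivered; still one flag
print(len(flags))  # 1
```
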

Example 3 — Marketing attribution with both speed and accuracy

Need: Near-real-time attribution for monitoring and daily corrected numbers for reporting.

  • Pattern: Hybrid (stream for provisional metrics; batch for corrected results).
  • Why: Streaming gives quick visibility; batch corrects late/out-of-order events.
  • Notes: Mark provisional vs final; reconcile using event time and watermarks.
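
A minimal sketch of marking provisional vs final: the streaming path increments provisional counts, and the batch path overwrites them with corrected totals. The metric and key names are assumptions.

```python
attribution = {}  # (campaign, event_hour) -> {"clicks": int, "status": str}

def stream_update(campaign: str, event_hour: str, clicks: int) -> None:
    """Fast path: provisional counts that may miss late or out-of-order events."""
    row = attribution.setdefault((campaign, event_hour),
                                 {"clicks": 0, "status": "provisional"})
    row["clicks"] += clicks

def batch_correct(campaign: str, event_hour: str, clicks: int) -> None:
    """Slow path: recomputed from all events, replaces the provisional row."""
    attribution[(campaign, event_hour)] = {"clicks": clicks, "status": "final"}

stream_update("spring_sale", "2026-01-18T10:00", 90)
batch_correct("spring_sale", "2026-01-18T10:00", 97)  # late events now included
print(attribution[("spring_sale", "2026-01-18T10:00")])  # {'clicks': 97, 'status': 'final'}
```
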

Example 4 — IoT telemetry at scale

Need: Aggregate device metrics every minute; alert on spikes within 30 seconds.

  • Pattern: Micro-batch (15–30s windows) plus alerting on sliding windows.
  • Why: Balances cost and latency; windowing simplifies state management.
  • Notes: Apply backpressure; buffer bursts; define late data policy.
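
A minimal sketch of spike alerting on a sliding window, using an in-memory deque per device; the 30-second window and the threshold are placeholder values.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=30)   # assumed alerting window
THRESHOLD = 5                    # assumed spike threshold (readings per window)

recent = {}  # device_id -> deque of event timestamps inside the window

def on_reading(device_id: str, event_time: datetime) -> bool:
    """Return True when the device exceeds THRESHOLD readings in the last WINDOW."""
    q = recent.setdefault(device_id, deque())
    q.append(event_time)
    while q and event_time - q[0] > WINDOW:
        q.popleft()  # drop readings that have slid out of the window
    return len(q) >= THRESHOLD

t0 = datetime(2026, 1, 18, 12, 0, 0)
alert = False
for i in range(5):
    alert = on_reading("sensor-7", t0 + timedelta(seconds=i))
print(alert)  # True: 5 readings within 30 seconds
```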

Architecture choices (when to use what)

Choose Batch when…
  • SLAs are minutes to days and accuracy matters more than immediacy.
  • Sources are bulk-friendly (files, snapshots, nightly exports).
  • You need heavy transformations, dimensional modeling, and reprocessing simplicity.

Choose Streaming when…
  • SLAs are seconds to a couple of minutes.
  • Event-driven products: alerts, personalization, monitoring.
  • CDC or clickstream where freshness is critical.

Choose Micro-batch when…
  • You want near-real-time (e.g., 15–60s) with simpler operations than record-by-record.
  • Downstream sinks prefer batches (warehouse loads) but with small delay.
  • Cost/complexity of full streaming is not justified.

Late and out-of-order data
  • Define allowable lateness (e.g., 10 minutes) and use watermarks.
  • Provide a correction path (retractions, upserts, or batch reconciliation).
  • Use event time in windows to avoid skew from processing delays.
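
A minimal sketch of a late-data policy along these lines: events still inside the allowed lateness correct the live window in place, while anything past the watermark is parked for batch reconciliation instead of being dropped silently. Names are illustrative.

```python
from datetime import datetime, timedelta

live_windows = {}      # window start -> count, served to dashboards
reconcile_later = []   # events past the watermark, picked up by the batch path

def route(event: dict, watermark: datetime) -> None:
    """watermark = max event time seen so far minus the allowed lateness."""
    if event["event_time"] >= watermark:
        ws = event["window_start"]
        live_windows[ws] = live_windows.get(ws, 0) + 1   # upsert the live window
    else:
        reconcile_later.append(event)                    # correct it in the nightly run

wm = datetime(2026, 1, 18, 12, 0) - timedelta(minutes=10)
route({"event_time": datetime(2026, 1, 18, 11, 55),
       "window_start": datetime(2026, 1, 18, 11, 55)}, wm)
route({"event_time": datetime(2026, 1, 18, 11, 40),
       "window_start": datetime(2026, 1, 18, 11, 40)}, wm)
print(len(live_windows), len(reconcile_later))   # 1 1
```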

Design steps (quick blueprint)

  1. Clarify SLA: freshness target (P95), accuracy rules, and allowable lateness.
  2. Profile data: peak/avg throughput, event size, skew, ordering guarantees.
  3. Contract schema: keys, event time, idempotency key, versioning, nullable fields.
  4. Pick pattern: batch vs streaming vs micro-batch; justify with SLA and cost.
  5. Plan resilience: checkpoints, retries, dead-letter queue, replay strategy.
  6. Define outputs: sinks, partitioning, compaction/upserts, governance and lineage.
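
A minimal sketch of step 5's resilience ideas: bounded retries with backoff, then a dead-letter queue entry so a poison record never blocks the pipeline. The sink and the failure mode are stand-ins.

```python
import time

MAX_RETRIES = 3
dead_letter_queue = []  # stand-in for a real DLQ topic or table

def write_to_sink(record: dict) -> None:
    """Stand-in for the real sink; raises on failure."""
    if record.get("poison"):
        raise ValueError("malformed record")

def deliver(record: dict) -> None:
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            write_to_sink(record)
            return
        except Exception as exc:
            if attempt == MAX_RETRIES:
                # Don't block the pipeline: park the record with its error for replay.
                dead_letter_queue.append({"record": record, "error": str(exc)})
                return
            time.sleep(0.1 * 2 ** attempt)  # simple exponential backoff

deliver({"event_id": "e-1"})
deliver({"event_id": "e-2", "poison": True})
print(len(dead_letter_queue))  # 1
```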

Who this is for

  • Aspiring and practicing Data Architects defining integration patterns.
  • Engineers deciding between batch and streaming for new pipelines.
  • Analysts and product partners clarifying data freshness needs.

Prerequisites

  • Basic data modeling (keys, partitions, schemas).
  • ETL/ELT fundamentals and warehouse/lake concepts.
  • Comfort with SLAs, metrics (P95/P99), and simple capacity math.

Learning path

  • Start with batch fundamentals and scheduling.
  • Learn streaming concepts: events, windows, watermarks, delivery semantics.
  • Practice hybrid designs and late data handling.
  • Add reliability: checkpoints, DLQs, replay, idempotent sinks.

Exercises

These mirror the graded exercises below. Do them here, then record your final answers in the exercise section.

Exercise 1 — Map needs to Batch, Streaming, or Micro-batch

For each scenario, choose Batch, Streaming, or Micro-batch and write one-sentence justification.

  • A. CFO monthly financial close by next morning.
  • B. Payment fraud rule within 5 seconds.
  • C. Product search suggestions updated every 15 minutes.
  • D. Daily sales report by 7 AM with corrected late transactions.
  • E. Ad-click counter for a live dashboard with sub-10-second freshness.

Sample solution
  • A: Batch — accuracy over immediacy.
  • B: Streaming — low-latency alerting; stateful rules.
  • C: Micro-batch — small windows balance cost and freshness.
  • D: Batch — reprocess for corrections and completeness.
  • E: Streaming — near-real-time aggregation.

Exercise 2 — Throughput, partitions, and storage

Assume: average 5,000 events/sec; peak 20,000 events/sec; 1 KB/event; P95 end-to-end 5 seconds; allow 10 minutes lateness.

  • 1) Compute peak ingress MB/s.
  • 2) Estimate raw storage per day and 30 days.
  • 3) Propose topic partitions and consumer concurrency to meet SLA.
  • 4) Choose micro-batch interval or true streaming and justify.
  • 5) State dedup/idempotency strategy and checkpointing.

Sample solution
  • 1) Peak ingress: ~20 MB/s (20,000 × 1 KB).
  • 2) Daily: 5,000 × 86,400 × 1 KB ≈ 432 GB; 30 days ≈ 12.96 TB (raw, before compression).
  • 3) Partitions: 24–48; start with 32, with 8–16 consumers for headroom.
  • 4) Micro-batch 1s windows or true streaming; either fits 5s P95 with checkpoints; micro-batch eases sink loads.
  • 5) Idempotent upserts keyed by event_id; at-least-once delivery; enable checkpoints and DLQ for poison messages.
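
The same capacity math as a quick Python check (using 1 KB = 1,000 bytes, matching the sample numbers):

```python
avg_eps, peak_eps, event_kb = 5_000, 20_000, 1

peak_ingress_mb_s = peak_eps * event_kb / 1_000        # 20.0 MB/s at peak
daily_gb = avg_eps * 86_400 * event_kb / 1_000_000     # 432.0 GB/day, raw
monthly_tb = daily_gb * 30 / 1_000                     # 12.96 TB over 30 days

print(peak_ingress_mb_s, daily_gb, monthly_tb)
```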

Self-check checklist

  • I matched SLAs to the simplest pattern that meets them.
  • I accounted for late/out-of-order events with watermarks and corrections.
  • I planned idempotency and dedup at sinks.
  • I sized partitions/concurrency for peak, not average.
  • I included replay/DLQ and backfill paths.

Common mistakes (and how to catch them)

  • Choosing streaming when batch is enough: ask for a measurable SLA in seconds/minutes.
  • Confusing event time and processing time: always embed event_time; window on event time.
  • No strategy for late data: define lateness, watermark, and correction process.
  • Ignoring idempotency: require unique keys and upsert/merge semantics.
  • Under-sizing for peak: design for P95/P99 bursts; test with load.
  • Lack of replay: ensure checkpoints and a reliable replay from storage.

Practical projects

  • Build a nightly batch load that publishes a curated table with data quality checks.
  • Build a streaming aggregator that emits 10-second rolling counts and handles late data with a 5-minute watermark.
  • Add a dead-letter queue and a replay script that reprocesses failed events safely.
  • Demonstrate schema evolution: add a nullable field, roll forward; then reprocess a day.

Mini challenge

You have a product metrics dashboard needing P95 freshness under 2 minutes, plus an audited monthly report. Propose a design: which parts are streaming, which are batch, how you’ll handle late data, and how you’ll reconcile provisional vs final numbers. Keep your answer to 5–7 sentences.

Next steps

  • Study CDC patterns and how they integrate with streaming pipelines.
  • Practice cost-aware design: batch where possible, streaming where necessary.
  • Add observability: end-to-end latency, lag, watermark delay, and error rates.

Practice Exercises

2 exercises to complete

Instructions

For each scenario, choose Batch, Streaming, or Micro-batch and write one-sentence justification.

  • A. CFO monthly financial close by next morning.
  • B. Payment fraud rule within 5 seconds.
  • C. Product search suggestions updated every 15 minutes.
  • D. Daily sales report by 7 AM with corrected late transactions.
  • E. Ad-click counter for a live dashboard with sub-10-second freshness.

Expected Output
A: Batch; B: Streaming; C: Micro-batch; D: Batch; E: Streaming. Each with a one-sentence justification tied to SLA and complexity.

Batch And Streaming Integration Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

