Why this matters
As a Data Architect, you choose integration patterns that balance freshness, cost, reliability, and complexity. Many business outcomes depend on this choice: real-time alerts, hourly dashboards, daily financials, or backfills after schema changes. Picking batch vs streaming (or micro-batch) determines tooling, SLAs, failure handling, and data contracts across teams.
Concept explained simply
Batch: collect a set of records and process them together on a schedule (e.g., hourly, nightly). Great for high throughput and lower cost. Freshness is the trade-off.
Streaming: process events continuously as they happen (record-by-record or tiny time slices). Great for low latency and reactive apps. Complexity and cost are the trade-off.
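To make the contrast concrete, here is a minimal plain-Python sketch (no framework assumed): the batch function sees the whole dataset at once on a schedule, while the streaming class updates state one event at a time. The event shape and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical events; in practice these come from files (batch) or a topic (streaming).
events = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

def batch_totals(all_events):
    """Batch: the complete dataset is available, so totals are computed in one scheduled pass."""
    totals = defaultdict(int)
    for e in all_events:
        totals[e["user"]] += e["amount"]
    return dict(totals)

class StreamingTotals:
    """Streaming: state is updated incrementally as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, e):
        self.totals[e["user"]] += e["amount"]
        return dict(self.totals)  # the current, always-provisional view

print(batch_totals(events))   # {'a': 17, 'b': 5}

stream = StreamingTotals()
for e in events:
    print(stream.on_event(e))  # the answer evolves with every event
```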
Mental model
- Batch is a truck: fill it up, drive it on a schedule. Efficient, not instant.
- Streaming is a conveyor belt: items move continuously. Fast, but you must handle jams and variability.
Architect’s triangle: of low latency, low cost, and low complexity, you can strongly optimize only two at a time. Choose deliberately based on business SLAs.
Key terms at a glance
- Event time vs processing time: when the event occurred vs when you processed it.
- Windows: tumbling (non-overlapping), sliding (overlapping), session (activity-based).
- Watermarks: a heuristic boundary for handling late data (see the sketch after this list).
- Delivery semantics: at-most-once, at-least-once, exactly-once (usually via idempotency + checkpoints).
- Micro-batch: tiny batches (e.g., 1–60 seconds) approximating streaming with simpler operational overhead.
- Backpressure: when downstream is slower than upstream; you must throttle or buffer.
- CDC (Change Data Capture): stream inserts, updates, and deletes from databases as events.
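The following pure-Python sketch ties several of these terms together: tumbling windows keyed on event time, a watermark derived from the highest event time seen so far, and a branch for data beyond allowed lateness. The window size, the 10-minute lateness policy, and the event timestamps are assumptions for illustration, not tied to any particular engine.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=1)             # tumbling window size (assumption)
ALLOWED_LATENESS = timedelta(minutes=10)  # lateness policy (assumption)

windows = defaultdict(int)  # window start -> event count
max_event_time = None       # highest event time seen so far

def window_start(ts):
    """Align an event to its tumbling window using event time, not processing time."""
    return ts - (ts - datetime(1970, 1, 1)) % WINDOW

def on_event(event_time):
    global max_event_time
    max_event_time = event_time if max_event_time is None else max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS  # heuristic completeness boundary
    if event_time < watermark:
        # Beyond allowed lateness: route to a correction path instead of the live window.
        print(f"late event {event_time}, watermark {watermark}")
        return
    windows[window_start(event_time)] += 1

# Events arrive out of order; the 10:00:30 event shows up after 10:02:10,
# and the 09:45:00 event is older than the watermark allows.
for h, m, s in [(10, 0, 5), (10, 1, 20), (10, 2, 10), (10, 0, 30), (9, 45, 0)]:
    on_event(datetime(2024, 1, 1, h, m, s))

print({k.strftime("%H:%M"): v for k, v in sorted(windows.items())})
# {'10:00': 2, '10:01': 1, '10:02': 1}
```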
Worked examples
Example 1 — Daily financial consolidation
Need: Accurate ledger by 7 AM next day. Source: ERP, payments, invoices.
- Pattern: Batch (nightly).
- Why: Accuracy over immediacy; controlled reprocessing; predictable cost.
- Notes: Include a backfill path and data quality checks before publish (sketched below).
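A minimal sketch of the "check before publish" gate, using only the standard library; the extract format, the specific checks, and the publish step are placeholder assumptions.

```python
import csv
import io

# Hypothetical nightly export (normally a file drop or warehouse extract, not a string).
RAW = """ledger_id,account,amount
1,4000,120.50
2,4010,
3,4000,-75.00
"""

def load(raw):
    return list(csv.DictReader(io.StringIO(raw)))

def quality_checks(rows):
    """Return a list of human-readable failures; an empty list means the batch may publish."""
    failures = []
    if not rows:
        failures.append("empty extract")
    missing = [r["ledger_id"] for r in rows if not r["amount"]]
    if missing:
        failures.append(f"missing amount for ledger_id(s): {missing}")
    return failures

def publish(rows):
    # In a real pipeline this writes the curated table; here we just print.
    print(f"published {len(rows)} rows")

rows = load(RAW)
problems = quality_checks(rows)
if problems:
    # Fail the nightly run loudly instead of publishing bad financials.
    print("blocked publish:", problems)
else:
    publish(rows)
```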
Example 2 — Fraud flag within seconds
Need: Flag suspicious transactions within 5 seconds.
- Pattern: Streaming (low-latency, stateful windows).
- Why: Real-time requirement; event-time rules; enrichment from reference data.
- Notes: At-least-once delivery + idempotent sink; dedup on transaction_id (see the sketch below).
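A sketch of the idempotent-sink idea under at-least-once delivery: writes are keyed by transaction_id, so a redelivered event overwrites itself instead of double-counting. The record shape is assumed for illustration.

```python
class IdempotentSink:
    """Under at-least-once delivery duplicates will arrive; keyed upserts keep replays safe."""

    def __init__(self):
        self.rows = {}  # transaction_id -> latest record (stand-in for a keyed table)

    def upsert(self, record):
        # A redelivered event overwrites itself, so downstream counts never double.
        self.rows[record["transaction_id"]] = record

sink = IdempotentSink()
for event in [
    {"transaction_id": "t1", "amount": 40, "flagged": True},
    {"transaction_id": "t2", "amount": 9, "flagged": False},
    {"transaction_id": "t1", "amount": 40, "flagged": True},  # retry delivers a duplicate
]:
    sink.upsert(event)

print(len(sink.rows))  # 2, not 3: the duplicate did not double-count
```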
Example 3 — Marketing attribution with both speed and accuracy
Need: Near-real-time attribution for monitoring and daily corrected numbers for reporting.
- Pattern: Hybrid (stream for provisional metrics; batch for corrected results).
- Why: Streaming gives quick visibility; batch corrects late/out-of-order events.
- Notes: Mark provisional vs final; reconcile using event time and watermarks (see the sketch below).
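One way to sketch the provisional-vs-final split: streaming output is labeled provisional, and the nightly batch overwrites whatever it has recomputed from the full event log. Keys, field names, and numbers are assumptions.

```python
# Provisional (streaming) attribution counts, keyed by (campaign, event_date).
provisional = {
    ("summer_sale", "2024-06-01"): {"conversions": 118, "status": "provisional"},
    ("summer_sale", "2024-06-02"): {"conversions": 95,  "status": "provisional"},
}

# Next-day batch recomputes from the complete event log, including late arrivals.
final = {
    ("summer_sale", "2024-06-01"): {"conversions": 124, "status": "final"},
}

def reconcile(provisional, final):
    """Final numbers replace provisional ones; anything not yet recomputed stays provisional."""
    merged = dict(provisional)
    merged.update(final)
    return merged

for key, row in sorted(reconcile(provisional, final).items()):
    print(key, row)
```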
Example 4 — IoT telemetry at scale
Need: Aggregate device metrics every minute; alert on spikes within 30 seconds.
- Pattern: Micro-batch (15–30s windows) plus alerting on sliding windows.
- Why: Balances cost and latency; windowing simplifies state management.
- Notes: Apply backpressure; buffer bursts; define a late data policy (alerting sketched below).
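A sliding-window spike detector in plain Python, as one possible shape for the alerting path; the 30-second window and the threshold are assumptions taken from the scenario.

```python
from collections import deque

class SpikeAlert:
    """Track readings inside a sliding time window and alert when the count spikes."""

    def __init__(self, window_seconds=30, threshold=5):
        self.window = window_seconds
        self.threshold = threshold
        self.readings = deque()  # (timestamp_seconds, value) pairs currently in the window

    def on_reading(self, ts, value):
        self.readings.append((ts, value))
        # Evict readings that have slid out of the window.
        while self.readings and self.readings[0][0] <= ts - self.window:
            self.readings.popleft()
        if len(self.readings) >= self.threshold:
            print(f"ALERT at t={ts}s: {len(self.readings)} readings in the last {self.window}s")

detector = SpikeAlert()
for t in [0, 5, 31, 33, 34, 35, 36]:  # a burst arrives between t=31 and t=36
    detector.on_reading(t, value=1.0)
```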
Architecture choices (when to use what)
Choose Batch when…
- SLAs are minutes to days and accuracy matters more than immediacy.
- Sources are bulk-friendly (files, snapshots, nightly exports).
- You need heavy transformations, dimensional modeling, and reprocessing simplicity.
Choose Streaming when…
- SLAs are seconds to a couple of minutes.
- Event-driven products: alerts, personalization, monitoring.
- CDC or clickstream where freshness is critical.
Choose Micro-batch when…
- You want near-real-time (e.g., 15–60s) with simpler operations than record-by-record processing.
- Downstream sinks prefer batches (warehouse loads) but with small delay.
- Cost/complexity of full streaming is not justified (a rough decision helper is sketched after these lists).
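The rules of thumb above can be condensed into a rough decision helper; the cut-offs below are illustrative assumptions, not hard limits.

```python
def choose_pattern(freshness_sla_seconds, event_driven=False):
    """Rough mapping from a freshness SLA to a pattern; cut-offs are illustrative, not hard rules."""
    if event_driven or freshness_sla_seconds < 15:
        return "streaming"    # seconds-level SLAs or reactive products (alerts, personalization)
    if freshness_sla_seconds <= 120:
        return "micro-batch"  # near-real-time with simpler operations
    return "batch"            # minutes to days: optimize for cost, accuracy, and reprocessing

print(choose_pattern(5, event_driven=True))  # streaming   (fraud flags)
print(choose_pattern(60))                    # micro-batch (near-real-time dashboards)
print(choose_pattern(24 * 3600))             # batch       (nightly consolidation)
```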
Late and out-of-order data
- Define allowable lateness (e.g., 10 minutes) and use watermarks.
- Provide a correction path (retractions, upserts, or batch reconciliation).
- Use event time in windows to avoid skew from processing delays.
Design steps (quick blueprint)
- Clarify SLA: freshness target (P95), accuracy rules, and allowable lateness.
- Profile data: peak/avg throughput, event size, skew, ordering guarantees.
- Contract schema: keys, event time, idempotency key, versioning, nullable fields (an example contract is sketched after these steps).
- Pick pattern: batch vs streaming vs micro-batch; justify with SLA and cost.
- Plan resilience: checkpoints, retries, dead-letter queue, replay strategy.
- Define outputs: sinks, partitioning, compaction/upserts, governance and lineage.
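A sketch of what a contracted event schema might look like as a typed record; the event type, field names, and versioning rule are assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class OrderEvent:
    """Illustrative data contract for one event type; names and fields are assumptions."""
    event_id: str            # idempotency key: unique per logical event, stable across retries
    order_id: str            # business key used for joins and upserts downstream
    event_time: datetime     # when it happened (used for windowing), not when it was processed
    schema_version: int      # bump on breaking changes; additive fields stay backward compatible
    coupon_code: Optional[str] = None  # nullable fields are explicit, never implied

evt = OrderEvent(
    event_id="e-123",
    order_id="o-987",
    event_time=datetime(2024, 6, 1, 12, 30, 0),
    schema_version=2,
)
print(evt)
```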
Who this is for
- Aspiring and practicing Data Architects defining integration patterns.
- Engineers deciding between batch and streaming for new pipelines.
- Analysts and product partners clarifying data freshness needs.
Prerequisites
- Basic data modeling (keys, partitions, schemas).
- ETL/ELT fundamentals and warehouse/lake concepts.
- Comfort with SLAs, metrics (P95/P99), and simple capacity math.
Learning path
- Start with batch fundamentals and scheduling.
- Learn streaming concepts: events, windows, watermarks, delivery semantics.
- Practice hybrid designs and late data handling.
- Add reliability: checkpoints, DLQs, replay, idempotent sinks.
Exercises
These mirror the graded exercises below. Work through them here, then record your final answers in the exercise section.
Exercise 1 — Map needs to Batch, Streaming, or Micro-batch
For each scenario, choose Batch, Streaming, or Micro-batch and write one-sentence justification.
- A. CFO monthly financial close by next morning.
- B. Payment fraud rule within 5 seconds.
- C. Product search suggestions updated every 15 minutes.
- D. Daily sales report by 7 AM with corrected late transactions.
- E. Ad-click counter for a live dashboard with sub-10-second freshness.
Sample solution
- A: Batch — accuracy over immediacy.
- B: Streaming — low-latency alerting; stateful rules.
- C: Micro-batch — small windows balance cost and freshness.
- D: Batch — reprocess for corrections and completeness.
- E: Streaming — near-real-time aggregation.
Exercise 2 — Throughput, partitions, and storage
Assume: average 5,000 events/sec; peak 20,000 events/sec; 1 KB/event; P95 end-to-end latency of 5 seconds; allowed lateness of 10 minutes.
- 1) Compute peak ingress MB/s.
- 2) Estimate raw storage per day and 30 days.
- 3) Propose topic partitions and consumer concurrency to meet SLA.
- 4) Choose micro-batch interval or true streaming and justify.
- 5) State dedup/idempotency strategy and checkpointing.
Sample solution
- 1) Peak ingress: ~20 MB/s (20,000 × 1 KB).
- 2) Daily: 5,000 × 86,400 × 1 KB ≈ 432 GB; 30 days ≈ 12.96 TB (raw, before compression).
- 3) Partitions: 24–48; start with 32 partitions and 8–16 consumers for headroom.
- 4) Micro-batch with ~1 s intervals or true streaming; either meets the 5-second P95 with checkpointing; micro-batch simplifies loads into the sink.
- 5) Idempotent upserts keyed by event_id; at-least-once delivery; enable checkpoints and DLQ for poison messages.
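A quick script to double-check the arithmetic in the sample solution above, assuming decimal units (1 KB = 1,000 bytes), which is what the sample figures use.

```python
EVENT_KB = 1
AVG_EPS, PEAK_EPS = 5_000, 20_000
SECONDS_PER_DAY = 86_400

peak_mb_per_s = PEAK_EPS * EVENT_KB / 1_000                   # KB/s  -> MB/s
daily_gb = AVG_EPS * SECONDS_PER_DAY * EVENT_KB / 1_000_000   # KB/day -> GB/day
monthly_tb = daily_gb * 30 / 1_000                            # GB -> TB over 30 days

print(f"peak ingress ~ {peak_mb_per_s:.0f} MB/s")   # ~20 MB/s
print(f"raw per day  ~ {daily_gb:.0f} GB")          # ~432 GB
print(f"raw per 30 d ~ {monthly_tb:.2f} TB")        # ~12.96 TB
```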
Self-check checklist
- I matched SLAs to the simplest pattern that meets them.
- I accounted for late/out-of-order events with watermarks and corrections.
- I planned idempotency and dedup at sinks.
- I sized partitions/concurrency for peak, not average.
- I included replay/DLQ and backfill paths.
Common mistakes (and how to catch them)
- Choosing streaming when batch is enough: ask for a measurable SLA in seconds/minutes.
- Confusing event time and processing time: always embed event_time; window on event time.
- No strategy for late data: define lateness, watermark, and correction process.
- Ignoring idempotency: require unique keys and upsert/merge semantics.
- Under-sizing for peak: design for P95/P99 bursts; test with load.
- Lack of replay: ensure checkpoints and a reliable replay from storage.
Practical projects
- Build a nightly batch load that publishes a curated table with data quality checks.
- Build a streaming aggregator that emits 10-second rolling counts and handles late data with a 5-minute watermark.
- Add a dead-letter queue and a replay script that reprocesses failed events safely (see the sketch after this list).
- Demonstrate schema evolution: add a nullable field, roll forward; then reprocess a day.
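For the dead-letter-queue project, a minimal in-memory sketch of the control flow; the handler logic, message format, and in-memory DLQ are placeholders for a real topic or table.

```python
import json

def handle(raw_message):
    """Placeholder business logic: parse and transform one message; raise on bad input."""
    event = json.loads(raw_message)
    return event["event_id"], event["value"] * 2

def consume(messages, handler, dlq):
    """Process a stream; failed messages land in the dead-letter queue with their error."""
    for raw in messages:
        try:
            handler(raw)
        except Exception as exc:
            dlq.append({"raw": raw, "error": repr(exc)})

def replay(dlq, handler):
    """Reprocess dead-lettered messages after a fix; return whatever still fails."""
    still_failing = []
    for entry in dlq:
        try:
            handler(entry["raw"])
        except Exception as exc:
            still_failing.append({"raw": entry["raw"], "error": repr(exc)})
    return still_failing

dlq = []
consume(['{"event_id": "e1", "value": 3}', 'not-json', '{"event_id": "e2"}'], handle, dlq)
print(len(dlq), "dead-lettered")                                # 2: bad JSON and a missing field
print(len(replay(dlq, handle)), "still failing after replay")   # 2 until the handler is fixed
```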
Mini challenge
You have a product metrics dashboard needing P95 freshness under 2 minutes, plus an audited monthly report. Propose a design: which parts are streaming, which are batch, how you’ll handle late data, and how you’ll reconcile provisional vs final numbers. Keep your answer to 5–7 sentences.
Next steps
- Study CDC patterns and how they integrate with streaming pipelines.
- Practice cost-aware design: batch where possible, streaming where necessary.
- Add observability: end-to-end latency, lag, watermark delay, and error rates.