Why this matters
As an Analytics Engineer, you decide how data lands in the warehouse and how fresh it is. Choosing batch or micro-batch directly affects dashboard latency, cost, and reliability.
- Keep exec dashboards fresh at acceptable cost.
- Ingest clickstream or app events without overloading your warehouse.
- Design schedules and windows that handle late data safely.
Concept explained simply
Batch processing loads data in chunks on a schedule (e.g., hourly, daily). Micro-batch loads smaller chunks very frequently (e.g., every 1–5 minutes). Both are discrete windows; neither is true continuous streaming.
Mental model: faucet and buckets
Imagine a faucet (incoming data) and buckets (loads). Batch uses big buckets at longer intervals: low overhead, higher latency. Micro-batch uses small buckets often: low latency, higher overhead. Your job is to pick the bucket size so the sink (warehouse) isn’t flooded, and the user gets water (data) when needed.
Key tradeoffs
- Freshness vs cost: Smaller windows improve freshness but increase job frequency, overhead, and often cost.
- Reliability: Bigger windows are simpler and more forgiving; tiny windows amplify transient failures.
- File/object sizes: Overly small windows can create too many tiny files, slowing queries.
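To see why window size drives overhead and file counts, here is a back-of-envelope calculation; the per-job overhead and files-per-load figures are illustrative assumptions, not benchmarks.

```python
# Rough arithmetic: how window size drives job count, overhead, and file counts.
# PER_JOB_OVERHEAD_SEC and FILES_PER_JOB are illustrative assumptions; measure your own.

WINDOW_MINUTES = [1, 5, 60, 1440]      # from 1-minute micro-batch to daily batch
PER_JOB_OVERHEAD_SEC = 30              # assumed fixed cost per load (spin-up, commits)
FILES_PER_JOB = 4                      # assumed number of files written per load

for window in WINDOW_MINUTES:
    jobs_per_day = 24 * 60 // window
    overhead_min = jobs_per_day * PER_JOB_OVERHEAD_SEC / 60
    files_per_day = jobs_per_day * FILES_PER_JOB
    print(f"{window:>4}-min window: {jobs_per_day:>4} jobs/day, "
          f"~{overhead_min:.0f} min overhead/day, {files_per_day} files/day")
```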
Decision guide (quick)
- Need near real-time (≤5 minutes) for critical metrics? Use micro-batch.
- Daily/weekly reports with no urgency? Use batch.
- Unstable source APIs or rate limits? Prefer larger batch windows.
- Warehouse concurrency/cost limits? Avoid overly frequent micro-batches.
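The guide above can be captured as a simple rule of thumb. The sketch below mirrors the bullets; the 5-minute threshold and the boolean inputs are assumptions to adapt to your platform, not fixed rules.

```python
# A rule-of-thumb helper that mirrors the decision guide above.
# The 5-minute threshold and the boolean inputs are assumptions to tune.

def choose_ingestion_mode(freshness_sla_min: float,
                          source_is_unstable: bool,
                          concurrency_is_tight: bool) -> str:
    """Return 'micro-batch' or 'batch' per the quick decision guide."""
    if source_is_unstable or concurrency_is_tight:
        # Unstable APIs, rate limits, or tight concurrency favour larger windows.
        return "batch"
    if freshness_sla_min <= 5:
        # Near real-time SLA for critical metrics.
        return "micro-batch"
    return "batch"

print(choose_ingestion_mode(3, source_is_unstable=False, concurrency_is_tight=False))   # micro-batch
print(choose_ingestion_mode(1440, source_is_unstable=True, concurrency_is_tight=False)) # batch
```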
Worked examples
Example 1: Daily SaaS exports
Source: CRM exports one full file at ~01:00 daily. Analysts need data by 06:00.
- Pick: Batch (daily).
- Schedule: Start at 01:30, aim to finish by 02:00, and allow retries until a 05:00 cutoff.
- Notes: Use incremental keys if available to reduce load volume.
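A minimal sketch of this schedule under stated assumptions: file_is_ready(), load_crm_export(), and TransientSourceError are hypothetical placeholders for your readiness check, idempotent load, and retryable error type.

```python
# Minimal sketch of the daily batch schedule in Example 1. file_is_ready(),
# load_crm_export(), and TransientSourceError are hypothetical placeholders for
# your readiness check, idempotent load, and retryable error type.

import time
from datetime import datetime, time as dtime

class TransientSourceError(Exception):
    """Placeholder for a retryable failure (e.g., the export is still being written)."""

def file_is_ready() -> bool:
    return True            # replace with a real check on the CRM drop location

def load_crm_export() -> None:
    pass                   # replace with the actual idempotent upsert/merge load

RETRY_INTERVAL_SEC = 15 * 60     # retry every 15 minutes
CUTOFF = dtime(hour=5)           # stop retrying at 05:00, ahead of the 06:00 SLA

def run_daily_batch() -> None:
    """Started ~01:30 by the scheduler; retries until the 05:00 cutoff."""
    while datetime.now().time() < CUTOFF:
        try:
            if file_is_ready():
                load_crm_export()      # idempotent, so reruns are safe
                return
        except TransientSourceError:
            pass                       # transient failure: fall through and retry
        time.sleep(RETRY_INTERVAL_SEC)
    raise RuntimeError("CRM export not loaded before the 05:00 cutoff - alert the team")
```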
Example 2: Clickstream dashboards within 3 minutes
Source: Web events at ~30k/minute. Dashboard freshness SLA: ≤3 minutes.
- Pick: Micro-batch (1–2 minute windows).
- Setup: 2-minute ingestion windows, deduplicate by event_id, watermark 10 minutes for late arrivals.
- Notes: Compact small files hourly into partitioned tables to avoid file sprawl.
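A sketch of the per-window logic under assumptions: events arrive as records with event_id and event_time fields, and the 2-minute window and 10-minute watermark match the setup above.

```python
# One micro-batch window from Example 2: drop events older than the watermark,
# then deduplicate on event_id. The event shape (dicts with event_id and
# event_time) is an assumption for illustration.

from datetime import datetime, timedelta
from typing import Iterable

WATERMARK = timedelta(minutes=10)    # accept events up to 10 minutes late

def prepare_window(events: Iterable[dict], window_end: datetime) -> list[dict]:
    """Return deduplicated, watermark-filtered rows ready to upsert."""
    cutoff = window_end - WATERMARK
    seen: set[str] = set()
    rows: list[dict] = []
    for e in events:
        if e["event_time"] < cutoff:
            continue                 # too late: leave it for the periodic backfill
        if e["event_id"] in seen:
            continue                 # duplicate delivery from the source
        seen.add(e["event_id"])
        rows.append(e)
    return rows                      # upsert into the warehouse, keyed on event_id
```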
Example 3: Finance ledger consistency
Source: Accounting system that posts entries at end of day; data must be reconciled before loading.
- Pick: Batch (daily after close).
- Setup: Single daily load after reconcile flag is true.
- Notes: Emphasize correctness and idempotent upserts over speed.
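One way to keep this load idempotent is a MERGE keyed on the ledger entry id. The sketch below uses generic warehouse-style SQL wrapped in Python; the table and column names are illustrative assumptions.

```python
# Idempotent daily upsert for Example 3, written as generic warehouse-style SQL.
# Table and column names (stg_ledger, ledger, entry_id, ...) are illustrative
# assumptions; run_sql stands in for your warehouse client's execute call.

MERGE_LEDGER = """
MERGE INTO ledger AS t
USING stg_ledger AS s
  ON t.entry_id = s.entry_id
WHEN MATCHED THEN UPDATE SET
  amount      = s.amount,
  account_id  = s.account_id,
  posted_date = s.posted_date
WHEN NOT MATCHED THEN INSERT (entry_id, amount, account_id, posted_date)
  VALUES (s.entry_id, s.amount, s.account_id, s.posted_date);
"""

def load_ledger(run_sql) -> None:
    """Rerunning after a failure cannot duplicate rows: the MERGE keys on entry_id."""
    run_sql(MERGE_LEDGER)
```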
Example 4: Product analytics near real-time
Source: App events 50k/min; PMs want funnels updated every 2 minutes.
- Pick: Micro-batch (1-minute windows).
- Setup: Window = 1 minute, watermark = 5 minutes, late-arrival handling with upserts.
- Notes: Enforce partitioning by event_date and hour; compact every 30–60 min.
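A sketch of the hourly compaction planner mentioned above, assuming you can list file sizes per partition; the size threshold and file-count trigger are illustrative, and the actual rewrite (e.g., INSERT OVERWRITE or CTAS) depends on your warehouse or lake.

```python
# Hourly compaction planner for Example 4: find partitions cluttered with small
# files and flag them for a rewrite. The metadata shape and both thresholds are
# illustrative assumptions; the rewrite itself depends on your warehouse or lake.

from collections import defaultdict

SMALL_FILE_BYTES = 32 * 1024 * 1024    # treat files under ~32 MB as "small"
MIN_SMALL_FILES = 8                    # only rewrite partitions with real sprawl

def plan_compaction(files: list[dict]) -> list[tuple]:
    """files: [{'partition': ('2024-05-01', 13), 'size_bytes': 1048576}, ...]"""
    small_counts: defaultdict = defaultdict(int)
    for f in files:
        if f["size_bytes"] < SMALL_FILE_BYTES:
            small_counts[f["partition"]] += 1
    # Each flagged (event_date, hour) partition gets rewritten as a few large files.
    return [p for p, n in small_counts.items() if n >= MIN_SMALL_FILES]
```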
How to choose window size
- Define a freshness target (e.g., data is no more than 5 minutes old for 99% of loads).
- Measure source arrival variance (how late can events be?).
- Estimate job overhead and warehouse concurrency.
- Pick a window that meets freshness without breaching quotas; set a watermark to catch late data.
- Plan file compaction for micro-batch (hourly or daily).
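A back-of-envelope way to turn these steps into numbers. The formulas are simple heuristics, not a standard algorithm: the window plus load time should fit inside the freshness target, the job frequency should respect quotas, and the watermark should cover observed lateness.

```python
# Heuristic sketch of the window-sizing steps above. These are rules of thumb,
# not a standard algorithm.

def suggest_window(freshness_target_min: float,
                   p99_event_lateness_min: float,
                   load_time_min: float,
                   max_jobs_per_hour: int) -> dict:
    # Worst-case staleness is roughly one window plus the load time.
    window = max(1.0, freshness_target_min - load_time_min)
    # Respect warehouse/orchestrator quotas on job frequency.
    window = max(window, 60.0 / max_jobs_per_hour)
    # Cover late arrivals, and keep the watermark at least ~3x the window.
    watermark = max(p99_event_lateness_min, 3 * window)
    return {"window_min": window, "watermark_min": watermark}

print(suggest_window(freshness_target_min=5.0, p99_event_lateness_min=8.0,
                     load_time_min=2.0, max_jobs_per_hour=30))
# -> {'window_min': 3.0, 'watermark_min': 9.0}
```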
Practical defaults
- Batch: daily or hourly; align to source update times.
- Micro-batch: 1–5 minutes; watermark 3–10x the window length.
- Compaction: hourly for hot partitions; daily for warm partitions.
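These defaults could live in a small per-pipeline config. The structure below is an assumed shape for illustration, not any specific orchestrator's format.

```python
# The practical defaults above, captured as a per-pipeline config. The structure
# is an assumed shape for illustration, not any specific orchestrator's format.

PIPELINES = {
    "crm_daily": {
        "mode": "batch",
        "schedule": "30 1 * * *",       # daily at 01:30, aligned to the source export
        "compaction": None,
    },
    "clickstream": {
        "mode": "micro-batch",
        "window_minutes": 2,
        "watermark_minutes": 10,        # ~5x the window, inside the 3-10x guideline
        "compaction": "hourly",         # hot partitions
    },
}
```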
Design patterns and safeguards
- Idempotent loads: Upsert on a stable key (event_id) so reruns are safe.
- Watermarks: In micro-batch, ignore events older than a threshold in the main load; route stragglers to a periodic backfill job.
- Backpressure: If jobs queue up, widen the window or move transformation work out of the ingestion step (a sketch of this rule follows this list).
- Cost control: Use clustering/partitioning; compact small files.
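A sketch of the backpressure rule mentioned in the list, assuming you can observe how many loads are queued and adjust the window between runs; the queue-depth thresholds are illustrative.

```python
# Simple backpressure rule: if loads start queueing, widen the window; if there
# is headroom, cautiously narrow it again. The queue-depth thresholds are
# illustrative assumptions.

def adjust_window(current_window_min: int, queued_jobs: int,
                  min_window: int = 1, max_window: int = 15) -> int:
    if queued_jobs >= 3:
        # Falling behind: halve the load frequency by doubling the window.
        return min(max_window, current_window_min * 2)
    if queued_jobs == 0 and current_window_min > min_window:
        # Plenty of headroom: tighten freshness one notch.
        return max(min_window, current_window_min - 1)
    return current_window_min
```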
Exercises
Do these to solidify your understanding, then take the Quick Test below.
Exercise 1: Schedule a daily batch safely (ex1)
Scenario: A CRM drops a 5M-row CSV at ~01:00 daily. Analysts need data by 06:00. Warehouse costs should be minimized, and the source sometimes arrives late by up to 30 minutes.
- Choose batch or micro-batch and justify.
- Propose a start time, retries, and a cutoff time.
- State how you ensure idempotency and partial-failure recovery.
Exercise 2: Configure micro-batch windows (ex2)
Scenario: Clickstream events ~50k/min. PMs want dashboards within 2 minutes. Late events can arrive up to 5 minutes late. Warehouse has limited concurrency.
- Pick a window size and watermark.
- Describe deduplication and file compaction strategy.
- Explain how you handle backpressure if jobs queue.
Self-check checklist
- Your choice matches the freshness requirement.
- You accounted for source arrival variance with a watermark or buffer.
- Your plan is idempotent and includes retries.
- You managed cost via window sizing and compaction.
Common mistakes and self-check
- Too-small windows creating tiny files and high cost. Self-check: Are partitions flooded with sub-1MB files? Add compaction or widen windows.
- No watermark for late events. Self-check: How many events arrive after the window closes? Track late-arrival rate and set thresholds.
- Non-idempotent inserts. Self-check: Can reruns duplicate rows? Use upsert/merge on unique keys.
- Ignoring source schedule. Self-check: Do you run before data is ready? Align batch start with source readiness.
- Overloading concurrency. Self-check: Do frequent jobs queue? Increase window length or consolidate tasks.
Mini tasks
- Write one sentence that defines when you would pick batch over micro-batch.
- List the three parameters you would tune first for micro-batch: window, watermark, compaction cadence.
- Sketch a retry strategy for a job that can fail due to transient API limits.
Practical projects
- Build a daily batch pipeline that ingests a CSV, upserts into a partitioned table, and retries on failure. Add a summary model that validates row counts.
- Create a micro-batch ingestion for events (simulate with files every minute), deduplicate by event_id, apply a 10-minute watermark, and compact hourly.
- Cost/freshness tuner: Run the same dataset with 1-, 2-, and 5-minute windows and compare warehouse cost, files created, and dashboard freshness.
Who this is for
- Analytics Engineers designing ingestion to the warehouse.
- Data Analysts owning dashboard SLAs who need practical freshness choices.
- Data Engineers standardizing ingestion patterns before streaming.
Prerequisites
- Basic SQL (SELECT, JOIN, INSERT/UPSERT/MERGE).
- Familiarity with your warehouse partitioning/clustering.
- Basic job scheduling concepts (cron or orchestrator).
Learning path
- This lesson: Batch vs micro-batch tradeoffs and patterns.
- Next: Incremental models and idempotent upserts.
- Then: Late-arriving data handling (watermarks and backfills).
- Finally: Observability for freshness, cost, and reliability.
Next steps
- Complete the two exercises and take the Quick Test below.
- Pick one real pipeline and adjust its window size or compaction. Measure impact.
- Document your SLA targets and watermark policy for your team.
Mini challenge
Your marketing team wants lead-gen metrics within 10 minutes, but the ad platform API rate limits burst calls. Propose a window size, retry spacing, and a watermark, and explain how you will stay under rate limits while meeting freshness.