Why this matters
As an ETL Developer, your jobs often fail or slow down due to out-of-memory errors, heavy disk spills, or temp space exhaustion. Good memory and temp storage management keeps pipelines fast, predictable, and cost-efficient.
- Load large files in chunks without crashing.
- Run joins and sorts efficiently with predictable spill behavior.
- Tune parallelism so multiple tasks fit in memory without overwhelming temp storage.
- Prevent noisy-neighbor effects by isolating temp directories and setting quotas.
Who this is for
- ETL Developers moving from small datasets to tens or hundreds of GB.
- Data Engineers responsible for stable nightly or streaming jobs.
- Anyone optimizing joins, aggregations, or shuffles in batch/streaming systems.
Prerequisites
- Basic ETL concepts: extract-transform-load, batch vs. streaming.
- Familiarity with data formats (CSV/JSON/Parquet) and SQL joins.
- Comfort with resource limits (CPU, RAM, disk) on your runtime environment.
Learning path
- Understand memory vs. temp storage and when each is used.
- Estimate working set size and set safe batch sizes.
- Tune joins, sorts, and partitions to limit spills.
- Add monitoring to catch issues early and iterate.
Concept explained simply
Memory is your fast workbench. Temp storage (local SSD/HDD or network disk) is the overflow table on the side. When the workbench is full, you place items on the side table. That keeps you going but slows you down. When both are full, you stop and fail.
Mental model
- Working set: the slice of data you handle at one time (batch, partition, micro-batch).
- Headroom: reserve 20–40% of memory for overhead (framework, JVM/GC, Python workers, buffers).
- Spill: planned overflow to temp storage; acceptable if controlled and sized.
- Throttle: reduce concurrency or batch size when nearing limits.
Quick sizing rules of thumb
- Estimated row width: sum of column sizes (ints ~8 B, floats ~8 B, booleans ~1 B, datetimes ~8–16 B, strings ~avg length + 4–24 B overhead).
- Rows per batch ≈ (memory_budget_bytes × 0.6) / estimated_row_width. Keep ~40% headroom.
- Target output file size for batch jobs: 128–256 MB per partition to balance file-listing overhead and parallelism.
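The rows-per-batch rule above can be sketched as a small helper (a minimal sketch; the function name and the 0.6 fill factor mirror the rule of thumb, not any framework's API):

```python
def rows_per_batch(memory_budget_bytes: int, row_width_bytes: int,
                   fill: float = 0.6) -> int:
    """Rows that fit in the budget while keeping (1 - fill) headroom."""
    return int(memory_budget_bytes * fill) // row_width_bytes

# 4.8 GB per-task budget, ~150 B rows -> about 19.2 million rows
print(rows_per_batch(int(4.8e9), 150))
```

Re-run the estimate whenever measured row widths drift from your assumption, and round down rather than up.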
Key techniques and heuristics
- Chunking and streaming: read/process data in bounded chunks or streaming iterators to cap memory.
- Join strategy: broadcast small tables; otherwise prefer sort-merge joins for bounded memory and predictable spills.
- Batch sizing: calculate safe batch rows using a memory budget and row width; keep headroom.
- Parallelism vs. memory: total_concurrency × per_task_memory ≤ available_memory. Adjust either dimension.
- Temp storage placement: prefer fast local SSD where possible; isolate to a dedicated directory; monitor free space and I/O wait.
- Spill control: configure spill thresholds, buffer sizes, and partition counts to avoid giant spill files.
- File formats: columnar (e.g., Parquet) with compression reduces memory and I/O; avoid loading entire files into memory.
- Cleanup: ensure temp files are auto-deleted on success/failure; name temp paths per-job for easy purging.
- Observability: track GC pauses, RSS/heap, spill metrics, temp dir usage, I/O wait, and time per stage.
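The temp-placement and cleanup points above can be sketched with the standard library (a minimal sketch; the `etl_` prefix and job id are made-up illustrations, not a convention of any tool):

```python
import os
import tempfile

job_id = "nightly_sort"  # hypothetical job id; use your scheduler's run id

# Per-job temp directory: isolates spills from other jobs and guarantees
# deletion on success *and* failure, because the context manager always
# cleans up on exit, including when an exception is raised.
with tempfile.TemporaryDirectory(prefix=f"etl_{job_id}_") as tmp_dir:
    spill_path = os.path.join(tmp_dir, "spill_0001.bin")
    with open(spill_path, "wb") as f:
        f.write(b"\x00" * 1024)  # stand-in for a real spill file
    print(os.path.exists(spill_path))  # True while the job runs

print(os.path.exists(spill_path))  # False: purged automatically
```

Putting the job id in the directory name also makes orphaned temp paths easy to find and purge by hand.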
How to set a memory budget per task
- Measure available memory on the worker (e.g., 32 GB).
- Reserve 30–40% for overhead (framework, OS cache). Example: 32 GB × 0.6 = 19.2 GB usable.
- Divide by expected concurrency. Example: 4 tasks → ~4.8 GB per task.
- Size batches using 60% of the per-task budget for row data, keeping the rest as headroom.
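The budgeting steps above amount to simple arithmetic (a minimal sketch; the function and parameter names are illustrative):

```python
def per_task_budget_gb(worker_gb: float, overhead_frac: float,
                       tasks: int) -> float:
    """Usable memory per task after reserving overhead."""
    return worker_gb * (1 - overhead_frac) / tasks

# 32 GB worker, 40% overhead, 4 concurrent tasks -> ~4.8 GB per task
budget = per_task_budget_gb(32, 0.40, 4)
print(round(budget, 2))
```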
Choosing temp storage capacity
- Expected spill size ≈ shuffle/sort/join intermediate size × spill_ratio. As a rough start, set temp capacity ≥ 0.5–1.0 × input_size for heavy sorts/joins.
- Use separate disk/partition for temp to avoid competing with OS and logs.
- Set high-watermark alerts at 70% and 90% disk usage.
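The watermark alerts can be driven by the standard library's disk-usage call (a minimal sketch; the thresholds match the 70%/90% rule above, and the function name is illustrative):

```python
import shutil
import tempfile

def temp_alert_level(path: str, warn: float = 0.70,
                     crit: float = 0.90) -> str:
    """Classify a temp mount by its used-space fraction."""
    usage = shutil.disk_usage(path)
    used_frac = usage.used / usage.total
    if used_frac >= crit:
        return "critical"
    if used_frac >= warn:
        return "warning"
    return "ok"

# Check the temp mount and emit the level for your alerting system.
print(temp_alert_level(tempfile.gettempdir()))
```

A periodic job (or your metrics agent) can call this each minute and page on "critical".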
Worked examples
Example 1: Converting 50 GB CSV to Parquet
Context: 50 GB CSV, avg row ~150 B, worker memory 16 GB, run 2 tasks in parallel.
- Usable memory ≈ 16 GB × 0.6 = 9.6 GB → per task ≈ 4.8 GB.
- Rows per batch ≈ 4.8 GB × 0.6 / 150 B ≈ 2.88 GB / 150 B ≈ 19,200,000 rows (adjust down if string widths vary).
- Partitions: 50 GB / 256 MB ≈ 200 partitions.
- Temp storage: target ≤ input_size × 0.3 for a mostly write-only conversion. Allocate ≥ 20 GB of free temp space.
Why this works
Batch size fits per-task memory with headroom. Partition target yields reasonable file sizes and balanced tasks. Temp space covers small spills and writer buffers.
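Example 1's chunked read loop can be sketched with the standard library (a minimal sketch over a tiny in-memory stand-in; a real job would stream the 50 GB file from disk and append each batch to a Parquet writer such as pyarrow's):

```python
import csv
import io
from itertools import islice

# Tiny stand-in for the 50 GB CSV; a real job would open the file path.
sample = io.StringIO("id,val\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

reader = csv.reader(sample)
header = next(reader)

batch_size = 4  # in the worked example this would be ~19 million rows
batches = 0
for batch in iter(lambda: list(islice(reader, batch_size)), []):
    batches += 1
    # here: build a columnar table from `batch` and append it to the writer

print(batches)  # 10 rows in batches of 4 -> 3 batches
```

Because `islice` never materializes more than `batch_size` rows, peak memory stays bounded regardless of file size.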
Example 2: Joining 200M fact rows with 5M dimension rows
Context: Fact ~200M rows (200 GB), Dim ~5M rows (5 GB in memory compressed). Worker pool total usable memory 64 GB.
- Try broadcast join if dim fits comfortably in memory on each worker: 5 GB may be tight. After overhead, assume ~3 GB free per task → risky.
- Switch to sort-merge join: bounded memory, predictable spill.
- Plan: pre-partition both datasets by join key; set partitions ≈ input_size / 256 MB → ~800 for fact, 20–40 for dim, then coalesce dim to match fact partitions.
- Temp storage: allocate ≥ 0.5 × input_size (~100 GB) for sort/shuffle spill across workers.
Outcome
Fewer OOMs, stable throughput with controlled disk spills. Large broadcast avoided.
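The broadcast-vs-sort-merge decision in Example 2 reduces to a memory check (a minimal sketch; the 0.75 safety factor is an assumption for illustration, not a framework default):

```python
def choose_join(dim_gb: float, per_task_free_gb: float,
                safety: float = 0.75) -> str:
    """Broadcast only when the dimension fits well inside per-task free memory."""
    if dim_gb <= per_task_free_gb * safety:
        return "broadcast"
    return "sort_merge"

print(choose_join(dim_gb=5.0, per_task_free_gb=3.0))  # sort_merge: 5 GB > 2.25 GB
print(choose_join(dim_gb=1.0, per_task_free_gb=3.0))  # broadcast: 1 GB <= 2.25 GB
```

Encoding the rule keeps the decision consistent as worker sizes or dimension tables change.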
Example 3: Streaming micro-batches with bursty input
Context: Stream peaks at 2× normal rate; memory per executor 8 GB, 3 concurrent tasks.
- Set micro-batch duration to limit records per batch to fit memory budget.
- Add backpressure: reduce concurrent tasks from 3 to 2 when spill rate or GC time exceeds thresholds.
- Use compact state stores and periodic checkpoints to bound memory.
Result
Latency remains within SLA; no OOM during bursts; spills occur but remain under 70% of temp capacity.
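Example 3's backoff rule can be sketched as a pure function (a minimal sketch; the 20% GC and 70% spill thresholds come from the example above, and the names are illustrative):

```python
def next_concurrency(current: int, gc_time_frac: float, spill_used_frac: float,
                     gc_limit: float = 0.20, spill_limit: float = 0.70,
                     floor: int = 1) -> int:
    """Shed one concurrent task when GC time or spill usage crosses a threshold."""
    if gc_time_frac > gc_limit or spill_used_frac > spill_limit:
        return max(floor, current - 1)
    return current

print(next_concurrency(3, gc_time_frac=0.25, spill_used_frac=0.40))  # burst -> 2
print(next_concurrency(2, gc_time_frac=0.05, spill_used_frac=0.30))  # calm -> 2
```

Keeping the policy a pure function of observed metrics makes it easy to test and to tune thresholds without redeploying the pipeline.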
Exercises (hands-on)
These mirror the exercises below. Do them using your preferred ETL tool or by writing a plan/config with calculations. Use the checklist to validate.
- Checklist:
  - You computed a memory budget with headroom.
  - You derived a rows-per-batch or partition count.
  - You estimated temp storage needs and set thresholds.
  - You defined monitoring signals to trigger backoff.
Exercise 1: Batch sizing and temp planning
You have 24 GB worker memory and want to run 3 tasks in parallel to sort a 120 GB dataset with estimated row width 120 B.
- Compute per-task memory budget with 40% headroom.
- Compute rows per batch using 60% of per-task memory.
- Choose a target file size per partition and compute partition count.
- Estimate temp storage capacity for sorting and set two alert thresholds.
Exercise 2: Join strategy under constraints
You must join a 90 GB fact table with a 2 GB dimension table. Each worker can safely spare 3 GB per task. Decide between broadcast and sort-merge, compute safe concurrency, and propose spill controls.
Common mistakes and how to self-check
- Too many parallel tasks: memory thrashing and excessive spills. Self-check: monitor RSS/heap per task and GC time; reduce concurrency until stable.
- Tiny partitions: too many small files, scheduling overhead. Self-check: keep 128–256 MB targets; compact if below 64 MB.
- Single temp directory for all jobs: contention and full disks. Self-check: isolate per-job temp paths and set quotas.
- Ignoring string/JSON variability: underestimates row width. Self-check: sample real data for average and 95th percentile sizes.
- No cleanup on failure: temp bloat. Self-check: verify temp paths are purged at job end.
Practical projects
- Project A: Re-tune a large sort job
  - Deliver: before/after metrics (runtime, max RSS, spill GB, temp peak %).
  - Success: runtime improves ≥ 20%, no OOM, temp never exceeds 80% capacity.
- Project B: Memory-safe dimension join
  - Deliver: join plan, partitioning scheme, and spill thresholds.
  - Success: stable run across 3 consecutive days with consistent metrics.
- Project C: Streaming backpressure demo
  - Deliver: configuration that adapts concurrency when GC time > 20% or spill rate rises.
  - Success: SLA latency maintained under burst conditions.
Mini challenge
Pick one of your pipelines and produce a one-page resource plan: working set estimate, batch size, partitions, concurrency, temp capacity, and the 3 metrics you will watch. Run it and adjust once based on data.
Next steps
- Apply these rules to one real job this week; record metrics.
- Automate a preflight check that calculates batch size and partitions from input size.
- Add alerts for temp usage at 70% and 90%.
Quick Test
Take the quick test to check your understanding.