Why this matters
As an ETL Developer, you must size pipelines and set realistic expectations. Estimating load volumes and defining SLAs ensures your jobs finish on time, avoids backlogs, keeps storage costs under control, and meets business needs like daily reports or near real-time dashboards.
- Plan compute and storage capacity before building.
- Schedule jobs to meet reporting deadlines.
- Negotiate clear, testable SLAs with stakeholders.
- Detect and prepare for peak traffic and growth.
Concept explained simply
Load volume is how much data you move and transform (rows, files, or bytes). An SLA (Service Level Agreement) states how fast and how reliably your data arrives and how complete it is.
Mental model
Think of a pipeline like a conveyor belt:
- Volume: how many boxes arrive per time.
- Window: the time you have to move them.
- Throughput: how fast the belt runs.
- Buffer: safety margin for spikes.
- SLA: promise to customers about delivery time and completeness.
Step-by-step estimation framework
- Baseline volume: Estimate average rows/files/day and average record or file size.
  daily_uncompressed_bytes = records_per_day * avg_bytes_per_record
- Peak factor: Identify peak multipliers (hourly/day-of-week/seasonal). Use recent 95th–99th percentile rates.
  peak_rate = avg_rate * peak_multiplier
- Growth: Apply expected growth (weekly/monthly/annual) and known business events.
  forecast = baseline * (1 + growth_rate)
- Compression: Estimate compressed size if your storage/transport compresses data (e.g., 3:1).
  compressed_bytes = uncompressed_bytes / compression_ratio
- Load window: Define the time available to finish a load (e.g., 02:00–04:00 daily; or 5-minute microbatches).
- Required throughput: Ensure you can process the peak batch within the window.
  throughput_required = peak_batch_size / window_duration
- Storage footprint: Include raw, staging, and curated zones with retention.
  storage_total ≈ (raw_bytes * raw_days) + (staging_bytes * staging_days) + (curated_bytes * curated_days)
- Buffers & retries: Add 20–50% headroom for spikes, retries, and schema drift. Define a backfill and catch-up strategy.
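Below is a minimal Python sketch of the whole framework. Every input value is an illustrative assumption (not from a real workload), and the curated-zone ratio is a guess you would replace with your own measurements:

```python
# Back-of-envelope sizing; all inputs are assumed example values.
records_per_day = 10_000_000
avg_bytes_per_record = 500
peak_multiplier = 2.0        # from recent 95th–99th percentile rates
growth_rate = 0.10           # expected growth over the planning horizon
compression_ratio = 3.0      # e.g., 3:1 storage/transport compression
window_minutes = 120         # nightly load window
buffer_factor = 1.3          # 30% headroom for spikes, retries, schema drift

daily_uncompressed_bytes = records_per_day * avg_bytes_per_record
forecast_bytes = daily_uncompressed_bytes * (1 + growth_rate)
peak_batch_bytes = forecast_bytes * peak_multiplier
compressed_peak_bytes = peak_batch_bytes / compression_ratio

required_mb_per_min = compressed_peak_bytes / 1e6 / window_minutes
target_mb_per_min = required_mb_per_min * buffer_factor

# Storage footprint by zone, each with its own retention in days
raw_days, staging_days, curated_days = 30, 7, 365
curated_bytes_per_day = compressed_peak_bytes * 0.2   # assume curated ≈ 20% of raw
storage_total_bytes = (
    compressed_peak_bytes * raw_days
    + compressed_peak_bytes * staging_days
    + curated_bytes_per_day * curated_days
)

print(f"Required throughput: {required_mb_per_min:.1f} MB/min "
      f"(target with buffer: {target_mb_per_min:.1f} MB/min)")
print(f"Estimated storage: {storage_total_bytes / 1e9:.1f} GB")
```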
What to include in an SLA
- Freshness/latency: how quickly data becomes available (e.g., within 15 minutes).
- Availability: percent of time the data pipeline meets its targets (e.g., 99.5% monthly).
- Completeness: minimum percent of records/partitions delivered (e.g., ≥ 99.9%).
- Quality guardrails: e.g., duplicate or null rate thresholds.
- Recovery goals: RPO (max data loss) and RTO (max time to recover).
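To make these targets testable, it helps to encode them as configuration and evaluate each run against it. The sketch below is one possible shape; the field names, thresholds, and helper functions are assumptions, not a standard schema:

```python
from datetime import datetime, timedelta, timezone

# Illustrative SLA definition; field names and values are assumptions.
sla = {
    "freshness_minutes": 15,    # data available within 15 minutes of capture
    "availability_pct": 99.5,   # % of intervals meeting targets, per month
    "completeness_pct": 99.9,   # minimum % of expected records per run
    "max_duplicate_pct": 0.1,   # quality guardrail
    "rpo_minutes": 5,           # max tolerable data loss
    "rto_minutes": 30,          # max time to recover
}

def check_freshness(latest_event_time: datetime, sla: dict) -> bool:
    """True if the newest loaded event (timezone-aware) meets the freshness target."""
    lag = datetime.now(timezone.utc) - latest_event_time
    return lag <= timedelta(minutes=sla["freshness_minutes"])

def check_completeness(loaded_records: int, expected_records: int, sla: dict) -> bool:
    """True if the run delivered at least the required share of expected records."""
    if expected_records == 0:
        return True
    return 100 * loaded_records / expected_records >= sla["completeness_pct"]
```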
Worked examples
Example 1: Microbatch orders stream
- Baseline: 1.8M events/day, 800 bytes each
- Peak factor: 2.0 during campaigns
- Compression: 4:1
- Microbatch cadence: every 5 minutes; freshness SLA: data within 10 minutes
Calculation:
- Average per 5 min: 1,800,000 / (24*12) ≈ 6,250 events
- Peak per 5 min: 6,250 * 2.0 ≈ 12,500 events
- Peak size (uncompressed): 12,500 * 800 B ≈ 10,000,000 B ≈ 10 MB
- Peak size (compressed): 10 MB / 4 ≈ 2.5 MB
- To meet 10-minute freshness, process each 5-minute batch within ≤ 5 minutes: throughput ≥ 2.5 MB / 5 min = 0.5 MB/min (plus buffer)
- Daily compressed: (1.8M * 800 B) / 4 ≈ 360 MB
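The same arithmetic as a short script, using the numbers from this example:

```python
events_per_day = 1_800_000
bytes_per_event = 800
peak_multiplier = 2.0
compression_ratio = 4.0
batches_per_day = 24 * 12                                   # one microbatch every 5 minutes

avg_events = events_per_day / batches_per_day               # ≈ 6,250
peak_events = avg_events * peak_multiplier                  # ≈ 12,500
peak_mb = peak_events * bytes_per_event / 1e6               # ≈ 10 MB uncompressed
peak_mb_compressed = peak_mb / compression_ratio            # ≈ 2.5 MB

throughput_mb_per_min = peak_mb_compressed / 5              # 0.5 MB/min
daily_mb_compressed = events_per_day * bytes_per_event / compression_ratio / 1e6  # 360 MB
print(throughput_mb_per_min, daily_mb_compressed)
```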
Example 2: Web logs daily rollup
- Daily uncompressed: 120 GB; compression 4:1 → 30 GB compressed
- Load window: 2 hours after midnight
- SLA: daily aggregates ready by 02:15
Calculation:
- Throughput required: 30 GB / 2 h = 15 GB/h ≈ 250 MB/min
- Add 30% buffer: target ≥ 325 MB/min
- SLA phrasing: “Daily log aggregates available by 02:15 with ≥99.9% completeness; 99.5% monthly availability.”
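The same check in code, with the 30% buffer from the example:

```python
daily_compressed_gb = 120 / 4            # 120 GB uncompressed at 4:1 → 30 GB
window_hours = 2
buffer = 1.3

gb_per_hour = daily_compressed_gb / window_hours      # 15 GB/h
mb_per_min = gb_per_hour * 1000 / 60                  # ≈ 250 MB/min
print(round(mb_per_min), round(mb_per_min * buffer))  # 250, 325 (target with buffer)
```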
Example 3: Hourly file drops
- 48 CSV files/day, each 50 MB compressed
- Files arrive every hour in pairs (two files per hourly drop)
- SLA: each arrival batch processed within 20 minutes
Calculation:
- Per arrival batch: 2 * 50 MB = 100 MB
- Throughput required: 100 MB / 20 min = 5 MB/min
- With 50% buffer for spikes/retries: ≥ 7.5 MB/min sustained
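And the file-drop example, again with the buffer applied:

```python
files_per_batch = 2
mb_per_file = 50
sla_minutes = 20
buffer = 1.5

batch_mb = files_per_batch * mb_per_file     # 100 MB per arrival batch
required = batch_mb / sla_minutes            # 5 MB/min
print(required, required * buffer)           # 5.0 and 7.5 MB/min sustained
```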
Choosing and writing SLAs
- Freshness (a.k.a. latency to availability): “events visible in analytics within 10 minutes of capture.”
- Availability: “pipeline meets freshness and completeness 99.5% of minutes/month.”
- Completeness: “≥ 99.9% of expected partitions/files/records delivered per run.”
- RPO/RTO: “RPO ≤ 5 minutes, RTO ≤ 30 minutes for critical incidents.”
Good vs weak phrasing
- Good: “Daily sales snapshot ready by 06:00 UTC; if delayed, automated alert by 06:05; catch-up by 06:30.”
- Weak: “As fast as possible and usually fine.”
Capacity and scheduling checks
- Compute throughput: compare required vs measured. Keep 20–50% buffer.
- Parallelism: break into partitions (by date/hour/store) to run concurrent tasks.
- Scheduling: avoid overlapping heavy jobs; stagger starts; ensure upstream data cutoffs are met.
- Recovery: define retry backoff, dead-letter queues, and backfill windows.
Quick capacity sanity check
- Required throughput = peak_batch_bytes / window_minutes
- If engine throughput < required * 1.3, reduce batch size, add parallelism, or extend window.
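A small helper for this check is sketched below. The 1.3 headroom mirrors the rule of thumb above, and the parallelism suggestion assumes throughput scales roughly linearly with partitions, which you would verify on your own engine:

```python
import math

def capacity_check(peak_batch_mb: float, window_minutes: float,
                   engine_mb_per_min: float, headroom: float = 1.3) -> dict:
    """Compare measured engine throughput against the required rate plus headroom."""
    required = peak_batch_mb / window_minutes
    target = required * headroom
    ok = engine_mb_per_min >= target
    # If not ok: reduce batch size, add parallel partitions, or extend the window.
    partitions = 1 if ok else math.ceil(target / engine_mb_per_min)
    return {"required_mb_per_min": required, "meets_target": ok,
            "suggested_parallelism": partitions}

# Example: 3,000 MB peak batch, 60-minute window, engine measured at 40 MB/min
print(capacity_check(3000, 60, 40))   # requires 50 MB/min; 40 < 65 → scale to 2 partitions
```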
Common mistakes and self-check
- Ignoring peaks: design only for average rates. Fix: use 95th–99th percentile and add buffer.
- Vague SLAs: “near real-time.” Fix: state minutes, percentiles, and time zone.
- No growth plan: pipelines fail months later. Fix: forecast + planned business events.
- Missing upstream cutoffs: loading before data is complete. Fix: define data-ready signals.
- Underestimating storage: forget replicas/staging. Fix: calculate per zone with retention.
Self-check checklist
- [ ] I have average and peak volumes (rows and bytes).
- [ ] I applied growth and compression.
- [ ] I computed required throughput for the tightest window.
- [ ] I added at least 20–50% buffer.
- [ ] SLAs are measurable: freshness, availability, completeness, and recovery goals.
- [ ] Upstream cutoffs/backfill are defined.
Exercises
Try these before viewing solutions.
- Exercise 1 (volume & throughput): see task details below.
- Exercise 2 (write SLAs): see task details below.
Self-check for your answers:
- [ ] I wrote my assumptions explicitly.
- [ ] I calculated both average and peak.
- [ ] I included buffer and recovery statements.
Mini challenge
Your product team plans a marketing push: expect 3x traffic for 2 days next month. Current daily baseline is 4M events/day, 1 KB each, compression 4:1. Current microbatch is 10 minutes with freshness SLA of 15 minutes. Propose the new peak throughput requirement and a temporary SLA note for the campaign period. Keep it to two sentences.
Who this is for
- ETL Developers and Data Engineers who design and operate batch or streaming pipelines.
- Analysts/PMs who need realistic data delivery promises.
Prerequisites
- Basic ETL concepts: sources, staging, transforms, targets.
- Comfort with units: rows, bytes, MB/GB, minutes/hours.
- Awareness of your compute platform’s typical throughput.
Learning path
- Understand source behavior and cutoffs.
- Estimate volumes and peaks.
- Compute required throughput and plan parallelism.
- Write measurable SLAs and monitoring criteria.
- Validate with a pilot run and adjust buffers.
Practical projects
- Build a small microbatch pipeline (CSV → staging → warehouse) with metrics on batch size and duration.
- Create a capacity sheet that converts record counts and sizes into throughput and storage by zone.
- Draft SLA statements and wire alerts on freshness breaches.
Next steps
- Automate daily volume reports and peak detection.
- Implement backfill tooling and document RPO/RTO.
- Take the quick test below to lock in the concepts.