
Performance Testing

Learn Performance Testing for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As an ETL Developer, your pipelines must move data reliably at scale. Performance testing helps you prove that jobs meet SLAs, stay cost-efficient, and handle growth without surprises.

  • Validate that daily loads finish before business users arrive.
  • Catch bottlenecks (skewed joins, I/O limits, bad partitioning) early.
  • Estimate cost before increasing cluster size or parallelism.
  • Create baselines so future changes don't regress performance.

Concept explained simply

Performance testing is a focused experiment: you run your pipeline with representative data and measure throughput, latency, resource use, and error rates. You compare those results against expectations (SLA) and look for improvement opportunities.

Mental model

Think of your pipeline like a series of pipes. The flow is limited by the narrowest section (the bottleneck). Testing helps you find the narrowest pipe and decide whether to widen it, add another pipe in parallel, or reduce the flow per pipe.

Key outcomes to aim for
  • A clear baseline: time, throughput, and cost at a known input size.
  • Evidence of the bottleneck: CPU, memory, I/O, network, or data skew.
  • A scaling curve: how runtime changes as data volume or concurrency grows.

Metrics that matter

  • Latency: total job duration; stage/task durations; queue times.
  • Throughput: rows/second or MB/second per stage and end-to-end.
  • Resource utilization: CPU%, memory use, I/O bandwidth, network.
  • Cost: runtime × resources used (treat as approximate).
  • Reliability under load: error rates, retries, out-of-memory events.
  • Tail latency: p95/p99 times for stages or tasks (not just average).
How to create a good baseline
  • Fix versions/configuration and keep the environment stable.
  • Use representative data distribution (not only small samples).
  • Warm up once, then time at least three runs and take the median (a timing sketch follows this list).
  • Record input size, schema, cluster size, parallelism, and seed.
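
A minimal sketch of that warm-up-then-median loop in Python; run_pipeline is a placeholder for however you actually trigger the job (CLI call, orchestrator API, notebook cell):

```python
import statistics
import time

def run_pipeline():
    # Placeholder: trigger your ETL job here and block until it finishes,
    # e.g. a CLI call, an orchestrator API request, or a notebook cell.
    ...

def measure(runs=3):
    run_pipeline()  # warm-up run; timing discarded
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_pipeline()
        durations.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
        "all_runs_s": durations,
    }

if __name__ == "__main__":
    print(measure())
```

Store the output next to the configuration details above so later runs can be compared like-for-like.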

Plan your test in 5 steps

  1. Define the SLA: e.g., "Process 500 GB in under 45 minutes; p95 stage < 5 min."
  2. Select datasets: baseline (e.g., 100 GB), target (500 GB), and stress (700–800 GB).
  3. Choose scenarios: single-run baseline, concurrency (N parallel jobs), and scale-up (more data).
  4. Pick fixed settings: cluster size, partitions, retry policy, memory limits.
  5. Decide what to measure and how you'll record results (simple run log).
Simple run log template
Run ID: 2026-01-11-01
Input: 500 GB, 5.2B rows
Env: 16 workers, 8 vCPU each; parallelism=64
Metrics: total=41m, p95 stage=3.8m, throughput=2.1M rows/s, CPU=72%, IO=68%
Notes: skew on customer_id detected; added salting for next run
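
To keep that log machine-readable, one option is to append each run as a CSV row; a minimal sketch, with field names mirroring the template above (rename them to match whatever you actually capture):

```python
import csv
from pathlib import Path

FIELDS = ["run_id", "input_gb", "rows", "workers", "parallelism",
          "total_min", "p95_stage_min", "rows_per_s", "cpu_pct", "io_pct", "notes"]

def log_run(row: dict, path: str = "perf_runs.csv") -> None:
    """Append one run record; write the header only if the file is new."""
    is_new = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

log_run({
    "run_id": "2026-01-11-01", "input_gb": 500, "rows": 5_200_000_000,
    "workers": 16, "parallelism": 64, "total_min": 41, "p95_stage_min": 3.8,
    "rows_per_s": 2_100_000, "cpu_pct": 72, "io_pct": 68,
    "notes": "skew on customer_id; try salting next run",
})
```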

Worked examples

Example 1: Finding a throughput bottleneck

Scenario: The daily join of a 3B-row fact table with a 15M-row dimension table misses the SLA by 20%.

  • Observation: CPU ~40%, I/O ~85%, network moderate, long shuffle reads.
  • Action: Increase file read parallelism and ensure larger block size; avoid small files.
  • Result: Throughput +25%, job now meets SLA.
Why it worked

I/O was the narrowest pipe. Improving read parallelism and reducing small-file overhead raised effective bandwidth.
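
If the pipeline runs on Spark, the changes described above might look roughly like the sketch below; the config values, paths, and column names are illustrative rather than recommendations, and other engines expose equivalent read and compaction knobs:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("io_tuning_sketch")
    # Illustrative: let each read task take larger chunks to cut per-file overhead.
    .config("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)
    # Illustrative: pack many small source files into fewer read tasks.
    .config("spark.sql.files.openCostInBytes", 8 * 1024 * 1024)
    .getOrCreate()
)

fact = spark.read.parquet("s3://bucket/fact/")  # hypothetical paths and schema
dim = spark.read.parquet("s3://bucket/dim/")

result = fact.join(dim, "dim_id").groupBy("dim_id").count()

# Compact the output: fewer, larger files keep the next job's reads I/O-friendly.
result.repartition(64).write.mode("overwrite").parquet("s3://bucket/out/")
```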

Example 2: Fixing data skew in a join

Scenario: Some tasks run 8–10x longer during a customer_id join.

  • Observation: Long tail tasks; a few keys dominate the dataset.
  • Action: Add a salt key to heavy keys or pre-aggregate high-frequency keys; increase shuffle partitions moderately.
  • Result: Tail latency (p95) down 60%, total runtime down 35%.
Why it worked

Skew pushed too much data to a few reducers. Salting and better partitioning distributed work more evenly.
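
One common way to implement that salting in PySpark is sketched below; the paths, the customer_id key, and the bucket count are placeholders to adapt to your data and the severity of the skew:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted_join_sketch").getOrCreate()
SALT_BUCKETS = 16  # illustrative; size it to the observed skew

fact = spark.read.parquet("s3://bucket/fact/")       # large side, skewed on customer_id
dim = spark.read.parquet("s3://bucket/customers/")   # small side

# Fact side: scatter each key's rows across SALT_BUCKETS buckets.
fact_salted = fact.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Dim side: replicate every row once per bucket so salted keys still match.
dim_salted = dim.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = fact_salted.join(dim_salted, ["customer_id", "salt"]).drop("salt")
```

On recent Spark versions it is also worth trying adaptive query execution's skew handling (spark.sql.adaptive.skewJoin.enabled) before hand-rolling salt keys.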

Example 3: Measuring concurrency effects

Scenario: Running 6 daily pipelines in parallel causes timeouts.

  • Observation: CPU 90–95%, I/O 70%, queuing delays before heavy stages.
  • Action: Stagger start times and cap per-job parallelism; test 2, 3, 4 concurrent jobs.
  • Result: Four concurrent jobs with reduced per-job parallelism deliver the best total throughput without timeouts.
Why it worked

Total cluster capacity is finite. Reducing per-job contention improved overall throughput and stability.
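
A simple way to experiment with capped, staggered concurrency from a driver script is sketched below; launch_job, the job names, and the stagger interval are placeholders for your own submission mechanism:

```python
import time
from concurrent.futures import ThreadPoolExecutor

JOBS = ["sales", "inventory", "customers", "billing", "clicks", "returns"]  # illustrative

def launch_job(name: str) -> float:
    """Placeholder: submit the pipeline and block until it completes; return duration."""
    start = time.perf_counter()
    # e.g. subprocess.run(["etl", "run", name], check=True)
    return time.perf_counter() - start

def run_batch(max_concurrent: int, stagger_s: float = 60.0) -> dict:
    durations = {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {}
        for job in JOBS:
            futures[job] = pool.submit(launch_job, job)  # queued beyond the cap
            time.sleep(stagger_s)                        # stagger submissions
        for job, fut in futures.items():
            durations[job] = fut.result()
    return durations

# Sweep concurrency caps and compare per-job durations and total wall-clock time.
for cap in (2, 3, 4):
    print(cap, run_batch(cap, stagger_s=0))
```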

How to run and read tests

  • Warm-up: Run once and ignore the timing; caches and JIT compilers need a run to stabilize.
  • Measure: Run at least 3 times; use median. Track p95/p99 for critical stages.
  • Change one thing at a time: partition count, join strategy, memory limit, or file layout.
  • Interpret: High CPU = compute-bound; High I/O = bandwidth/latency bound; Long tails = skew or stragglers.
Quick diagnosis guide
  • CPU >85%, I/O <50%: optimize compute (filter earlier, prune columns, change join).
  • I/O >80%, CPU <60%: batch reads, larger blocks, fewer tiny files, cache key data.
  • High GC/memory: reduce shuffle size, spill earlier, increase partitions.
  • Long tail tasks: detect skew; use salting or pre-aggregation (see the triage sketch below).
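
The thresholds above are rules of thumb, but they are easy to wrap in a small triage helper you can run against your log; a minimal sketch with the same illustrative cutoffs:

```python
def diagnose(cpu_pct: float, io_pct: float, p95_s: float, median_s: float) -> str:
    """Rough triage using the rule-of-thumb thresholds from the guide above."""
    if p95_s > 3 * median_s:
        return "long tail: check for skew; consider salting or pre-aggregation"
    if cpu_pct > 85 and io_pct < 50:
        return "compute-bound: filter earlier, prune columns, revisit the join strategy"
    if io_pct > 80 and cpu_pct < 60:
        return "I/O-bound: larger blocks, fewer tiny files, cache hot data"
    return "no single obvious bottleneck: check memory/GC, shuffle size, partition count"

print(diagnose(cpu_pct=72, io_pct=68, p95_s=228, median_s=95))
```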

Hands-on exercises

These mirror the practice tasks below. Do them on a sandbox or sample dataset you control.

Exercise 1: Establish a baseline and scale curve (matches ex1)

  1. Pick a representative pipeline (read, join, aggregate, write).
  2. Run warm-up, then 3 timed runs at input sizes: 50 GB, 100 GB, 200 GB.
  3. Record median time, p95 stage time, throughput, CPU/I/O for each size.
  4. Plot size vs runtime; note whether scaling is near-linear or worse (a quick checking sketch follows this checklist).
Checklist:
  • I recorded environment and configs
  • I timed 3 runs per size and used medians
  • I identified the slowest stage
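
A quick way to check step 4 numerically once you have the medians; the runtimes below are placeholders for your own measurements, and the 1.2 tolerance is only an illustration:

```python
# Median runtime (minutes) per input size (GB) -- replace with your measurements.
results = {50: 9.0, 100: 18.5, 200: 44.0}

sizes = sorted(results)
for small, big in zip(sizes, sizes[1:]):
    data_ratio = big / small
    time_ratio = results[big] / results[small]
    verdict = "near-linear" if time_ratio <= 1.2 * data_ratio else "worse than linear"
    print(f"{small} GB -> {big} GB: data x{data_ratio:.1f}, time x{time_ratio:.2f} ({verdict})")
```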

Exercise 2: Diagnose and reduce tail latency (matches ex2)

  1. Choose a join-heavy stage with uneven task times.
  2. Collect per-task durations and data volume per key or partition (a skew-check sketch follows this checklist).
  3. Apply one mitigation: salting, pre-aggregating hot keys, or increasing partitions moderately.
  4. Re-run and compare p95 and total runtime.
Checklist:
  • I confirmed skew using per-key or per-partition metrics
  • I changed only one setting
  • I verified improvement with new measurements
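
For step 2, one way to confirm skew from per-key volumes in PySpark is sketched below; the input path and the customer_id key are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew_check_sketch").getOrCreate()
fact = spark.read.parquet("s3://bucket/fact/")  # hypothetical input

# Rows per join key, heaviest first.
key_counts = fact.groupBy("customer_id").count().orderBy(F.desc("count"))
key_counts.show(10)

# Compare the heaviest key against the median key to quantify the skew.
top = key_counts.first()["count"]
median = key_counts.approxQuantile("count", [0.5], 0.01)[0]
print(f"heaviest key carries {top / median:.0f}x the median key's volume")
```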

Common mistakes and self-check

  • Testing with unrealistically small samples. Self-check: Does the data distribution match production?
  • Changing multiple settings at once. Self-check: Can you attribute improvement to one change?
  • Relying on averages only. Self-check: Did you record p95/p99 stage times?
  • Comparing runs across unstable environments. Self-check: Are versions, cluster size, and configs fixed?
  • Ignoring cost. Self-check: Did you estimate the resource cost of the improvement?

Practical projects

  • Build a performance baseline report: a simple HTML or CSV log that captures metrics for 3 data sizes and 2 concurrency levels.
  • Create a skew detection notebook: input a sampled dataset, output key frequency histograms and suggested salting factors.
  • Design a regression test: run a nightly small performance job and alert if runtime deviates by more than 15% from the baseline (a sketch follows this list).
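
For the regression-test project, the core check is small; a sketch that assumes a stored baseline figure and a hypothetical notify hook you would wire to your own alerting:

```python
BASELINE_MIN = 41.0  # baseline median runtime in minutes, e.g. from your run log
THRESHOLD = 0.15     # alert if runtime drifts more than 15% from the baseline

def notify(message: str) -> None:
    # Placeholder: send to Slack, email, or your monitoring system of choice.
    print(message)

def check_regression(latest_min: float) -> None:
    drift = (latest_min - BASELINE_MIN) / BASELINE_MIN
    if drift > THRESHOLD:
        notify(f"Perf regression: {latest_min:.1f} min is {drift:.0%} over baseline")
    elif drift < -THRESHOLD:
        notify(f"Unexpected speedup ({drift:.0%}); the baseline may need refreshing")

check_regression(latest_min=49.5)
```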

Who this is for

  • ETL Developers responsible for data movement and transformations.
  • Data Engineers optimizing batch and micro-batch jobs.
  • Anyone setting or meeting pipeline SLAs.

Prerequisites

  • Comfort with SQL and basic ETL patterns (extract, join, aggregate, write).
  • Understanding of partitions, parallelism, and file formats.
  • Ability to run pipelines in a controlled environment.

Learning path

  1. Learn to measure: run/warm-up/repeat; collect medians and p95.
  2. Learn to diagnose: map metrics to bottlenecks (CPU, I/O, skew, memory).
  3. Learn to fix: partitioning, salting, pruning, join strategies, file layout.
  4. Learn to scale: concurrency testing and cost-aware tuning.
  5. Automate: baseline reports and simple alerts for regressions.

Next steps

  • Complete the exercises below and record your results.
  • Take the quick test to confirm your understanding.
  • Apply one optimization to a real pipeline and measure the impact.

Mini challenge

Your 100 GB job completes in 18 minutes. At 300 GB it takes 75 minutes, so 3× the data takes more than 4× as long. Find the most likely issue and one action to test next. Write a 3-line plan describing the hypothesis, the metric to watch, and the single change you will make.

Check your knowledge

Take the quick test below to confirm your understanding.

Practice Exercises

2 exercises to complete

Instructions (Exercise 1: Baseline and scale curve)

Goal: Build a reliable baseline and observe how runtime scales with data size.

  1. Select a pipeline that reads, joins, aggregates, and writes.
  2. Prepare three input sizes (e.g., 50, 100, 200 GB or proportional row counts).
  3. Warm up once. Then run 3 times per size. Record medians for: total time, p95 stage, throughput, CPU%, I/O%.
  4. Plot size vs runtime. Note whether scaling is near-linear.

Record results in a simple table or CSV.

Expected Output
A short report with medians per input size and a note stating whether scaling is near-linear or indicates a bottleneck.

Performance Testing — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

