Why performance and scalability matter for ETL Developers
As an ETL Developer, your pipelines feed dashboards, ML features, and downstream apps. Performance and scalability determine whether data arrives on time, at the right cost, and without disrupting shared systems. You will routinely choose partition strategies, parallelism, bulk load methods, transform patterns, and testing approaches that balance speed, reliability, and spend. Done well, these choices let you:
- Deliver data within SLAs, even as volumes grow 10x.
- Cut compute and storage costs by eliminating waste.
- Design pipelines that scale horizontally and recover predictably.
Quick wins you can apply this week
- Add a selective filter early to shrink data before expensive transforms.
- Switch row-by-row inserts to bulk loading where supported.
- Load into a staging table, then swap into production with an atomic rename.
- Measure 95th percentile latency and throughput; track them per run.
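For the last quick win, here is a minimal sketch of tracking those metrics in plain Python; run_metrics, record_run, and summarize are illustrative names, not part of any specific framework.
import statistics
import time
run_metrics = []  # one entry appended per pipeline run
def record_run(load_step):
    start = time.monotonic()
    rows = load_step()                                 # assume the load step returns rows processed
    run_metrics.append({"duration_s": time.monotonic() - start, "rows": rows})
def summarize():
    durations = [m["duration_s"] for m in run_metrics]
    p95 = statistics.quantiles(durations, n=20)[-1]    # 95th percentile; needs at least two runs
    throughput = sum(m["rows"] for m in run_metrics) / sum(durations)
    return {"p95_latency_s": p95, "rows_per_second": throughput}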
Who this is for
- ETL Developers and Data Engineers building batch or micro-batch pipelines.
- Analysts/Developers transitioning from SQL scripting to production-grade data flows.
- Engineers optimizing slow or costly pipelines.
Prerequisites
- Comfortable with SQL (joins, window functions, DDL).
- Basic scripting in Python or a similar language.
- Familiarity with at least one data warehouse or database, and an orchestrator.
Learning path
1) Define SLAs, pick metrics (throughput, latency, cost per run), and profile current pipelines. Capture input sizes and critical steps.
2) Choose partition keys, set parallel read/write safely, and avoid skew. Validate with small-scale tests.
3) Replace row inserts with bulk methods, staged loads, and batched commits. Plan for retries and idempotency (see the batched-commit sketch after this list).
4) Push down filters/aggregations, remove unnecessary sorts/shuffles, broadcast small joins, and cache only where it pays off.
5) Load into staging with minimal indexes, then apply/refresh indexes after load. Analyze/vacuum as needed.
6) Tune batch sizes and spill locations. Size temp tablespaces and monitor spills, shuffles, and disk I/O.
7) Create realistic test data, run multiple iterations, compare cold vs warm cache, and track p95 and p99.
8) Right-size compute, use incremental loads, avoid egress, and auto-suspend idle resources.
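A minimal sketch of step 3's batched commits with retries, using sqlite3 from the standard library purely so the example runs anywhere; in production you would point a real driver at your warehouse, prefer its bulk/COPY path, and get idempotency from a natural-key merge as in worked example 2 below.
import sqlite3
import time
def load_in_batches(rows, batch_size=10_000, retries=3):
    conn = sqlite3.connect("stage.db")
    conn.execute("CREATE TABLE IF NOT EXISTS stage_events (event_date TEXT, customer_id INTEGER, amount REAL)")
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        for attempt in range(retries):
            try:
                conn.executemany("INSERT INTO stage_events VALUES (?, ?, ?)", batch)
                conn.commit()                          # one commit per batch, not per row
                break
            except sqlite3.OperationalError:
                conn.rollback()
                time.sleep(2 ** attempt)               # back off, then retry the same batch
        else:
            raise RuntimeError(f"batch at offset {i} failed after {retries} attempts")
    conn.close()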
Worked examples
1) Design partitions for a growing fact table
-- Goal: daily loads, queries by event_date and customer_id.
-- Strategy: range partition by month, optionally sub-partition by customer hash.
-- Example DDL (PostgreSQL-style syntax)
CREATE TABLE fact_events (
event_date date,
customer_id bigint,
event_type text,
amount numeric(12,2),
payload jsonb
)
PARTITION BY RANGE (event_date);
-- Create monthly partitions (automation recommended)
CREATE TABLE fact_events_2025_01 PARTITION OF fact_events
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
-- Consider clustering/sorting by (customer_id, event_date) inside partitions
-- to improve range scans and reduce file merges.
Why: Partition pruning reduces scans, and monthly partitions balance manageability with chunk size for compaction and archival.
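The "automation recommended" comment above can be a small script that emits the monthly DDL. A hedged sketch, assuming the fact_events_YYYY_MM naming pattern used above:
from datetime import date
def monthly_partition_ddl(year, month):
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"fact_events_{year}_{month:02d}"
    return (f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF fact_events "
            f"FOR VALUES FROM ('{start}') TO ('{end}');")
# Emit a year's worth of partitions for your migration tool or orchestrator to run.
for m in range(1, 13):
    print(monthly_partition_ddl(2025, m))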
2) Bulk load with staging + swap
-- Use a staging table with minimal or no indexes
CREATE TEMP TABLE stage_events AS SELECT * FROM fact_events WHERE 1=0;
-- Bulk load from files (example: PostgreSQL COPY)
COPY stage_events FROM '/data/events_2025_01.csv' CSV HEADER;
-- Deduplicate/merge (idempotent upsert pattern)
-- Example: insert only new rows based on a natural key
INSERT INTO fact_events
SELECT s.*
FROM stage_events s
LEFT JOIN fact_events f
ON f.event_date = s.event_date AND f.customer_id = s.customer_id AND f.event_type = s.event_type
WHERE f.event_date IS NULL;
-- Refresh stats for the touched partitions
ANALYZE fact_events;
Why: COPY-style bulk loading is typically orders of magnitude faster than row-by-row inserts. A staging table keeps data-quality checks and retries away from the production table.
3) Parallel read/write and skew handling (Spark example)
from pyspark.sql import SparkSession, functions as F  # F.broadcast is used below
spark = SparkSession.builder.appName('etl_perf_example').getOrCreate()
# Read with predicate pushdown and controlled partitions
base = (spark.read
.format('parquet')
.load('s3://bucket/events/')
.where("event_date >= '2025-01-01'")
)
# Repartition by a high-cardinality key to parallelize joins
by_cust = base.repartition(200, 'customer_id')
# Broadcast a small dimension table
dim = spark.read.format('parquet').load('s3://bucket/dim_customers/')
joined = by_cust.join(F.broadcast(dim), 'customer_id', 'left')
# Write partitioned output for pruning later
(joined
.repartition(200, 'event_date')
.write
.mode('overwrite')
.partitionBy('event_date')
.parquet('s3://bucket/curated/fact_events'))
Why: Repartitioning by a suitable key balances tasks. Broadcasting small dimensions removes large shuffles. Writing partitioned output enables pruning.
4) Minimize data movement via SQL pushdown
-- Instead of extracting full tables to a compute engine, push filters and aggregations to the database.
WITH changed AS (
SELECT *
FROM source.orders
WHERE updated_at >= (CURRENT_DATE - INTERVAL '1 day')
), prepped AS (
SELECT customer_id, order_date::date AS d, SUM(amount) AS gross
FROM changed
WHERE status = 'completed'
GROUP BY customer_id, d
)
MERGE INTO mart.daily_sales t
USING prepped s
ON (t.customer_id = s.customer_id AND t.d = s.d)
WHEN MATCHED THEN UPDATE SET gross = s.gross
WHEN NOT MATCHED THEN INSERT (customer_id, d, gross) VALUES (s.customer_id, s.d, s.gross);
Why: Pushing computation to the database avoids network egress and leverages optimized engines for grouping/joining.
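The same principle applies when the transform runs in Spark instead of the warehouse: hand the connector a query rather than a table name so the database does the filtering and grouping. A sketch under assumed connection details (the URL, credentials, and column names are placeholders, and the JDBC driver must be on the classpath):
pushdown_sql = """
    SELECT customer_id, CAST(order_date AS DATE) AS d, SUM(amount) AS gross
    FROM orders
    WHERE status = 'completed' AND updated_at >= CURRENT_DATE - INTERVAL '1 day'
    GROUP BY customer_id, CAST(order_date AS DATE)
"""
daily = (spark.read
    .format('jdbc')
    .option('url', 'jdbc:postgresql://source-host:5432/source')
    .option('query', pushdown_sql)           # the database runs the filter and aggregation
    .option('user', 'etl').option('password', '***')
    .load())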
5) Index lifecycle around bulk loads
-- Before: heavy indexes slow loads due to maintenance per row.
-- Strategy: load into staging with minimal indexes, then apply.
-- Drop non-critical indexes before a massive backfill
DROP INDEX IF EXISTS idx_fact_events_customer_date;
-- Perform the bulk load into staging, merge, then recreate
CREATE INDEX CONCURRENTLY idx_fact_events_customer_date
ON fact_events (customer_id, event_date);
-- Refresh stats after significant changes
ANALYZE fact_events;
Why: Index writes during bulk load cause extra I/O. Rebuilding afterward is often faster overall.
Drills and exercises
- Pick one pipeline and record baseline: input size, runtime, peak memory, p95 latency.
- Enable partition pruning for a large table; measure scan bytes before and after.
- Replace row inserts with a bulk load + staging approach; compare throughput.
- Identify one shuffle-heavy transform and reduce it (broadcast, better join key, or pre-aggregation).
- Add a synthetic load test with 2x data volume; verify runtime scaling and costs (a small generator sketch follows this list).
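A hedged sketch of that synthetic load test; the row shape and scale factor are illustrative and should mirror your real fact table.
import csv
import random
from datetime import date, timedelta
def generate_events(path, base_rows=100_000, scale=2):
    start = date(2025, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["event_date", "customer_id", "event_type", "amount"])
        for _ in range(base_rows * scale):
            writer.writerow([
                start + timedelta(days=random.randint(0, 30)),
                random.randint(1, 50_000),
                random.choice(["view", "cart", "purchase"]),
                round(random.uniform(1, 500), 2),
            ])
generate_events("events_2x.csv", scale=2)     # rerun with scale=5 for a stress case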
Common mistakes and how to fix them
- Choosing low-cardinality partition keys leading to hotspots. Fix: partition by time plus a higher-cardinality secondary key or hash bucketing.
- Over-indexing targets during loads. Fix: stage loads, minimal indexes first, rebuild post-load.
- Moving data to compute unnecessarily. Fix: push down filters/joins/aggregations into the database or storage engine.
- Too many small files or micro-batches. Fix: compact files, use sensible batch sizes, and coalesce outputs (see the compaction sketch after this list).
- Ignoring memory spills. Fix: tune batch size, increase shuffle spill paths, and reduce sorts.
- No cost guardrails. Fix: set concurrency limits, auto-suspend idle compute, and prefer incremental over full refresh.
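For the small-files fix above, a hedged compaction sketch that reuses the spark session from the worked examples; the paths and target file count are placeholders.
small = spark.read.parquet('s3://bucket/curated/fact_events/event_date=2025-01-15')
compacted = small.coalesce(8)                 # a handful of larger files instead of hundreds of tiny ones
(compacted.write
    .mode('overwrite')
    .parquet('s3://bucket/curated/fact_events_compacted/event_date=2025-01-15'))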
Debugging recipes
Find the bottleneck step
Add timing around each stage (extract, transform, load). Export per-step metrics and compare to historical baselines. The slowest stage gets the first optimization pass.
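A minimal per-stage timing sketch in plain Python; timed and stage_timings are illustrative names, and the stage bodies are stand-ins for your real steps.
import time
from contextlib import contextmanager
stage_timings = {}
@contextmanager
def timed(stage):
    start = time.monotonic()
    try:
        yield
    finally:
        stage_timings[stage] = time.monotonic() - start
with timed('extract'):
    rows = list(range(1_000_000))             # stand-in for the real extract
with timed('transform'):
    total = sum(rows)                         # stand-in for the real transform
print(stage_timings)                          # export per run and compare to baselines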
Throughput vs latency trade-off
Increase parallel writers to raise throughput, but cap concurrency to avoid warehouse slot contention. Measure both end-to-end latency and per-task duration.
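A sketch of capping writer concurrency with a thread pool; write_partition, PARTITIONS, and MAX_WRITERS are illustrative, and the cap should be tuned against observed slot contention.
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
PARTITIONS = [f"2025-01-{d:02d}" for d in range(1, 11)]
def write_partition(day):
    time.sleep(0.1)                           # stand-in for a real partition write
    return day
MAX_WRITERS = 4                               # the concurrency cap
with ThreadPoolExecutor(max_workers=MAX_WRITERS) as pool:
    futures = [pool.submit(write_partition, day) for day in PARTITIONS]
    for f in as_completed(futures):
        print('loaded', f.result())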
Detect and fix data skew
Compute counts per partition key and visualize the distribution. If skewed, use higher-cardinality keys, add salt to keys, or pre-aggregate skewed groups.
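A hedged sketch of both steps, reusing base, dim, and spark from the Spark worked example; the salt bucket count is an assumption to tune.
from pyspark.sql import functions as F
counts = base.groupBy('customer_id').count().orderBy(F.desc('count'))
counts.show(20)                               # a few keys holding most rows indicates skew
SALT_BUCKETS = 16
salted = base.withColumn('salt', (F.rand() * SALT_BUCKETS).cast('int'))
dim_salted = dim.crossJoin(spark.range(SALT_BUCKETS).withColumnRenamed('id', 'salt'))
joined = salted.join(dim_salted, ['customer_id', 'salt'], 'left')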
Mini project: Incremental sales pipeline at scale
Build a daily incremental pipeline that ingests orders, aggregates daily sales, and serves a dashboard table.
- Partition raw and curated tables by order_date.
- Use staging + bulk load for ingestion. Implement idempotent merges.
- Push down filtering and aggregation into the database where possible.
- Add a broadcast join to enrich orders with customer attributes.
- Measure baseline and p95 runtime, then tune partition count to hit SLA.
- Add a cost control: auto-suspend compute or limit max parallel tasks.
Practical projects
- Backfill historical events using monthly partitions with a safe concurrency limit; compare before/after cost and runtime.
- Convert a slow join pipeline to use broadcast joins and file compaction; validate correctness and performance.
- Create a performance test harness that generates synthetic data 1x, 2x, 5x baseline and reports scaling behavior.
Subskills
- Partitioning And Parallel Loads — Choose effective partition keys, prune scans, and set safe parallelism without overwhelming targets.
- Bulk Load Techniques — Use COPY/LOAD utilities, staging tables, and batched commits for high-throughput ingestion.
- Indexing Awareness For Targets — Plan index lifecycle around loads; refresh stats and avoid unnecessary maintenance work.
- Minimizing Data Movement — Push down compute, filter early, and avoid cross-region transfers.
- Optimizing Transform Steps — Remove unnecessary sorts/shuffles, broadcast small joins, and cache only when it pays off.
- Managing Memory And Temp Storage — Tune batch sizes, spill locations, and temp tablespaces; monitor I/O and spills.
- Performance Testing — Build realistic benchmarks, track p95/p99, and compare cold vs warm runs.
- Cost Aware Pipeline Design — Right-size compute, prefer incremental loads, and cap concurrency to control spend.
Next steps
- Pick one production pipeline and apply two optimizations from above; measure and document impact.
- Schedule a monthly performance review to catch regressions early.
- Continue with the subskills below, then validate your knowledge in the skill exam.
Skill exam
The exam at the end of this page checks practical understanding of partitioning, bulk loads, indexing, transforms, memory, testing, and cost controls. Everyone can take it for free. If you are logged in, your progress and best score will be saved.