Who this is for
ETL Developers and data engineers who want faster pipelines, lower cloud costs, and stable production workloads by reducing unnecessary reads, transfers, and shuffles.
Prerequisites
- Comfort with SQL (SELECT, JOIN, GROUP BY, WHERE)
- Basic understanding of columnar formats (Parquet/ORC) and table partitioning
- Familiarity with a data warehouse or a distributed engine (e.g., Spark, BigQuery, Snowflake, Redshift, Databricks)
Why this matters
Real ETL tasks that benefit directly from minimizing data movement:
- Meeting tight SLA windows by filtering and aggregating data where it lives, instead of exporting it first.
- Reducing cloud egress and network costs by avoiding cross-region or cross-system transfers.
- Stabilizing joins on huge facts by broadcasting small dimensions instead of shuffling terabytes.
- Speeding incremental loads by processing only new/changed data (CDC/watermarks).
- Improving reliability by cutting steps that move data between tools or file formats.
Concept explained simply
Minimizing data movement means doing the smallest possible reads, transfers, and shuffles needed to accomplish your transformation. Bring the compute to the data, not the data to an external script. Read fewer rows and fewer columns. Filter early. Join smartly. Keep transformations inside the platform that already stores the data.
Mental model: Data has gravity
Imagine data as heavy objects. Moving them is slow and costly. Instead of lifting everything, work on it where it sits: push filters down, select only the columns you need, and avoid transporting raw data to external tools when the warehouse can do the work faster and cheaper.
Core techniques to reduce data movement
- Predicate pushdown: Push WHERE filters to the storage/warehouse so only needed row groups/partitions are read.
- Projection pushdown: Select only the columns you need. Avoid SELECT * on wide tables.
- Partition pruning: Partition by time or other low-cardinality, stable keys that queries filter on, and query only the relevant partitions.
- Data locality: Run transformations in the warehouse/lake engine containing the data (ELT pattern) instead of exporting to external scripts.
- Join strategy: Broadcast small dimensions; avoid large shuffles; co-locate and cluster/bucket on join keys where appropriate.
- Incremental processing: Use CDC or watermarks to process only new/changed data, not full reloads.
- Columnar formats & compression: Store as Parquet/ORC with compression to read fewer bytes over the wire.
- Materialized views / result reuse: Cache hot aggregates near compute to prevent re-scanning raw data (see the sketch after this list).
- Late materialization: Read lightweight keys first; fetch heavy columns only for matched keys after filtering.
- Right-sized files: Avoid too many tiny files or excessively large ones to reduce overhead and retries.
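The materialized-view technique above can look like the sketch below. Names are illustrative, and creation and refresh syntax varies by engine (Snowflake, BigQuery, and Redshift each differ slightly):
-- Precompute a hot daily aggregate once, where the data lives, so downstream queries stop re-scanning raw sales
CREATE MATERIALIZED VIEW daily_sales AS
SELECT sale_date, country, SUM(amount) AS revenue
FROM sales_parquet
GROUP BY sale_date, country;
-- Downstream queries read the small precomputed result instead of the wide raw table
SELECT country, SUM(revenue) AS revenue_7d
FROM daily_sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY country;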
Worked examples
Example 1: Predicate + projection pushdown on columnar data
Goal: Daily revenue for the last 7 days by country from a Parquet dataset partitioned by sale_date.
-- Good (pushdown + projection)
SELECT sale_date, country, SUM(amount) AS daily_revenue
FROM sales_parquet
WHERE sale_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY sale_date, country;
-- Bad (moves data): EXPORT all rows/cols to a script, then filter/aggregate there
- Reads only needed partitions (last 7 days).
- Selects only sale_date, country, amount.
- No external data export.
Example 2: Broadcast small dimension to avoid shuffling a huge fact
Goal: Spend by country for the last 30 days. Fact events: 2B rows. Dim users: 200K rows.
-- Warehouse SQL (most engines choose a broadcast join automatically when the dimension is small enough)
WITH f AS (
  SELECT user_id, spend
  FROM events
  WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
)
SELECT u.country, SUM(f.spend) AS spend_30d
FROM f
JOIN users u USING (user_id)
GROUP BY u.country;
-- Spark-style hint (if applicable)
-- SELECT /*+ BROADCAST(u) */ ...
- Filters fact first (last 30 days) to shrink scanned data.
- Broadcasts small users table to avoid shuffling the 2B-row fact.
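For reference, a complete version of that Spark-style hint could look like the sketch below (assuming Spark SQL; hint names and interval syntax differ between engines, and the planner may ignore hints it cannot honor):
-- Ask the planner to ship the small users table to every worker instead of shuffling the 2B-row fact
WITH f AS (
  SELECT user_id, spend
  FROM events
  WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
)
SELECT /*+ BROADCAST(u) */ u.country, SUM(f.spend) AS spend_30d
FROM f
JOIN users u USING (user_id)
GROUP BY u.country;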
Example 3: Incremental MERGE (ELT) instead of full reload
Goal: Upsert only yesterday's changes from a CDC feed into the target table.
MERGE INTO target t
USING (
  SELECT *
  FROM cdc_orders
  WHERE cdc_timestamp >= CURRENT_DATE - INTERVAL '1' DAY
    AND cdc_timestamp < CURRENT_DATE
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...)
;
- Processes only new/changed rows.
- Avoids re-reading full history each run.
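If no CDC feed is available, a watermark gives the same effect: remember the highest change timestamp already loaded and read only rows beyond it. A minimal sketch, assuming a hypothetical control table etl_watermarks(table_name, last_loaded_at) and a source orders table with an updated_at column:
-- Watermark-based incremental load: read only rows changed since the last successful run
MERGE INTO target t
USING (
  SELECT *
  FROM orders
  WHERE updated_at > (SELECT last_loaded_at FROM etl_watermarks WHERE table_name = 'orders')
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...)
;
-- After the run succeeds, advance last_loaded_at to the maximum updated_at that was processed
-- (safer than CURRENT_TIMESTAMP, which can skip rows committed while the load was running).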
Example 4: Late materialization
Goal: Identify the top 10% of customers by spend, then fetch their profiles (which include large text/JSON columns).
WITH ranked AS (
  SELECT customer_id, SUM(spend) AS total
  FROM fact_spend
  WHERE txn_date >= CURRENT_DATE - INTERVAL '90' DAY
  GROUP BY customer_id
), top_ids AS (
  SELECT customer_id
  FROM ranked
  QUALIFY NTILE(10) OVER (ORDER BY total DESC) = 1
)
SELECT p.*
FROM customer_profiles p
JOIN top_ids t USING (customer_id);
- Scan and aggregate numeric columns first.
- Fetch heavy profile columns only for the small matched set.
Exercises
Do these to solidify the concepts; a quick test follows below.
Exercise 1 — Pushdown and ELT rewrite
You currently export a full sales table to CSV and compute the last 7 days' revenue by country in a Python script. Rewrite the logic to run inside the warehouse or lake engine, reading only the minimum partitions and columns.
- Keep compute where the data lives.
- Read only sale_date, country, amount.
- Filter sale_date to last 7 days before aggregation.
Tip: Checklist
- Is WHERE applied before any join/aggregation?
- Are only 3 columns selected?
- Is there any external export? (There shouldn't be.)
Exercise 2 — Join strategy to avoid shuffles
Compute 30-day spend by country from a huge events fact and a small users dimension. Ensure the plan avoids shuffling the big table across the network.
- Filter the fact first by date.
- Broadcast or co-locate the small dimension.
- Aggregate after the join.
Tip: Checklist
- Did you filter the fact before the join?
- Did you choose a broadcast or clustered join on user_id?
- No data export to external tools.
Common mistakes and self-check
- SELECT * on wide tables. Self-check: Can you list exactly which columns the downstream step needs?
- No partition filter. Self-check: Does your WHERE clause align with the partition column(s)?
- Exporting to external scripts for simple SQL tasks. Self-check: Can your warehouse do this join/agg directly?
- Joining first, filtering later. Self-check: Are you filtering big facts before the join?
- Full reloads for incremental tasks. Self-check: Do you have a reliable watermark or CDC feed?
- Broadcasting large tables. Self-check: Is the broadcast input truly small (e.g., < a few hundred MB)?
Practical projects
- Build a last-30-days revenue mart that uses partition pruning, projection pushdown, and a MERGE-based incremental load.
- Refactor a star-schema join to broadcast dimensions and measure reduction in shuffle bytes.
- Create a materialized daily_sales view and compare scan bytes vs a raw-table query.
- Implement late materialization for an enrichment step that adds heavy JSON columns only for matched IDs.
Learning path
- Identify the top 3 most expensive pipelines by scanned bytes or egress cost.
- Apply predicate and projection pushdown to each, then re-measure scan bytes (see the EXPLAIN sketch after this list).
- Optimize joins: broadcast small tables or use clustering/bucketing on keys.
- Switch full reloads to incremental via CDC or watermarks.
- Add materialized views or result caching for frequent aggregates.
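A quick way to confirm that pruning and pushdown took effect, before re-measuring scan bytes, is to inspect the query plan. A minimal sketch, assuming an engine that supports EXPLAIN (plan output and terminology vary; Spark, for instance, reports PartitionFilters and PushedFilters on the file scan node):
EXPLAIN
SELECT sale_date, country, SUM(amount) AS daily_revenue
FROM sales_parquet
WHERE sale_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY sale_date, country;
-- Look for the date filter in the scan/partition step of the plan; if it only appears after a full scan,
-- the filter was not pushed down and every partition is still being read.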
Mini tasks to practice
- Remove SELECT * in two queries and list exact needed columns.
- Add a date filter that matches the table’s partition column.
- Enable or hint broadcast on a dimension < 200 MB and compare run time.
Next steps
- Run the exercises above on a real dataset and record scan bytes and duration before/after.
- Take the Quick Test below to check your understanding.
Mini challenge
You have a daily job that exports full raw orders to a script, then joins with customers, filters for last 14 days, and aggregates by country. Redesign it to run fully in the warehouse, scanning only the last 14 days, broadcasting customers, and selecting only needed columns. Write the SQL and identify how many fewer bytes you expect to scan.
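One possible shape of the redesigned job, assuming illustrative names (an orders table with order_date, customer_id, and amount, and a customers table with customer_id and country); adjust to your schema. The broadcast hint is Spark-style, and many warehouses choose a broadcast join automatically:
-- Runs entirely in the warehouse: prunes to the last 14 days, reads three fact columns, broadcasts customers
WITH recent_orders AS (
  SELECT order_date, customer_id, amount
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '14' DAY
)
SELECT /*+ BROADCAST(c) */ c.country, SUM(o.amount) AS revenue_14d
FROM recent_orders o
JOIN customers c USING (customer_id)
GROUP BY c.country;
-- Rough savings estimate: (partitions kept / total partitions) x (columns read / total columns) of the old full scan.
-- For example, keeping 14 of 365 daily partitions and 3 of 30 columns scans roughly 0.4% of the original bytes,
-- and nothing is exported, so egress for this job drops to zero.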