Why this matters
As an ETL Developer, you constantly combine datasets (joins) and compress data into KPIs (aggregations). Real tasks include:
- Building daily revenue and active users from raw orders and events.
- Enriching facts with dimensions (e.g., country, marketing channel) without double-counting.
- Reconciling payments with orders, even when there are multiple payments per order.
- Preparing clean, one-row-per-entity tables for downstream analysts.
Concept explained simply
Joins match rows from two tables using one or more keys. Aggregations turn many rows into fewer rows by grouping and computing metrics like SUM, COUNT, AVG, MIN, MAX.
Quick mental model
- Join: like merging two lists by a shared label. Inner = matches only; Left = keep all from left, fill gaps; Full = keep all from both.
- Aggregation: like compressing a spreadsheet by date or user and summing amounts.
- Order of operations matters: aggregate before you join when row multiplication would inflate your metrics.
Core patterns
- Pick the right join:
  - INNER JOIN: only matched rows. Good for mandatory relationships.
  - LEFT JOIN: keep all left rows. Good when enrichment may be missing.
  - RIGHT/FULL JOIN: keep unmatched rows from one or both sides; useful for reconciling two systems.
  - CROSS JOIN: Cartesian product of both tables. Use cautiously.
- Control cardinality before joining:
  - Pre-aggregate to one row per key to avoid double-counting.
  - Deduplicate dimensions to a single record per business key (e.g., keep the latest non-null version).
- Aggregate with intent:
  - COUNT(*) counts rows; COUNT(DISTINCT x) counts unique values (slower on big data).
  - SUM(CASE WHEN ...) builds conditional metrics.
  - GROUP BY only the columns that define your grain; every other column must be aggregated.
- NULLs and types:
  - Join conditions on NULL keys never match; use COALESCE only when it preserves meaning.
  - Match data types on both sides of join keys to prevent implicit casts and skew.
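The conditional-metrics pattern above can be sketched in one pass over the orders table. This is an illustrative example: the status column and its 'refunded' value are assumed, not part of the schemas used elsewhere in this lesson.

```sql
-- Conditional metrics in a single scan (status column is illustrative)
SELECT
    o.order_ts::date AS order_date,
    SUM(o.amount) AS gmv,
    SUM(CASE WHEN o.status = 'refunded' THEN o.amount ELSE 0 END) AS refunded_gmv,
    COUNT(*) AS orders,
    COUNT(DISTINCT o.customer_id) AS buyers
FROM fact_orders o
GROUP BY 1;
```

Note that the GROUP BY list is exactly the grain (one row per day); every other column is wrapped in an aggregate.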
Worked examples
Example 1: Safe enrichment with a dimension
Task: Enrich orders with customer country, keeping all orders even if a customer record is missing.
-- Grain: one row per order
SELECT
    o.order_id,
    o.order_ts::date AS order_date,
    o.customer_id,
    o.amount AS gmv,
    c.country
FROM fact_orders o
LEFT JOIN dim_customers c
    ON o.customer_id = c.customer_id;
Why LEFT JOIN here?
Left keeps unmatched orders (data completeness). Analysts can still see orders with NULL country and decide how to handle them.
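A quick way to quantify those gaps is an anti-join-style check on the same tables; a sketch:

```sql
-- How many orders have no matching customer record?
SELECT COUNT(*) AS orders_missing_customer
FROM fact_orders o
LEFT JOIN dim_customers c
    ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```

If this count is unexpectedly high, investigate the dimension load before shipping the enriched table.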
Example 2: Avoid many-to-many explosions
Problem: Joining orders to payments. Some orders have multiple payment records. Summing after a raw join double-counts.
-- Wrong: revenue doubled if two payments per order
SELECT SUM(o.amount) AS gmv_wrong
FROM fact_orders o
JOIN fct_payments p ON o.order_id = p.order_id;

-- Correct: pre-aggregate payments to one row per order
WITH payments_per_order AS (
    SELECT order_id, SUM(amount) AS amount_paid
    FROM fct_payments
    GROUP BY order_id
)
SELECT
    SUM(o.amount) AS gmv,
    SUM(CASE WHEN p.amount_paid >= o.amount THEN 1 ELSE 0 END) AS paid_orders
FROM fact_orders o
LEFT JOIN payments_per_order p ON o.order_id = p.order_id;
Self-check
- Does the join input have one row per order on both sides? Yes.
- Does total GMV change if you remove the join? It should not.
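The second check can be made executable. A sketch, reusing the payments_per_order CTE from the corrected query:

```sql
-- GMV must be identical with and without the pre-aggregated join;
-- any difference signals row multiplication
WITH payments_per_order AS (
    SELECT order_id, SUM(amount) AS amount_paid
    FROM fct_payments
    GROUP BY order_id
)
SELECT
    (SELECT SUM(amount) FROM fact_orders) AS gmv_no_join,
    SUM(o.amount) AS gmv_with_join
FROM fact_orders o
LEFT JOIN payments_per_order p ON o.order_id = p.order_id;
```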
Example 3: Daily KPIs with COUNT DISTINCT
Task: Daily GMV and unique buyers.
SELECT
    o.order_ts::date AS order_date,
    SUM(o.amount) AS gmv,
    COUNT(DISTINCT o.customer_id) AS buyers
FROM fact_orders o
GROUP BY 1
ORDER BY 1;
Performance hint
COUNT(DISTINCT ...) can be expensive. On large data, pre-aggregate to one row per (order_date, customer_id) first, then count.
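A sketch of that rewrite, against the same fact_orders table:

```sql
-- Step 1: collapse to one row per (order_date, customer_id)
WITH daily_buyers AS (
    SELECT order_ts::date AS order_date, customer_id
    FROM fact_orders
    GROUP BY 1, 2
)
-- Step 2: a plain COUNT(*) now counts unique buyers per day
SELECT order_date, COUNT(*) AS buyers
FROM daily_buyers
GROUP BY order_date
ORDER BY order_date;
```

The pre-aggregation step often parallelizes better than a single COUNT(DISTINCT) over the raw table, though the gain depends on your engine.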
Example 4: Join to the latest dimension record
Task: Get each customer’s latest country and join to orders.
WITH latest_customer AS (
    SELECT DISTINCT ON (customer_id)
        customer_id,
        country,
        updated_at
    FROM dim_customers_hist
    WHERE country IS NOT NULL
    ORDER BY customer_id, updated_at DESC
)
SELECT o.order_id, o.amount, lc.country
FROM fact_orders o
LEFT JOIN latest_customer lc USING (customer_id);
Alternative if DISTINCT ON is not available (it is PostgreSQL-specific): use ROW_NUMBER instead.
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
    FROM dim_customers_hist
    WHERE country IS NOT NULL
),
latest_customer AS (
    SELECT customer_id, country FROM ranked WHERE rn = 1
)
SELECT ...  -- same final SELECT as in the DISTINCT ON version
Hands-on exercises
Do these locally or in any SQL-capable environment. Work through the two exercises below; a quick test is available at the end.
- Exercise 1 (ex1): Pre-aggregate payments, then join to orders to compute GMV and paid order count per customer.
- Exercise 2 (ex2): Daily GMV and unique buyers by region using a LEFT JOIN to customers.
Checklist before you move on
- Your joins produce the intended grain (e.g., one row per order or per day).
- Removing a join doesn’t change a metric that shouldn’t depend on it (sanity test).
- No unexpected row multiplication (row counts match expectations).
- NULL handling is explicit (COALESCE only where meaningful).
Common mistakes and self-check
- Many-to-many joins multiplying metrics. Fix: pre-aggregate to one row per key before joining.
- Using INNER JOIN and accidentally dropping valid unmatched rows. Fix: use LEFT JOIN when enrichment may be missing.
- Mismatched data types on join keys causing implicit casts and poor performance. Fix: cast explicitly and align types upstream.
- Grouping by too many columns. Fix: decide the output grain first, then group only by those keys.
- COUNT(*) vs COUNT(DISTINCT x) confusion. Fix: know what you’re counting; distinct is for unique entities, not rows.
- Joining on dirty keys (leading/trailing spaces, case differences). Fix: standardize keys (TRIM/UPPER) in staging.
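For the dirty-keys fix, a minimal staging sketch, assuming text keys and an illustrative raw_customers source table; normalizing once upstream is cheaper than wrapping every join condition in functions:

```sql
-- Staging view with normalized join keys (raw_customers is illustrative)
CREATE VIEW stg_customers AS
SELECT
    TRIM(UPPER(customer_id)) AS customer_id,  -- standardize once, upstream
    country
FROM raw_customers;
-- Downstream joins can then use plain equality on customer_id.
```

Applying TRIM/UPPER inside a join condition also tends to defeat index usage, which is another reason to clean keys in staging.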
Self-check routine
- Print row counts before and after each join.
- Validate sums against a source-of-truth report for a small time window.
- Spot-check a few keys manually to confirm one-to-one matches.
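The first routine item can be sketched against the enrichment join from Example 1:

```sql
-- Row counts before vs. after the join; a LEFT JOIN onto a deduplicated
-- dimension should leave the count unchanged
SELECT
    (SELECT COUNT(*) FROM fact_orders) AS rows_before,
    COUNT(*) AS rows_after
FROM fact_orders o
LEFT JOIN dim_customers c ON o.customer_id = c.customer_id;
```

If rows_after exceeds rows_before, the dimension has duplicate keys and needs deduplication before the join.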
Mini challenge
Build a one-row-per-day table with:
- gmv_total: SUM(order amount)
- gmv_paid: SUM(order amount where amount_paid >= amount)
- buyers: COUNT(DISTINCT customer_id)
- paid_rate: gmv_paid / NULLIF(gmv_total, 0)
Acceptance criteria
- No row multiplication from payments.
- Paid rate between 0 and 1.
- Dates have no gaps within the source date range.
Who this is for
- ETL Developers and Data Engineers building pipelines and marts.
- Analyst Engineers maintaining dimensional models.
- Anyone moving from SQL basics to production-grade transformations.
Prerequisites
- Basic SQL (SELECT, WHERE, GROUP BY, CASE).
- Comfort with data types and NULL semantics.
- Understanding of table grains (what one row represents).
Learning path
- Master joins and aggregations (this lesson).
- Then: window functions for running totals and latest-record joins.
- Then: slowly changing dimensions and deduplication strategies.
- Finally: performance tuning and incremental models.
Practical projects
- Sales mart: facts, dimensions, daily KPIs, and a customer 360 rollup.
- Payments reconciliation: detect under/overpaid orders by day.
- Marketing attribution: join sessions to orders with careful grain control.
Next steps
- Complete the exercises below.
- Take the quick test to check mastery.
- Apply the patterns to a real dataset you own.