Why this matters
As an ETL Developer, you constantly combine datasets (joins) and compress data into KPIs (aggregations). Real tasks include:
- Building daily revenue and active users from raw orders and events.
- Enriching facts with dimensions (e.g., country, marketing channel) without double-counting.
- Reconciling payments with orders, even when there are multiple payments per order.
- Preparing clean, one-row-per-entity tables for downstream analysts.
Concept explained simply
Joins match rows from two tables using one or more keys. Aggregations turn many rows into fewer rows by grouping and computing metrics like SUM, COUNT, AVG, MIN, MAX.
Quick mental model
- Join: like merging two lists by a shared label. Inner = matches only; Left = keep all from left, fill gaps; Full = keep all from both.
- Aggregation: like compressing a spreadsheet by date or user and summing amounts.
- Order of operations matters: aggregate before you join when row multiplication would inflate your metrics.
Core patterns
- Pick the right join:
  - INNER JOIN: only matched rows. Good for mandatory relationships.
  - LEFT JOIN: keep all left rows. Good when enrichment may be missing.
  - RIGHT/FULL JOIN: keep unmatched rows from one or both sides; useful for reconciling two systems.
  - CROSS JOIN: Cartesian product of both tables. Use cautiously.
- Control cardinality before joining:
  - Pre-aggregate to one row per key to avoid double-counting.
  - Deduplicate dimensions to a single record per business key (e.g., keep the latest non-null version).
- Aggregate with intent:
  - COUNT(*) counts rows; COUNT(DISTINCT x) counts unique values (slower on big data).
  - SUM(CASE WHEN ...) builds conditional metrics.
  - GROUP BY only the columns that define your grain; every other column must be aggregated.
- NULLs and types:
  - Join conditions on NULL keys never match; use COALESCE only when it preserves meaning.
  - Match data types on both sides of join keys to prevent implicit casts and skew.
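The conditional-metrics pattern above can be sketched in one pass over the orders table. This is an illustrative example: the status column and its 'refunded' value are assumed, not part of the schemas used elsewhere in this lesson.

```sql
-- Conditional metrics in a single scan (status column is illustrative)
SELECT
    o.order_ts::date AS order_date,
    SUM(o.amount) AS gmv,
    SUM(CASE WHEN o.status = 'refunded' THEN o.amount ELSE 0 END) AS refunded_gmv,
    COUNT(*) AS orders,
    COUNT(DISTINCT o.customer_id) AS buyers
FROM fact_orders o
GROUP BY 1;
```

Note that the GROUP BY list is exactly the grain (one row per day); every other column is wrapped in an aggregate.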
Worked examples
Example 1: Safe enrichment with a dimension
Task: Enrich orders with customer country, keeping all orders even if a customer record is missing.
-- Grain: one row per order
SELECT
    o.order_id,
    o.order_ts::date AS order_date,
    o.customer_id,
    o.amount AS gmv,
    c.country
FROM fact_orders o
LEFT JOIN dim_customers c
    ON o.customer_id = c.customer_id;
Why LEFT JOIN here?
Left keeps unmatched orders (data completeness). Analysts can still see orders with NULL country and decide how to handle them.
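A quick way to quantify those gaps is an anti-join-style check on the same tables; a sketch:

```sql
-- How many orders have no matching customer record?
SELECT COUNT(*) AS orders_missing_customer
FROM fact_orders o
LEFT JOIN dim_customers c
    ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```

If this count is unexpectedly high, investigate the dimension load before shipping the enriched table.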
Example 2: Avoid many-to-many explosions
Problem: Joining orders to payments. Some orders have multiple payment records. Summing after a raw join double-counts.
-- Wrong: revenue doubled if two payments per order
SELECT SUM(o.amount) AS gmv_wrong
FROM fact_orders o
JOIN fct_payments p ON o.order_id = p.order_id;

-- Correct: pre-aggregate payments to one row per order
WITH payments_per_order AS (
    SELECT order_id, SUM(amount) AS amount_paid
    FROM fct_payments
    GROUP BY order_id
)
SELECT
    SUM(o.amount) AS gmv,
    SUM(CASE WHEN p.amount_paid >= o.amount THEN 1 ELSE 0 END) AS paid_orders
FROM fact_orders o
LEFT JOIN payments_per_order p ON o.order_id = p.order_id;
Self-check
- Does the join input have one row per order on both sides? Yes.
- Does total GMV change if you remove the join? It should not.
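The second check can be made executable. A sketch, reusing the payments_per_order CTE from the corrected query:

```sql
-- GMV must be identical with and without the pre-aggregated join;
-- any difference signals row multiplication
WITH payments_per_order AS (
    SELECT order_id, SUM(amount) AS amount_paid
    FROM fct_payments
    GROUP BY order_id
)
SELECT
    (SELECT SUM(amount) FROM fact_orders) AS gmv_no_join,
    SUM(o.amount) AS gmv_with_join
FROM fact_orders o
LEFT JOIN payments_per_order p ON o.order_id = p.order_id;
```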
Example 3: Daily KPIs with COUNT DISTINCT
Task: Daily GMV and unique buyers.
SELECT
    o.order_ts::date AS order_date,
    SUM(o.amount) AS gmv,
    COUNT(DISTINCT o.customer_id) AS buyers
FROM fact_orders o
GROUP BY 1
ORDER BY 1;
Performance hint
COUNT(DISTINCT ...) can be expensive. On large data, pre-aggregate to one row per (order_date, customer_id) first, then count.
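A sketch of that rewrite, against the same fact_orders table:

```sql
-- Step 1: collapse to one row per (order_date, customer_id)
WITH daily_buyers AS (
    SELECT order_ts::date AS order_date, customer_id
    FROM fact_orders
    GROUP BY 1, 2
)
-- Step 2: a plain COUNT(*) now counts unique buyers per day
SELECT order_date, COUNT(*) AS buyers
FROM daily_buyers
GROUP BY order_date
ORDER BY order_date;
```

The pre-aggregation step often parallelizes better than a single COUNT(DISTINCT) over the raw table, though the gain depends on your engine.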
Example 4: Join to the latest dimension record
Task: Get each customer’s latest country and join to orders.
WITH latest_customer AS (
    SELECT DISTINCT ON (customer_id)
        customer_id,
        country,
        updated_at
    FROM dim_customers_hist
    WHERE country IS NOT NULL
    ORDER BY customer_id, updated_at DESC
)
SELECT o.order_id, o.amount, lc.country
FROM fact_orders o
LEFT JOIN latest_customer lc USING (customer_id);
Alternative if DISTINCT ON is not available (it is PostgreSQL-specific): use ROW_NUMBER instead.
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
    FROM dim_customers_hist
    WHERE country IS NOT NULL
),
latest_customer AS (
    SELECT customer_id, country FROM ranked WHERE rn = 1
)
SELECT ...  -- same final SELECT as in the DISTINCT ON version
Hands-on exercises
Do these locally or in any SQL-capable environment. Work through the two exercises below; a quick test is available at the end.
- Exercise 1 (ex1): Pre-aggregate payments, then join to orders to compute GMV and paid order count per customer.
- Exercise 2 (ex2): Daily GMV and unique buyers by region using a LEFT JOIN to customers.
Checklist before you move on
- Your joins produce the intended grain (e.g., one row per order or per day).
- Removing a join doesn’t change a metric that shouldn’t depend on it (sanity test).
- No unexpected row multiplication (row counts match expectations).
- NULL handling is explicit (COALESCE only where meaningful).
Common mistakes and self-check
- Many-to-many joins multiplying metrics. Fix: pre-aggregate to one row per key before joining.
- Using INNER JOIN and accidentally dropping valid unmatched rows. Fix: use LEFT JOIN when enrichment may be missing.
- Mismatched data types on join keys causing implicit casts and poor performance. Fix: cast explicitly and align types upstream.
- Grouping by too many columns. Fix: decide the output grain first, then group only by those keys.
- COUNT(*) vs COUNT(DISTINCT x) confusion. Fix: know what you’re counting; distinct is for unique entities, not rows.
- Joining on dirty keys (leading/trailing spaces, case differences). Fix: standardize keys (TRIM/UPPER) in staging.
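For the dirty-keys fix, a minimal staging sketch, assuming text keys and an illustrative raw_customers source table; normalizing once upstream is cheaper than wrapping every join condition in functions:

```sql
-- Staging view with normalized join keys (raw_customers is illustrative)
CREATE VIEW stg_customers AS
SELECT
    TRIM(UPPER(customer_id)) AS customer_id,  -- standardize once, upstream
    country
FROM raw_customers;
-- Downstream joins can then use plain equality on customer_id.
```

Applying TRIM/UPPER inside a join condition also tends to defeat index usage, which is another reason to clean keys in staging.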
Self-check routine
- Print row counts before and after each join.
- Validate sums against a source-of-truth report for a small time window.
- Spot-check a few keys manually to confirm one-to-one matches.
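The first routine item can be sketched against the enrichment join from Example 1:

```sql
-- Row counts before vs. after the join; a LEFT JOIN onto a deduplicated
-- dimension should leave the count unchanged
SELECT
    (SELECT COUNT(*) FROM fact_orders) AS rows_before,
    COUNT(*) AS rows_after
FROM fact_orders o
LEFT JOIN dim_customers c ON o.customer_id = c.customer_id;
```

If rows_after exceeds rows_before, the dimension has duplicate keys and needs deduplication before the join.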
Mini challenge
Build a one-row-per-day table with:
- gmv_total: SUM(order amount)
- gmv_paid: SUM(order amount where amount_paid >= amount)
- buyers: COUNT(DISTINCT customer_id)
- paid_rate: gmv_paid / NULLIF(gmv_total, 0)
Acceptance criteria
- No row multiplication from payments.
- Paid rate between 0 and 1.
- Dates have no gaps within the source date range.
Who this is for
- ETL Developers and Data Engineers building pipelines and marts.
- Analyst Engineers maintaining dimensional models.
- Anyone moving from SQL basics to production-grade transformations.
Prerequisites
- Basic SQL (SELECT, WHERE, GROUP BY, CASE).
- Comfort with data types and NULL semantics.
- Understanding of table grains (what one row represents).
Learning path
- Master joins and aggregations (this lesson).
- Then: window functions for running totals and latest-record joins.
- Then: slowly changing dimensions and deduplication strategies.
- Finally: performance tuning and incremental models.
Practical projects
- Sales mart: facts, dimensions, daily KPIs, and a customer 360 rollup.
- Payments reconciliation: detect under/overpaid orders by day.
- Marketing attribution: join sessions to orders with careful grain control.
Next steps
- Complete the exercises below.
- Take the quick test to check mastery.
- Apply the patterns to a real dataset you own.