How to learn Joining Multiple Data Sources for SQL in Data Scientist for free

Why this matters

Data Scientists regularly merge tables from product analytics, marketing, finance, and data warehouses. Good joins give you accurate features, cohorts, and metrics. Bad joins double-count users, drop revenue, or skew models. Mastering joins means you can confidently assemble the exact dataset you need.

Build training datasets from multiple tables (events, users, subscriptions).
Answer business questions (conversion by channel, LTV by cohort).
Create clean features (latest plan, last payment date, total orders).

Typical real tasks

Join customers to orders to compute AOV.
Combine web and in-store orders into one fact table.
Match transactions to the active price plan at the time of purchase.
Attach marketing channel to each user via a mapping table.

Concept explained simply

A SQL join lines up rows from two or more tables where key values match. Think of laying two spreadsheets side by side and keeping rows that correspond to the same entity (like a user_id). When there are multiple tables, you connect them one step at a time, usually via keys or bridge tables.

Mental model

Start table: the “story you’re telling” (e.g., orders).
Lookup table(s): add attributes (e.g., customers, products).
Bridge table(s): resolve many-to-many (e.g., user_channel_map).
Filters: push them early to reduce row explosion and speed up queries.

Join types in one sentence each

INNER: keep only matches on both sides.
LEFT: keep all from left, fill unmatched right with NULLs.
RIGHT: mirror of LEFT (less common).
FULL: keep everything from both; unmatched sides are NULL.
SEMI (EXISTS): keep left rows that have a match, but don’t bring columns.
ANTI (NOT EXISTS): keep left rows with no match on the right.

Core patterns and syntax

-- Basic multi-table join
SELECT o.order_id, c.customer_name, p.product_name
FROM orders o
JOIN order_items oi   ON o.order_id = oi.order_id
JOIN products p       ON oi.product_id = p.product_id
LEFT JOIN customers c ON o.customer_id = c.customer_id;

-- Pre-aggregate before join to avoid duplication
WITH payment_sum AS (
  SELECT order_id, SUM(amount) AS paid
  FROM payments
  GROUP BY order_id
)
SELECT o.order_id, o.total_amount, ps.paid
FROM orders o
LEFT JOIN payment_sum ps ON o.order_id = ps.order_id;

-- Date-range (temporal) join: active plan at purchase time
SELECT t.txn_id, p.plan_name
FROM transactions t
JOIN plans p
  ON t.user_id = p.user_id
 AND t.txn_ts >= p.valid_from
 AND (p.valid_to IS NULL OR t.txn_ts < p.valid_to);

-- Semi-join: users who purchased
SELECT u.user_id
FROM users u
WHERE EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.user_id
);

-- Anti-join: users who never purchased
SELECT u.user_id
FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.user_id
);

Worked examples

Example 1: Orders with customer and product details

Goal: Each order row with its customer name and the list of product names.

Solution idea

Join orders → order_items → products, and left join customers. Aggregate product names to avoid duplication (or return one row per item if needed).

-- One row per order with concatenated product names (syntax may vary by SQL dialect)
WITH items AS (
  SELECT o.order_id,
         STRING_AGG(p.product_name, ', ') AS products
  FROM orders o
  JOIN order_items oi ON o.order_id = oi.order_id
  JOIN products p     ON oi.product_id = p.product_id
  GROUP BY o.order_id
)
SELECT o.order_id, c.customer_name, i.products
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
LEFT JOIN items i     ON o.order_id = i.order_id;

Why this works

Pre-aggregating product names blocks the row explosion that happens when joining multiple child tables directly.

Example 2: Revenue with payments joined once

Goal: Show total order amount and amount paid. Payments table can contain multiple rows per order.

WITH payment_sum AS (
  SELECT order_id, SUM(amount) AS paid
  FROM payments
  GROUP BY order_id
)
SELECT o.order_id, o.total_amount, COALESCE(ps.paid, 0) AS paid
FROM orders o
LEFT JOIN payment_sum ps ON ps.order_id = o.order_id;

Key lesson

Always aggregate on the many-side before joining when you need one row per entity on the one-side.

Example 3: Many-to-many via bridge (marketing channel)

Goal: Attach a primary acquisition channel to each user using a mapping table that can have multiple rows per user.

-- Choose the earliest channel by timestamp as primary
WITH ranked AS (
  SELECT ucm.user_id, ucm.channel, ucm.first_seen_ts,
         ROW_NUMBER() OVER (PARTITION BY ucm.user_id ORDER BY ucm.first_seen_ts) AS rn
  FROM user_channel_map ucm
)
SELECT u.user_id, r.channel AS primary_channel
FROM users u
LEFT JOIN ranked r
  ON u.user_id = r.user_id AND r.rn = 1;

Key lesson

Use window functions to pick a single row from a many-to-many relationship before joining to the main table.

Practical steps

Define the grain: Decide the target row-level (e.g., one row per order). Everything else must respect this.
Map relationships: Draw which tables connect and via which keys.
Pre-aggregate many-sides: SUM/COUNT or window to one row per key before joins.
Join in layers: Start with the main fact table, then add lookups and bridges.
Validate counts: Compare row counts before/after joins and totals to catch duplicates or losses.

Mini checklist

Grain is explicit and consistent.
All join keys use matching data types.
Filters applied in the right place (ON vs WHERE).
Many-sides are pre-aggregated or windowed.
Row counts and sums match expectations.

Exercises

Use the following simplified schemas in mind:

customers(customer_id, customer_name, signup_ts)
orders(order_id, customer_id, order_ts, total_amount)
order_items(order_id, product_id, qty, unit_price)
products(product_id, product_name, category)
payments(payment_id, order_id, amount, paid_ts)
user_channel_map(user_id, channel, first_seen_ts)

Exercise 1 (ex1): AOV by primary channel

Create a query with one row per channel that shows:

orders_count
aov (average order value = total_amount average)

Define primary channel as the earliest record in user_channel_map per customer (assume user_id matches customer_id). Include channels with zero orders.

Hints

Use ROW_NUMBER() to choose a primary channel per user.
Left join channels to aggregated orders by customer/channel.
COUNT(*) and AVG(total_amount) on the appropriate grain.

Show solution

WITH primary_channel AS (
  SELECT user_id AS customer_id, channel,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY first_seen_ts) AS rn
  FROM user_channel_map
),
channels AS (
  SELECT customer_id, channel FROM primary_channel WHERE rn = 1
),
orders_by_cust AS (
  SELECT o.customer_id, COUNT(*) AS orders_cnt, AVG(o.total_amount) AS aov
  FROM orders o
  GROUP BY o.customer_id
)
SELECT c.channel,
       COALESCE(SUM(oc.orders_cnt), 0) AS orders_count,
       AVG(oc.aov) FILTER (WHERE oc.aov IS NOT NULL) AS aov
FROM channels c
LEFT JOIN orders_by_cust oc ON oc.customer_id = c.customer_id
GROUP BY c.channel
ORDER BY c.channel;

Expected output (shape):

channel | orders_count | aov
------- | ------------ | ----
ads     | 120          | 58.30
email   |  95          | 62.10
referral|  40          | 71.55

Exercise 2 (ex2): Order payment coverage

Produce a table with one row per order: order_id, total_amount, paid_amount, is_fully_paid (Y/N). Consider multiple payment rows per order.

Hints

Pre-aggregate payments by order_id.
Use COALESCE to handle orders without payments.
Compare paid_amount vs total_amount.

Show solution

WITH payment_sum AS (
  SELECT order_id, SUM(amount) AS paid_amount
  FROM payments
  GROUP BY order_id
)
SELECT o.order_id,
       o.total_amount,
       COALESCE(ps.paid_amount, 0) AS paid_amount,
       CASE WHEN COALESCE(ps.paid_amount, 0) >= o.total_amount THEN 'Y' ELSE 'N' END AS is_fully_paid
FROM orders o
LEFT JOIN payment_sum ps ON ps.order_id = o.order_id
ORDER BY o.order_id;

Expected output (shape):

order_id | total_amount | paid_amount | is_fully_paid
-------- | ------------ | ----------- | -------------
1001     | 120.00       | 120.00      | Y
1002     |  80.00       |  50.00      | N
1003     |  65.00       |   0.00      | N

Self-check after exercises

Did any join unexpectedly multiply rows? Compare counts against expectations.
Are NULLs handled where matches may not exist?
Did you pre-aggregate many-to-one relationships before joining?
Did filters in WHERE accidentally turn a LEFT join into an INNER join?

Common mistakes and how to self-check

Row explosion from many-to-many joins. Fix: Pre-aggregate or window to one row per key before joining.
Filtering in WHERE after a LEFT join on right-table columns, which nullifies the LEFT. Fix: Move the condition into the ON clause or use IS NOT NULL carefully.
Mismatched data types on join keys (string vs int). Fix: CAST keys to a common type.
Joining on names instead of IDs. Fix: Use stable keys; add a mapping table if needed.
Not defining the grain. Fix: Write the intended grain at the top of the query as a comment and design joins accordingly.

Quick self-audit

What is my target grain?
Which joins can multiply rows? How did I prevent it?
Are all filters in the correct place?
Do totals match source-of-truth numbers?

Practical projects

Customer 360 view: Build a single table keyed by customer_id with latest plan, total orders, lifetime spend, last seen channel.
Marketing attribution snapshot: For each order, attach acquisition channel and compute revenue by channel.
Product performance mart: One row per product with units sold, revenue, unique buyers, and repeat purchase rate.

Implementation tips

Stage with CTEs: one CTE per source and pre-aggregation.
Write assertions as SELECT checks (counts, sums) between CTEs.
Keep queries composable: start small, add joins one by one.

Mini challenge

Build a query returning one row per customer with: total_orders, lifetime_value (sum total_amount), first_order_ts, last_order_ts, and primary_channel (earliest channel). Ensure customers with no orders still appear with zero totals.

Show reference solution

WITH primary_channel AS (
  SELECT user_id AS customer_id, channel,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY first_seen_ts) AS rn
  FROM user_channel_map
), pc AS (
  SELECT customer_id, channel FROM primary_channel WHERE rn = 1
), agg_orders AS (
  SELECT customer_id,
         COUNT(*) AS total_orders,
         COALESCE(SUM(total_amount),0) AS lifetime_value,
         MIN(order_ts) AS first_order_ts,
         MAX(order_ts) AS last_order_ts
  FROM orders
  GROUP BY customer_id
)
SELECT c.customer_id,
       COALESCE(ao.total_orders, 0) AS total_orders,
       COALESCE(ao.lifetime_value, 0) AS lifetime_value,
       ao.first_order_ts,
       ao.last_order_ts,
       pc.channel AS primary_channel
FROM customers c
LEFT JOIN ao ON ao.customer_id = c.customer_id
LEFT JOIN pc ON pc.customer_id = c.customer_id;

Who this is for

Data Scientists and Analysts who assemble datasets from multiple tables.
Engineers building feature stores or analytics marts.
Anyone preparing data for modeling, dashboards, or experiments.

Prerequisites

Comfort with SELECT, WHERE, GROUP BY, and basic joins.
Basic understanding of primary/foreign keys and one-to-many vs many-to-many relationships.
Familiarity with window functions is a plus.

Learning path

Review join types and ON vs WHERE behavior.
Practice multi-table joins with careful grain definition.
Use CTEs for pre-aggregation and row selection via window functions.
Validate results with row counts and control totals.
Tackle temporal joins and semi/anti joins.

Progress note

The Quick Test is available to everyone. If you log in, your progress and results are saved.

Next steps

Re-implement the examples using your data warehouse dialect.
Add unit checks: compare totals before and after joins.
Move on to window functions and advanced aggregations.

Menu

Joining Multiple Data Sources

Table of Contents

Why this matters

Concept explained simply

Mental model

Core patterns and syntax

Worked examples

Example 1: Orders with customer and product details

Example 2: Revenue with payments joined once

Example 3: Many-to-many via bridge (marketing channel)

Practical steps

Exercises

Common mistakes and how to self-check

Practical projects

Mini challenge

Who this is for

Prerequisites

Learning path

Next steps

Practice Exercises

AOV by Primary Channel

Instructions

Expected Output

Order Payment Coverage

Joining Multiple Data Sources — Quick Test

Have questions about Joining Multiple Data Sources?

AI Assistant