Why this matters
As a Data Visualization Engineer, you rarely plot a single table. You blend facts and dimensions: orders with customers, sessions with campaigns, tickets with agents. Joining tables is how you shape the dataset your charts depend on.
- Build a customer 360 view by combining demographics, transactions, and support data.
- Create dashboard-ready datasets with the right grain (e.g., daily revenue by product).
- Avoid silent data errors like double-counting from many-to-many joins.
Who this is for
- Beginners who know basic SELECTs and filters.
- Analysts/BI and data viz practitioners moving beyond single-table queries.
- Engineers standardizing datasets for dashboards.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, ORDER BY.
- Know what a primary key and foreign key are.
- Comfort reading small sample tables.
Concept explained simply
A JOIN combines rows from two tables when a condition matches (usually a shared key like customer_id). Think of it as lining up two spreadsheets by a matching column and gluing matching rows together.
Mental model
- INNER JOIN: Keep only pairs that match on both sides. Intersection.
- LEFT JOIN: Keep all rows from the left table; bring matches from the right; unmatched become NULL. Preserve the left.
- RIGHT JOIN: Mirror of LEFT JOIN. Less common; prefer LEFT for readability.
- FULL OUTER JOIN: Keep all rows from both sides; unmatched become NULL on the other side.
- CROSS JOIN: All combinations (cartesian product). Rare; use with caution.
Tip: USING vs ON
USING (key) is shorthand when both tables have the same column name for the join key. It removes duplicate key columns in the result. ON gives you full control and is often clearer in complex joins.
Worked examples
1) INNER JOIN customers to orders
Goal: show each order with the customer name.
-- Tables
customers(customer_id, customer_name)
orders(order_id, customer_id, order_total)
SELECT o.order_id, c.customer_name, o.order_total
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
Result: Only orders with a valid customer appear.
2) LEFT JOIN to preserve all customers
Goal: list all customers and their last order total, even if they never ordered.
SELECT c.customer_id, c.customer_name, o.order_total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
ORDER BY c.customer_id;
Customers with no orders will have NULL order_total.
3) Many-to-many join trap (double counting) and fix
Goal: revenue by product. Facts are at order_item grain; products is at product grain. If you join orders to order_items and then to products, summing totals from orders will overcount. Fix: aggregate first at the desired grain.
-- Wrong: multiplies order_total by number of items in the order
SELECT p.product_category, SUM(o.order_total) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN products p ON p.product_id = oi.product_id
GROUP BY p.product_category;
-- Correct: sum at item grain (or pre-aggregate), not order_total
SELECT p.product_category, SUM(oi.item_total) AS revenue
FROM order_items oi
JOIN products p ON p.product_id = oi.product_id
GROUP BY p.product_category;
Why this works
The correct query sums item_total which matches the grain of order_items. Alternatively, pre-aggregate order_items by order_id first, then join to orders (if you need order-level fields) without duplicating totals.
4) Self JOIN (bonus)
Goal: find pairs of customers in the same city.
SELECT a.customer_name AS customer_a, b.customer_name AS customer_b, a.city
FROM customers a
JOIN customers b ON a.city = b.city AND a.customer_id < b.customer_id;
Using a < prevents duplicates and self-pairs.
How to do it in practice
- Decide the target grain: one row per what? (customer, order, day, session)
- Choose the preserving table: which side must keep all rows? (LEFT JOIN that table)
- Identify join keys: sometimes composite keys like (order_id, product_id)
- Pre-aggregate detail tables to the target grain before joining
- Validate counts and sums before/after
Quick self-check checklist
- Did I accidentally multiply rows? Check row counts after each join.
- Do my sums match a trusted single-table source?
- Are NULLs expected? If not, investigate unmatched keys.
- Am I joining on the correct keys and data types?
Exercises
These mirror the graded exercises below. Do them here first, then open the collapsible solutions to compare.
Exercise 1 — INNER vs LEFT JOIN basics (matches ex1)
Use the sample data, write two queries: one INNER JOIN to list only customers with orders; one LEFT JOIN to list all customers with their order totals (NULL if none).
Sample data
customers
| customer_id | customer_name |
|---|---|
| 1 | Ada |
| 2 | Ben |
| 3 | Chen |
orders
| order_id | customer_id | order_total |
|---|---|---|
| 10 | 1 | 25.00 |
| 11 | 1 | 30.00 |
| 12 | 3 | 50.00 |
Show suggestion
- Use INNER JOIN for the first query.
- Use LEFT JOIN for the second. Consider SUM(order_total) GROUP BY customer.
Exercise 2 — Avoid double counting with pre-aggregation (matches ex2)
Compute total revenue per customer from order_items without double counting. Then show the top result.
Sample data
customers
| customer_id | customer_name |
|---|---|
| 1 | Ada |
| 2 | Ben |
orders
| order_id | customer_id |
|---|---|
| 100 | 1 |
| 101 | 1 |
| 102 | 2 |
order_items
| order_id | product_id | item_total |
|---|---|---|
| 100 | p1 | 20.00 |
| 100 | p2 | 5.00 |
| 101 | p3 | 15.00 |
| 102 | p2 | 12.00 |
Show suggestion
- Aggregate order_items by order_id first.
- Join aggregated totals to orders, then to customers.
- Checklist before submitting:
- Target grain is customer.
- No multiplication of order totals.
- Sums match manual calculation.
Common mistakes and self-checks
- Joining on names instead of IDs: text can vary; use stable keys.
- SUM of order_total after joining to order_items: leads to double counting. Sum item_total or pre-aggregate first.
- Filters in WHERE that nullify a LEFT JOIN: move filter to the join condition or explicitly allow NULL.
- Forgetting composite keys: join on all necessary columns (e.g., store_id and product_id).
Self-check queries
-- Row growth check
SELECT 'orders' AS t, COUNT(*) FROM orders
UNION ALL
SELECT 'orders_join_items', COUNT(*) FROM orders o JOIN order_items oi ON oi.order_id = o.order_id;
-- Duped key check (should be unique)
SELECT order_id, COUNT(*) c
FROM orders
GROUP BY 1
HAVING COUNT(*) > 1;
-- Anti-join to find unmatched keys
SELECT o.order_id
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
Practical projects
- Sales dashboard dataset: Join customers, orders, order_items, products. Produce one row per day per product_category with revenue and order_count.
- Marketing attribution view: Join sessions, campaigns, and conversions. Preserve sessions with LEFT JOIN; compute conversion rate by channel.
- Support performance: Join tickets to agents and SLA targets. Flag breaches with a derived column.
Learning path
- Start: INNER and LEFT joins on single keys.
- Move to composite keys and self joins.
- Handle many-to-many with pre-aggregation.
- Add CASE expressions and derived metrics post-join.
- Optimize: indexes on join keys; limit columns.
Next steps
- Finish the exercises below and check solutions.
- Take the Quick Test to confirm understanding. Progress saving is available for logged-in users; the test itself is available to everyone.
- Apply joins to one of the Practical projects.
Mini challenge
You have tables: users(user_id, signup_date), events(user_id, event_date, event_type). Build a query that returns one row per user with: first_event_date, total_events, and a flag has_event_in_first_7_days (relative to signup). Use LEFT JOIN to preserve all users and be careful with aggregation.