
Joins And Groupby Patterns

Learn Joins And Groupby Patterns for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Data Scientists constantly join datasets (events with user profiles, transactions with products) and summarize them (revenue by segment, retention by cohort). Solid patterns for joins and groupby unlock faster analysis, fewer bugs, and clearer insights.

  • Customer analytics: join orders to customers, aggregate revenue and repeat rate.
  • Experiment analysis: join assignments to outcomes, groupby variant to compute lift.
  • Time-series context: as-of join sensor readings to latest calibration.

Who this is for

  • Data Scientists and Analysts who use pandas for daily analysis.
  • Engineers building data notebooks or quick reports.
  • Learners bridging SQL knowledge to pandas workflows.

Prerequisites

  • Basic Python (variables, functions, imports).
  • Pandas essentials: DataFrame/Series, selecting columns/rows.
  • Comfort reading tabular data (CSV/Parquet) into pandas.

Concept explained simply

Joins combine columns from two tables based on matching keys—like matching order.customer_id to customer.id. Groupby splits data into groups (by customer, product, date), applies an operation (sum, mean, count), and combines the result.

Mental model

  • Join = align rows by keys. Choose how to keep rows: inner (only matching keys), left (all left rows), right (all right rows), outer (all rows from either side); see the sketch after this list.
  • Groupby = split-apply-combine. Split by keys, apply aggregations or transforms, combine into a result aligned by groups.
  • Common patterns: many-to-one joins, anti-joins (things without matches), aggregate-then-join, transform for shares/ratios, as-of join for time.
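
To make these row-keeping rules concrete, here is a minimal sketch with two toy frames (the key values and column names are made up for illustration):
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'l': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r': ['x', 'y', 'z']})

# Row counts by join type: inner=2, left=3, right=3, outer=4
for how in ['inner', 'left', 'right', 'outer']:
    merged = left.merge(right, on='key', how=how)
    print(how, len(merged), sorted(merged['key'].tolist()))

# Anti-join: rows in left that have no match in right
flagged = left.merge(right, on='key', how='left', indicator=True)
print(flagged[flagged['_merge'] == 'left_only'][['key', 'l']])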

Worked examples

1) Many-to-one join + aggregated customer revenue
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'segment': ['Consumer', 'Corporate', 'Home Office']
})
orders = pd.DataFrame({
    'order_id': [101,102,103,104,105,106],
    'customer_id': [1,1,2,2,2,3],
    'amount': [120.0,80.0,200.0,100.0,50.0,0.0]
})

# Many-to-one left join: each order gets a customer segment
orders_enriched = orders.merge(customers, on='customer_id', how='left', validate='many_to_one')

# Aggregate by customer
cust_metrics = (
    orders_enriched
    .groupby(['customer_id','segment'], as_index=False)
    .agg(total_amount=('amount','sum'),
         order_count=('amount','size'),
         avg_amount=('amount','mean'))
    .sort_values('total_amount', ascending=False)
)
print(cust_metrics)

Tip: validate='many_to_one' catches unexpected duplicates in customers.
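
For instance (toy data), a duplicated customer row makes the same merge raise MergeError instead of silently multiplying order rows:
import pandas as pd
from pandas.errors import MergeError

orders = pd.DataFrame({'customer_id': [1, 2], 'amount': [10.0, 20.0]})
# Duplicated customer_id, e.g. from a bad upstream export
customers_dup = pd.DataFrame({'customer_id': [1, 1, 2], 'segment': ['A', 'A', 'B']})

try:
    orders.merge(customers_dup, on='customer_id', how='left', validate='many_to_one')
except MergeError as exc:
    print('validation failed:', exc)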

2) Semi-/Anti-join to find customers with no orders in last 90 days
import pandas as pd

now = pd.Timestamp('2024-01-01')
customers = pd.DataFrame({'customer_id':[1,2,3,4]})
orders = pd.DataFrame({
    'customer_id':[1,1,2],
    'order_date':[now - pd.Timedelta(days=10), now - pd.Timedelta(days=120), now - pd.Timedelta(days=20)]
})

# last order per customer
last_order = orders.groupby('customer_id', as_index=False)['order_date'].max().rename(columns={'order_date':'last_order'})

# left join to attach last_order
with_last = customers.merge(last_order, on='customer_id', how='left')

# anti-join logic (no orders in 90 days)
cutoff = now - pd.Timedelta(days=90)
churned = with_last[with_last['last_order'].isna() | (with_last['last_order'] < cutoff)]
print(churned)

Pattern: aggregate then left join, then filter for missing or older timestamps.

3) Transform for shares: product share within category
import pandas as pd

sales = pd.DataFrame({
    'sku':['A','B','C','B','C','A'],
    'category':['Snacks','Snacks','Drinks','Snacks','Drinks','Snacks'],
    'units':[10,30,20,5,30,5]
})

by_sku = sales.groupby(['sku','category'], as_index=False)['units'].sum()
by_sku['cat_total'] = by_sku.groupby('category')['units'].transform('sum')
by_sku['share'] = by_sku['units']/by_sku['cat_total']
print(by_sku)

transform keeps the original shape, letting you compute ratios per group without losing rows.

4) Time-aware merge: latest calibration via merge_asof
import pandas as pd

readings = pd.DataFrame({
    'ts': pd.to_datetime(['2024-01-01 10:00','2024-01-01 10:05','2024-01-01 10:10']),
    'value': [1.2, 1.3, 1.25]
}).sort_values('ts')
calib = pd.DataFrame({
    'ts': pd.to_datetime(['2024-01-01 09:55','2024-01-01 10:07']),
    'offset':[0.02, 0.01]
}).sort_values('ts')

# asof join: match each reading to the most recent prior calibration
adj = pd.merge_asof(readings, calib, on='ts', direction='backward')
adj['value_adj'] = adj['value'] - adj['offset']
print(adj)

Always sort both frames by the key before merge_asof; direction='backward' matches each reading to the most recent prior calibration.

Learning path

  1. Practice core joins: inner, left, right, outer; validate shapes with row counts.
  2. Master groupby: single- and multi-key grouping, named aggregations, as_index=True vs as_index=False.
  3. Level up with patterns: anti-join, semi-join, transform-based features, aggregate-then-join.
  4. Time-series joins: merge_asof and rolling groupby.
  5. Performance: pre-sorting keys, categorical dtypes, avoiding accidental many-to-many joins (see the sketch after this list).
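
A minimal sketch of the step-5 checks, using toy frames and hypothetical column names:
import pandas as pd

sales = pd.DataFrame({
    'region': ['EU', 'EU', 'US', 'US', 'US'],
    'product': ['a', 'b', 'a', 'a', 'b'],
    'amount': [10.0, 20.0, 5.0, 7.5, 3.0]
})

# Low-cardinality string keys as categoricals: less memory, often faster groupby.
sales['region'] = sales['region'].astype('category')
sales['product'] = sales['product'].astype('category')

# observed=True keeps only category combinations that actually occur.
print(sales.groupby(['region', 'product'], observed=True)['amount'].sum())

# Before a merge, confirm the lookup side is unique to avoid a many-to-many explosion.
products = pd.DataFrame({'product': ['a', 'b'], 'unit_cost': [1.0, 2.0]})
print(products['product'].value_counts().max())  # should be 1 for a many-to-one join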

Exercises

These mirror the interactive exercises. Use a fresh Python session with pandas imported.

Exercise 1 — Join and aggregate customer metrics

Goal: left-join orders to customers, then compute per-customer total_amount, order_count, and avg_amount. Sort by total_amount desc.

Starter code
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'segment': ['Consumer', 'Corporate', 'Home Office']
})
orders = pd.DataFrame({
    'order_id': [101,102,103,104,105,106],
    'customer_id': [1,1,2,2,2,3],
    'amount': [120.0,80.0,200.0,100.0,50.0,0.0]
})

# Your code here
  • Use merge with validate='many_to_one'.
  • Use groupby with named aggregations.
  • Round avg_amount to 2 decimals for display if desired.
Expected output (shape and sample)
   customer_id      segment  total_amount  order_count  avg_amount
0            2     Corporate         350.0            3  116.666667
1            1      Consumer         200.0            2  100.000000
2            3  Home Office           0.0            1    0.000000

Exercise 2 — Anti-join + transform shares

Goal: find products with zero sales, then compute per-category share for sold SKUs.

Starter code
import pandas as pd

products = pd.DataFrame({
    'sku': ['A','B','C','D'],
    'category': ['Snacks','Snacks','Drinks','Drinks']
})
sales = pd.DataFrame({
    'sku': ['A','B','C','E'],
    'units': [10,30,20,5]
})

# Your code here
  • Anti-join: left merge products with sales, indicator=True, keep left_only rows (zero-sales).
  • Compute shares: inner-join sales to products (this drops SKUs not in the catalog), groupby sku/category to sum units, transform to get each category total, then divide units by that total.
Expected output (two parts)
Zero-sales SKUs:
  sku category
3   D   Drinks

Per-SKU shares among sold products:
  sku category  units  cat_total  share
0   A   Snacks     10         40   0.25
1   B   Snacks     30         40   0.75
2   C   Drinks     20         20   1.00
  • Checklist:
    • Used validate in merges where shape assumptions exist.
    • Confirmed groupby index behavior (as_index vs reset_index).
    • Verified row counts before/after joins.
    • Handled unexpected SKUs/customers gracefully.

Common mistakes and self-check

  • Mistake: Many-to-many join causes row explosion. Fix: check for duplicate keys; pass validate='many_to_one' (or 'one_to_one') to merge.
  • Mistake: groupby returns index you didn't expect. Fix: pass as_index=False or reset_index() after.
  • Mistake: Using count when you mean size, or vice versa. Fix: size counts all rows; count skips NaN. Choose intentionally (see the sketch after this list).
  • Mistake: merge_asof without sorting. Fix: sort by key and set direction explicitly.
  • Mistake: Aggregating then losing keys. Fix: named aggregations with multi-keys preserve clarity.
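
A tiny toy frame (hypothetical) makes the size/count difference visible:
import pandas as pd
import numpy as np

df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': [1.0, np.nan, 2.0]})
print(df.groupby('grp')['val'].size())   # counts every row: a -> 2, b -> 1
print(df.groupby('grp')['val'].count())  # skips NaN:        a -> 1, b -> 1
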
Self-check routine (scripted as a sketch after the list)
  1. Row balance: after a left join, the result has at least as many rows as the left table; fewer means you used the wrong join type, more means duplicated keys on the right.
  2. Key uniqueness: value_counts on join keys; if duplicates unexpected, fix upstream.
  3. Spot-check: sample 5 groups and hand-calc sums/means to validate aggregation.
  4. Null audit: after left joins, inspect columns from the right table for unexpected NaNs.
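
The routine above can be scripted as lightweight assertions; a minimal sketch with hypothetical frames:
import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 1, 2, 9], 'amount': [10.0, 20.0, 5.0, 1.0]})
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'segment': ['a', 'b', 'c']})

# 2) Key uniqueness on the lookup side of a many-to-one join
assert not customers['customer_id'].duplicated().any(), 'duplicate customer_id'

# 1) Row balance: this left join should not change the left row count
enriched = orders.merge(customers, on='customer_id', how='left')
assert len(enriched) == len(orders), 'row count changed after left join'

# 4) Null audit: columns brought in from the right table
print(enriched['segment'].isna().sum(), 'orders without a matching customer')

# 3) Spot-check one group by hand
assert enriched.loc[enriched['customer_id'] == 1, 'amount'].sum() == 30.0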

Practical projects

  • Customer 360 mini-mart: join users, orders, and support tickets; compute LTV, AOV, and ticket rate by segment.
  • Experiment readout: join assignment logs to outcomes; groupby variant and segment; compute lift with CIs.
  • Operational dashboard: as-of join machine readings to shifts; aggregate KPIs hourly and by line.

Next steps

  • Apply these patterns to one of your datasets; write assertions on shapes before/after joins.
  • Learn reshaping: pivot_table, melt, stack/unstack to complement groupby.
  • Explore window operations: rolling, expanding, and groupby.cum* patterns (a small sketch follows this list).
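
As a taste of the window patterns in the last step, a small sketch with hypothetical daily sales:
import pandas as pd

daily = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B', 'B'],
    'day': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03',
                           '2024-01-01', '2024-01-02']),
    'sales': [10.0, 20.0, 30.0, 5.0, 15.0]
}).sort_values(['store', 'day'])

# Running total per store
daily['running_total'] = daily.groupby('store')['sales'].cumsum()

# 2-row rolling mean per store; reset_index re-aligns the result to the original rows
daily['rolling_mean'] = (
    daily.groupby('store')['sales']
    .rolling(window=2, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
)
print(daily)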

Mini challenge

Given events (user_id, ts, revenue) and users (user_id, signup_date, plan), compute:

  • Revenue in the first 30 days post-signup by plan.
  • Share of total revenue contributed by each plan.
Hints (a possible sketch follows)
  • Left join events to users, filter events within [signup, signup+30d).
  • Groupby plan to sum revenue; compute share using transform or by dividing by revenue.sum().
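
One way the hints could come together, sketched with hypothetical values:
import pandas as pd

users = pd.DataFrame({
    'user_id': [1, 2],
    'signup_date': pd.to_datetime(['2024-01-01', '2024-02-01']),
    'plan': ['free', 'pro']
})
events = pd.DataFrame({
    'user_id': [1, 1, 2],
    'ts': pd.to_datetime(['2024-01-10', '2024-03-01', '2024-02-05']),
    'revenue': [10.0, 99.0, 50.0]
})

# Join, then keep only events inside [signup, signup + 30 days)
joined = events.merge(users, on='user_id', how='left', validate='many_to_one')
in_window = (
    (joined['ts'] >= joined['signup_date'])
    & (joined['ts'] < joined['signup_date'] + pd.Timedelta(days=30))
)
first_30 = joined[in_window]

# Revenue by plan and each plan's share of that total
by_plan = first_30.groupby('plan', as_index=False)['revenue'].sum()
by_plan['share'] = by_plan['revenue'] / by_plan['revenue'].sum()
print(by_plan)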

Quick Test

Short quiz to check your understanding: 10 questions, pass with 70% or higher.
