Who this is for
Data Analysts who already use pandas basics and want confident, repeatable patterns for combining tables: merging on keys, joining on index, and stacking or aligning datasets.
Why this matters
- Create a customer 360 by merging customers, orders, and support tickets.
- Build weekly or monthly performance tables by concatenating CSV extracts.
- Enrich features by joining lookup tables (e.g., region names, product categories).
- Run anti-joins to find unmatched records for data quality checks.
Prerequisites
- Python basics and pandas DataFrame operations (selecting columns, filtering, basic dtype handling).
- Comfort with indices and column names.
Concept explained simply
Think of merging as matching rows based on keys (columns or index). Concat is stacking blocks either vertically (more rows) or horizontally (more columns). Join is a convenience method for index-based merges.
Mental model
Picture a Venn diagram of keys between two tables. The join type (inner/left/right/outer) decides which circles you include. Concat is like piling sheets of paper: row-wise adds more pages; column-wise adds more columns side by side.
Core patterns and syntax
Merge by columns (most common)
import pandas as pd
# Basic
df = pd.merge(left, right, on="key", how="left") # inner | left | right | outer
# Different column names
pd.merge(left, right, left_on="lkey", right_on="rkey", how="inner")
# Disambiguate overlapping column names
pd.merge(a, b, on="id", how="left", suffixes=("_a", "_b"))
# Check relationship to avoid row explosions
pd.merge(a, b, on="id", how="left", validate="one_to_one") # or "one_to_many", "many_to_one", "many_to_many"
# Inspect matches
pd.merge(a, b, on="id", how="left", indicator=True)
Join by index
# Set index, then join
lookup = lookup.set_index("key")
main = main.set_index("key")
main.join(lookup, how="left") # inner | left | right | outer
# Or with merge using index
a.merge(b, left_index=True, right_index=True, how="inner")
Concat (stack or align)
# Row-wise: stack more rows of same schema
pd.concat([jan, feb, mar], axis=0, ignore_index=True)
# Column-wise: align on index to add new columns
pd.concat([metrics, segments], axis=1)
# Track origins with keys to build a MultiIndex
pd.concat([jan, feb], keys=["Jan", "Feb"], axis=0)
Note: DataFrame.append is deprecated in modern pandas. Use pd.concat([df1, df2], ignore_index=True).
Append (modern pattern)
Use concat instead of append:
# Instead of: df1.append(df2)
combined = pd.concat([df1, df2], ignore_index=True)
Worked examples
1) Customer to Orders (left merge with indicator)
import pandas as pd
customers = pd.DataFrame({
"customer_id": [101, 102, 103, 104],
"name": ["Ava", "Ben", "Cara", "Drew"]
})
orders = pd.DataFrame({
"customer_id": ["101", "101", "103", "105"], # string ids
"order_id": [1, 2, 3, 4],
"amount": [20, 35, 15, 50]
})
# Align dtypes for safe merge
orders["customer_id"] = orders["customer_id"].astype(int)
out = pd.merge(
customers, orders,
on="customer_id", how="left",
indicator=True, suffixes=("", "_order")
).sort_values(["customer_id", "order_id"], na_position="last")
print(out)
Result includes all customers; unmatched show NaN and _merge='left_only'.
2) Join features by index (cleaner than re-specifying keys)
users = pd.DataFrame({"id": [1,2,3], "name": ["Ana","Bo","Cy"]}).set_index("id")
regions = pd.DataFrame({"id": [1,2,4], "region": ["NA","EU","APAC"]}).set_index("id")
# Left join: keep all users
features = users.join(regions, how="left")
print(features)
Users without a region stay in the result; region is NaN.
3) Concatenate monthly datasets (row-wise) and tag origin
jan = pd.DataFrame({"session_id":[1,2,3], "user":["A","B","C"], "events":[5,2,7]})
feb = pd.DataFrame({"session_id":[3,4], "user":["C","D"], "events":[4,1]})
# Add a month column and stack
jan2 = jan.assign(month="Jan")
feb2 = feb.assign(month="Feb")
all_sessions = pd.concat([jan2, feb2], axis=0, ignore_index=True)
print(all_sessions)
Now you can group by user or month, and sessions persist across months.
Practice exercises
Solve these inside your notebook. The same tasks appear below with a solution toggle.
Exercise 1 — Left merge customers and orders (with dtype fix)
Create the two DataFrames below and produce the expected output. Use a left merge on customer_id, align dtypes, include indicator, sort by customer_id then order_id.
import pandas as pd
customers = pd.DataFrame({
"customer_id": [101, 102, 103, 104],
"name": ["Ava", "Ben", "Cara", "Drew"]
})
orders = pd.DataFrame({
"customer_id": ["101", "101", "103", "105"],
"order_id": [1, 2, 3, 4],
"amount": [20, 35, 15, 50]
})
Expected output:
customer_id name order_id amount _merge
101 Ava 1 20 both
101 Ava 2 35 both
102 Ben NaN NaN left_only
103 Cara 3 15 both
104 Drew NaN NaN left_only
See hints
- Convert orders.customer_id to int before merging.
- Use pd.merge with how='left', indicator=True.
- Sort values by ['customer_id','order_id'].
Exercise 2 — Concat row-wise and tag month
Make these DataFrames and stack them row-wise. Add a month column to each before concatenation and ignore the index.
jan = pd.DataFrame({"session_id":[1,2,3], "user":["A","B","C"], "events":[5,2,7]})
feb = pd.DataFrame({"session_id":[3,4], "user":["C","D"], "events":[4,1]})
Expected output:
session_id user events month
1 A 5 Jan
2 B 2 Jan
3 C 7 Jan
3 C 4 Feb
4 D 1 Feb
See hints
- Use assign(month='Jan') and assign(month='Feb').
- pd.concat([...], axis=0, ignore_index=True)
Checklist before you continue
- I know when to use merge vs join vs concat.
- I can safely prevent row explosions with validate=...
- I can track match results with indicator=True.
- I can replace append with pd.concat consistently.
- I can align dtypes before merging.
Common mistakes and self-check
- Using DataFrame.append: In modern pandas, use pd.concat instead.
- Mismatched dtypes: Keys as int vs str cause zero matches. Self-check: customers.customer_id.dtype equals orders.customer_id.dtype.
- Accidental many-to-many merges: Unintended row multiplication. Self-check: use validate to enforce expected cardinality.
- Forgetting suffixes: Overlapping columns silently overwrite. Use suffixes=("_left","_right").
- Column-wise concat without aligned index: Misaligned rows. Ensure indices match or reset/reindex explicitly.
- Outer merges without handling NaN: Follow up with fillna or explicit checks.
Mini challenge
Sales enrichment pipeline:
- Concat three months of transactions (Jan, Feb, Mar) row-wise and tag month.
- Merge with a product lookup to add category and list_price.
- Add a flag price_mismatch = (paid_price != list_price) and compute mismatch rate by month.
- Guard with validate to avoid many-to-many merges.
Tip
Set product lookup to index by product_id and use left joins from transactions.
Practical projects
- Marketing attribution table: concat daily campaign logs, then left-merge user attributes to analyze performance by segment.
- Data quality dashboard: anti-join customer master to orders to list orphan orders or orphan customers.
- Feature store prototype: join multiple lookups (region, plan, cohort) to a users table; ensure cardinalities with validate.
Learning path
- After joins/concat: GroupBy and aggregation to summarize the combined data.
- Then: Pivot tables and reshaping (melt/pivot) for reporting.
- Finally: Data validation patterns (assertions, dtype checks) to productionize notebooks.
Next steps
- Refactor your current project to replace append with concat and add validate to critical merges.
- Create reusable helper functions for common joins (e.g., join_products(df)).
- Document your merge keys, expected cardinalities, and post-merge row counts.
Quick Test
Take the quick test below. Available to everyone. Only logged-in users get saved progress.