Why this matters
As a data scientist, you spend most of your time preparing data. pandas is the workhorse you use to:
- Load messy CSVs, spreadsheets, and JSON into tidy DataFrames.
- Fix types, parse dates, standardize text, and handle missing values.
- Merge data from multiple sources reliably.
- Aggregate, reshape, and create features for modeling.
- Build repeatable cleaning pipelines that other people (and future you) can trust.
Who this is for
Beginner-to-intermediate Python users who want a solid, job-ready workflow for preparing datasets for analysis and machine learning.
Prerequisites
- Basic Python (variables, functions, lists, dicts).
- Intro pandas (DataFrame, Series, selecting columns).
- Comfort reading simple error messages.
Learning path
- Learn the mental model of DataFrames and tidy data.
- Load and inspect data safely.
- Clean values and types; handle missing data.
- Join datasets without surprises.
- Aggregate and reshape for analysis or modeling.
- Package steps into clear, reproducible pipelines.
Concept explained simply
Think of a DataFrame like a spreadsheet with superpowers: every column has a type (number, text, datetime), and you can transform large datasets with a few vectorized operations.
Mental model: Data wrangling is moving from raw to reliable. You take raw tables, apply a series of small, named steps (clean, type, join, shape), and end with a tidy table ready for charts or models.
Key ideas to remember
- Tidy data: each variable is a column, each observation a row, each type of observational unit a table.
- Use method chaining (df.something().something_else()) to express a clear pipeline.
- Prefer vectorized operations over row-by-row loops for speed and clarity (see the tiny sketch below).
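A tiny sketch tying these ideas together — one chained, vectorized pipeline over a toy table (the names here are illustrative, not from the examples below):
import pandas as pd

tiny = pd.DataFrame({'price': ['$5', '$7'], 'qty': [2, 3]})
tidy = (
    tiny
    .assign(price=lambda d: d['price'].str.replace('$', '', regex=False).astype(float))
    .assign(revenue=lambda d: d['price'] * d['qty'])  # vectorized: no Python loop
)
print(tidy)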
Worked examples
Example 1 — Load and inspect safely
import pandas as pd
from io import StringIO
raw = StringIO(
"""order_id,date,item,price,qty
1001,2023-01-02, Widget A ,$12.5,2
1002,01/03/2023,Widget B,$8.00,3
1003,2023/01/04,widget a,$12.50,
1004,,Widget C,$5,1
"""
)
df = (
    pd.read_csv(raw)
    .assign(
        # format='mixed' (pandas >= 2.0) parses each date string individually,
        # so the three different date styles above all come through
        date=lambda d: pd.to_datetime(d['date'], format='mixed', errors='coerce'),
        item=lambda d: d['item'].str.strip().str.lower(),
        price=lambda d: d['price'].str.replace('$', '', regex=False).astype(float),
        qty=lambda d: pd.to_numeric(d['qty'], errors='coerce'),
    )
)
df.info()  # .info() prints directly and returns None, so no print() needed
print(df.head())

We parsed dates, cleaned text/currency, and made numeric columns numeric. errors='coerce' prevents crashes and marks bad values as NaT/NaN, and format='mixed' (pandas 2.0+) lets rows with different date styles parse individually.
Example 2 — Handle missing values
# Decide policy: quantity missing -> assume 0; date missing -> drop row
clean = (
    df
    .assign(qty=lambda d: d['qty'].fillna(0))
    .dropna(subset=['date'])
)
# Create revenue feature
clean = clean.assign(revenue=lambda d: d['price'] * d['qty'])
print(clean)

Define missing-data rules explicitly in code so they are auditable.
Example 3 — Merge datasets
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'country': ['US', 'DE', 'US']
})
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1005],
    'customer_id': [1, 2, 4]  # Note: customer 4 missing in customers
})
merged = orders.merge(customers, on='customer_id', how='left', indicator=True)
print(merged)
# Use _merge to audit join quality
print(merged['_merge'].value_counts())

Use indicator=True to see which rows matched. A left join keeps all orders and shows missing customer info clearly.
Example 4 — Aggregate and reshape
sales = pd.DataFrame({
    'country': ['US', 'US', 'DE', 'DE'],
    'item': ['widget a', 'widget b', 'widget a', 'widget b'],
    'revenue': [100, 60, 80, 50]
})
# Aggregate revenue per country
agg = sales.groupby('country', as_index=False)['revenue'].sum()
print(agg)
# Pivot to compare items by country
pivot = sales.pivot_table(index='country', columns='item', values='revenue', aggfunc='sum', fill_value=0)
print(pivot)

Groupby reduces rows; pivot reshapes long to wide for side-by-side comparison.
Example 5 — Build a readable pipeline
def clean_prices(s):
    return (s.str.replace('$', '', regex=False)
             .str.replace(',', '', regex=False)
             .astype(float))
pipeline = (
    pd.DataFrame({
        'date': ['2023-03-01', '2023-03-02', 'bad'],
        'price': ['$1,200.50', '$99.90', '$5'],
        'qty': [1, 2, None],
        'item': [' A ', 'b', 'A']
    })
    .assign(
        date=lambda d: pd.to_datetime(d['date'], errors='coerce'),
        price=lambda d: clean_prices(d['price']),
        qty=lambda d: d['qty'].fillna(0).astype(int),
        item=lambda d: d['item'].str.strip().str.lower().astype('category')
    )
    .dropna(subset=['date'])
)
print(pipeline.dtypes)
print(pipeline)

Small helper functions and assign keep your steps explicit and testable.
Exercises
These mirror the coding exercises below. Try them before opening solutions. Tip: Copy the sample data exactly.
Exercise 1 — Clean messy sales CSV
Goal: read a CSV, standardize types, and compute revenue.
- Read from a string buffer.
- Strip item names and lowercase.
- Convert price to float (remove currency symbol).
- Convert qty to numeric; missing -> 0.
- Parse date; drop rows with invalid date.
- Add revenue = price * qty.
csv = """order_id,date,item,price,qty
2001,2023-02-01, Widget A ,$10.0,1
2002,02/03/2023,widget B,$7.5,2
2003,,Widget C,$5,
"""
# Produce a DataFrame with columns: order_id(int), date(datetime64), item(str lower), price(float), qty(int), revenue(float)

Exercise 2 — Join and aggregate
Goal: merge customers and orders, then compute revenue per country.
- Left join orders to customers on customer_id.
- Flag unmatched rows.
- Aggregate revenue by country (missing country -> label as 'UNKNOWN').
orders_csv = """order_id,customer_id,revenue
3001,1,20
3002,2,15
3003,4,8
"""
customers_csv = """customer_id,country
1,US
2,FR
3,DE
"""Exercise 3 — Reshape wide to long (melt) and back (pivot)
Goal: tidy monthly sales.
- Start wide with columns Jan, Feb, Mar.
- Melt to long with columns: item, month, sales.
- Ensure month is ordered categorical.
- Pivot back to wide; fill missing with 0.
import pandas as pd
wide = pd.DataFrame({
    'item': ['a', 'b'],
    'Jan': [10, 3],
    'Feb': [5, 6],
    'Mar': [7, None]
})

Exercise checklist
- Data types are correct: numbers numeric, dates datetime, text standardized.
- No unexpected NaNs remain unless explicitly allowed.
- Joins audited: checked row counts and _merge indicator or value_counts after merge.
- Aggregations have intended grouping keys and fill_value decisions documented.
- Reshapes preserve totals (sanity-check sums before/after; see the sketch below).
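For that last check, a minimal sketch reusing the sales and pivot objects from Example 4:
# Sanity check: a reshape should not create or destroy revenue
before = sales['revenue'].sum()
after = pivot.to_numpy().sum()
assert before == after, f'reshape changed totals: {before} != {after}'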
Common mistakes and self-check
- Forgetting errors='coerce' when parsing numbers/dates. Self-check: inspect df.info() and df.isna().sum().
- Silent many-to-many merges causing duplication. Self-check: compare pre/post row counts; use validate='one_to_one' or 'one_to_many' (see the sketch after this list).
- Mixing .loc and .iloc. Rule: .loc uses labels; .iloc uses integer positions.
- Creating columns via chained indexing (SettingWithCopyWarning). Use df.loc[:, 'col'] = ... or df = df.assign(col=...).
- Pivoting without fill_value, leaving unexpected NaNs. Decide and document policies (e.g., fill_value=0).
- Not standardizing text (case/whitespace). Normalize early: .str.strip().str.lower().
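Two of these self-checks as a short sketch, reusing orders and customers from Example 3 (validate='many_to_one' fits this join direction):
# Audit a join: row count must be unchanged; validate= raises on duplicate keys
n_before = len(orders)
merged = orders.merge(customers, on='customer_id', how='left',
                      indicator=True, validate='many_to_one')
assert len(merged) == n_before, 'left join changed the row count'

# Safe column creation: explicit .loc avoids SettingWithCopyWarning
merged.loc[:, 'matched'] = merged['_merge'].eq('both')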
Practical projects
- Retail dashboard dataset: Clean orders, join products, compute daily revenue, top items per country, and return a pivot of revenue by week x country.
- Survey tidy-up: Normalize Likert-scale responses, melt multi-select columns to long, and compute respondent-level features.
- Feature store seed: From raw logs, parse timestamps, engineer session features, and export a model-ready parquet with typed columns.
Mini challenge
Given a DataFrame of transactions with columns: ts (string timestamps), user_id, amount (string like '$1,200'), and city:
- Parse ts to datetime (UTC), extract date and hour.
- Clean amount to float.
- Standardize city names (trim, title case).
- Output revenue per city per day as a pivot (rows: date, columns: city, values: revenue, missing -> 0).
Hint
Use assign + to_datetime + str methods + pivot_table with fill_value=0.
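One possible skeleton (a sketch, not the reference solution; it assumes the transactions sit in a DataFrame named tx):
result = (
    tx
    .assign(
        ts=lambda d: pd.to_datetime(d['ts'], errors='coerce', utc=True),
        date=lambda d: d['ts'].dt.date,
        hour=lambda d: d['ts'].dt.hour,
        amount=lambda d: (d['amount'].str.replace('$', '', regex=False)
                                     .str.replace(',', '', regex=False)
                                     .astype(float)),
        city=lambda d: d['city'].str.strip().str.title(),
    )
    # each transaction's amount is its revenue, so summing gives revenue per cell
    .pivot_table(index='date', columns='city', values='amount',
                 aggfunc='sum', fill_value=0)
)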
Next steps
- Practice method chaining on your own data; aim for 8–12 clear steps.
- Add data validation checks (row counts, unique keys) as assertions (sketch below).
- Learn parquet and feather for faster IO; keep types consistent across runs.
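For the validation bullet, a couple of assertion patterns (a sketch using the clean frame from Example 2; adapt the key names to your data):
# Fail loudly instead of passing bad data downstream
assert clean['order_id'].is_unique, 'duplicate order ids'
assert clean['date'].notna().all(), 'unparsed dates remain'
assert clean['price'].ge(0).all(), 'negative prices found'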