
Data Wrangling With Pandas

Learn Data Wrangling With Pandas for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, most of your time is spent preparing data. Pandas is the workhorse you use to:

  • Load messy CSVs, spreadsheets, and JSON into tidy DataFrames.
  • Fix types, parse dates, standardize text, and handle missing values.
  • Merge data from multiple sources reliably.
  • Aggregate, reshape, and create features for modeling.
  • Build repeatable cleaning pipelines that other people (and future you) can trust.

Who this is for

Beginners to intermediate Python users who want a solid, job-ready workflow for preparing datasets for analysis and machine learning.

Prerequisites

  • Basic Python (variables, functions, lists, dicts).
  • Intro pandas (DataFrame, Series, selecting columns).
  • Comfort reading simple error messages.

Learning path

  1. Learn the mental model of DataFrames and tidy data.
  2. Load and inspect data safely.
  3. Clean values and types; handle missing data.
  4. Join datasets without surprises.
  5. Aggregate and reshape for analysis or modeling.
  6. Package steps into clear, reproducible pipelines.

Concept explained simply

Think of a DataFrame like a spreadsheet with superpowers: every column has a type (number, text, datetime), and you can transform large datasets with a few vectorized operations.

Mental model: Data wrangling is moving from raw to reliable. You take raw tables, apply a series of small, named steps (clean, type, join, shape), and end with a tidy table ready for charts or models.

Key ideas to remember
  • Tidy data: each variable is a column, each observation a row, each type of observational unit a table.
  • Use method chaining (df.something().something_else()) to express a clear pipeline.
  • Prefer vectorized operations over row-by-row loops for speed and clarity.
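
To make the last two ideas concrete, here is a minimal sketch (the tiny DataFrame and its columns are made up for illustration):

import pandas as pd

df = pd.DataFrame({'price': [10.0, 8.0, 12.5], 'qty': [2, 3, 1]})

# Vectorized: one expression over whole columns
revenue_vectorized = df['price'] * df['qty']

# Row-by-row loop: same result, slower and noisier
revenue_loop = [row['price'] * row['qty'] for _, row in df.iterrows()]

# Method chaining: each step reads top to bottom like a recipe
report = (
    df.assign(revenue=lambda d: d['price'] * d['qty'])
      .sort_values('revenue', ascending=False)
)
print(report)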

Worked examples

Example 1 — Load and inspect safely

import pandas as pd
from io import StringIO
raw = StringIO(
    """order_id,date,item,price,qty
1001,2023-01-02,  Widget A ,$12.5,2
1002,01/03/2023,Widget B,$8.00,3
1003,2023/01/04,widget a,$12.50,
1004,,Widget C,$5,1
"""
)
df = (pd.read_csv(raw)
        .assign(
            date=lambda d: pd.to_datetime(d['date'], errors='coerce'),
            item=lambda d: d['item'].str.strip().str.lower(),
            price=lambda d: d['price'].str.replace('$','', regex=False).astype(float),
            qty=lambda d: pd.to_numeric(d['qty'], errors='coerce')
        ))
df.info()  # info() prints its summary directly and returns None
print(df.head())

We parsed dates, cleaned text/currency, and made numeric columns numeric. errors='coerce' prevents crashes and marks bad values as NaT/NaN.
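
One caveat worth flagging: the sample dates mix several formats. Depending on your pandas version, to_datetime without a format may infer one layout from the first value and coerce the non-matching rows to NaT; on pandas 2.0+ you can ask it to parse each value individually with format='mixed' (a small sketch, continuing from the import above):

# pandas >= 2.0 only: each value is parsed with its own inferred format
mixed = pd.Series(['2023-01-02', '01/03/2023', '2023/01/04'])
print(pd.to_datetime(mixed, format='mixed', errors='coerce'))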

Example 2 — Handle missing values

# Decide policy: quantity missing -> assume 0; date missing -> drop row
clean = (df
    .assign(qty=lambda d: d['qty'].fillna(0))
    .dropna(subset=['date'])
)
# Create revenue feature
clean = clean.assign(revenue=lambda d: d['price'] * d['qty'])
print(clean)

Define missing-data rules explicitly in code so they are auditable.
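
One way to keep the policy auditable is to assert that it held, right after applying it (a minimal sketch building on the clean frame above):

# Fail loudly if the missing-data rules did not hold
assert clean['qty'].notna().all(), "qty should have no missing values left"
assert clean['date'].notna().all(), "rows with missing dates should be gone"
print("missing-data policy holds")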

Example 3 — Merge datasets

customers = pd.DataFrame({
    'customer_id': [1,2,3],
    'country': ['US','DE','US']
})
orders = pd.DataFrame({
    'order_id':[1001,1002,1005],
    'customer_id':[1,2,4]  # Note: customer 4 missing in customers
})
merged = orders.merge(customers, on='customer_id', how='left', indicator=True)
print(merged)
# Use _merge to audit join quality
print(merged['_merge'].value_counts())

Use indicator=True to see which rows matched. Left join keeps all orders and shows missing customer info clearly.
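
If you also want pandas to fail loudly when a key unexpectedly matches multiple rows, merge accepts a validate argument (a short sketch reusing the frames above):

# Raises MergeError if a customer_id appears more than once in customers
checked = orders.merge(
    customers, on='customer_id', how='left',
    validate='many_to_one'  # each order maps to at most one customer
)
assert len(checked) == len(orders)  # the left join added no duplicate rows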

Example 4 — Aggregate and reshape

sales = pd.DataFrame({
    'country':['US','US','DE','DE'],
    'item':['widget a','widget b','widget a','widget b'],
    'revenue':[100,60,80,50]
})
# Aggregate revenue per country
agg = sales.groupby('country', as_index=False)['revenue'].sum()
print(agg)
# Pivot to compare items by country
pivot = sales.pivot_table(index='country', columns='item', values='revenue', aggfunc='sum', fill_value=0)
print(pivot)

Groupby reduces rows; pivot reshapes long to wide for side-by-side comparison.
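
The reverse reshape (wide back to long) is melt; a small sketch using the pivot result from above:

# Wide back to long: one row per (country, item) pair
long_again = (
    pivot.reset_index()
         .melt(id_vars='country', var_name='item', value_name='revenue')
)
print(long_again)
# Sanity check: the round trip preserved total revenue
assert long_again['revenue'].sum() == sales['revenue'].sum()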

Example 5 — Build a readable pipeline

def clean_prices(s):
    return (s.str.replace('$','', regex=False)
              .str.replace(',','', regex=False)
              .astype(float))

pipeline = (
    pd.DataFrame({
        'date':['2023-03-01','2023-03-02','bad'],
        'price':['$1,200.50','$99.90','$5'],
        'qty':[1,2,None],
        'item':[' A ','b','A']
    })
    .assign(
        date=lambda d: pd.to_datetime(d['date'], errors='coerce'),
        price=lambda d: clean_prices(d['price']),
        qty=lambda d: d['qty'].fillna(0).astype(int),
        item=lambda d: d['item'].str.strip().str.lower().astype('category')
    )
    .dropna(subset=['date'])
)
print(pipeline.dtypes)
print(pipeline)

Small helper functions and assign keep your steps explicit and testable.
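
Because clean_prices is a plain function of a Series, you can check it in isolation (a minimal sketch; in a real project these asserts would live in a test file):

# Quick unit-style checks for the helper
sample = pd.Series(['$1,200.50', '$99.90', '$5'])
assert clean_prices(sample).tolist() == [1200.5, 99.9, 5.0]
assert clean_prices(sample).dtype == 'float64'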

Exercises

These mirror the coding exercises below. Try them before opening solutions. Tip: Copy the sample data exactly.

Exercise 1 — Clean messy sales CSV

Goal: read a CSV, standardize types, and compute revenue.

  • Read from a string buffer.
  • Strip item names and lowercase.
  • Convert price to float (remove currency symbol).
  • Convert qty to numeric; missing -> 0.
  • Parse date; drop rows with invalid date.
  • Add revenue = price * qty.
csv = """order_id,date,item,price,qty
2001,2023-02-01, Widget A ,$10.0,1
2002,02/03/2023,widget B,$7.5,2
2003,,Widget C,$5,
"""
# Produce a DataFrame with columns: order_id(int), date(datetime64), item(str lower), price(float), qty(int), revenue(float)

Exercise 2 — Join and aggregate

Goal: merge customers and orders, then compute revenue per country.

  • Left join orders to customers on customer_id.
  • Flag unmatched rows.
  • Aggregate revenue by country (missing country -> label as 'UNKNOWN').
orders_csv = """order_id,customer_id,revenue
3001,1,20
3002,2,15
3003,4,8
"""
customers_csv = """customer_id,country
1,US
2,FR
3,DE
"""

Exercise 3 — Reshape wide to long (melt) and back (pivot)

Goal: tidy monthly sales.

  • Start wide with columns Jan, Feb, Mar.
  • Melt to long with columns: item, month, sales.
  • Ensure month is ordered categorical (see the hint after the sample data).
  • Pivot back to wide; fill missing with 0.
import pandas as pd
wide = pd.DataFrame({
    'item':['a','b'],
    'Jan':[10,3],
    'Feb':[5,6],
    'Mar':[7, None]
})
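
Hint for the ordered month (a small sketch using the import above, not the full solution): an ordered categorical makes months sort in calendar order rather than alphabetically.

months = pd.Categorical(['Mar', 'Jan', 'Feb'],
                        categories=['Jan', 'Feb', 'Mar'], ordered=True)
print(pd.Series(months).sort_values())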

Exercise checklist

  • Data types are correct: numbers numeric, dates datetime, text standardized.
  • No unexpected NaNs remain unless explicitly allowed.
  • Joins audited: checked row counts and _merge indicator or value_counts after merge.
  • Aggregations have intended grouping keys and fill_value decisions documented.
  • Reshapes preserve totals (sanity-check sums before/after).

Common mistakes and self-check

  • Forgetting errors='coerce' when parsing numbers/dates. Self-check: inspect df.info() and df.isna().sum().
  • Silent many-to-many merges causing duplication. Self-check: compare pre/post row counts; use validate='one_to_one' or 'one_to_many'.
  • Mixing .loc and .iloc. Rule: .loc uses labels; .iloc uses integer positions.
  • Creating columns via chained indexing (SettingWithCopyWarning). Use df.loc[:, 'col'] = ... or df = df.assign(col=...); see the sketch after this list.
  • Aggregating without fill_value, leading to NaNs. Decide and document policies (e.g., fill_value=0).
  • Not standardizing text (case/whitespace). Normalize early: .str.strip().str.lower().
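
The chained-indexing pitfall is easiest to see side by side (a minimal sketch; the frame and column names are illustrative):

import pandas as pd

df = pd.DataFrame({'price': [10.0, 8.0], 'qty': [2, 3]})
subset = df[df['qty'] > 1]

# Risky chained indexing: may warn and silently fail to write where you expect
# subset['revenue'] = subset['price'] * subset['qty']

# Explicit alternatives
subset = subset.copy()
subset.loc[:, 'revenue'] = subset['price'] * subset['qty']
# or, staying in a chain:
subset = subset.assign(revenue=lambda d: d['price'] * d['qty'])
print(subset)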

Practical projects

  • Retail dashboard dataset: Clean orders, join products, compute daily revenue, top items per country, and return a pivot of revenue by week x country.
  • Survey tidy-up: Normalize Likert-scale responses, melt multi-select columns to long, and compute respondent-level features.
  • Feature store seed: From raw logs, parse timestamps, engineer session features, and export a model-ready parquet with typed columns.

Mini challenge

Given a DataFrame of transactions with columns: ts (string timestamps), user_id, amount (string like '$1,200'), and city:

  • Parse ts to datetime (UTC), extract date and hour.
  • Clean amount to float.
  • Standardize city names (trim, title case).
  • Output revenue per city per day as a pivot (rows: date, columns: city, values: revenue, missing -> 0).

Hint

Use assign + to_datetime + str methods + pivot_table with fill_value=0.
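
For the UTC step specifically, to_datetime takes a utc flag and the .dt accessor exposes date and hour (a small illustration with made-up timestamps, not the full solution):

import pandas as pd

ts = pd.Series(['2023-05-01 14:30:00+02:00', '2023-05-01 09:00:00-05:00'])
parsed = pd.to_datetime(ts, utc=True, errors='coerce')
print(parsed.dt.date)  # calendar date in UTC
print(parsed.dt.hour)  # hour of day in UTC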

Next steps

  • Practice method chaining on your own data; aim for 8–12 clear steps.
  • Add data validation checks (row counts, unique keys) as assertions.
  • Learn parquet and feather for faster IO; keep types consistent across runs.
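
A sketch of what those checks and a typed export can look like (the file name and expectations here are illustrative):

import pandas as pd

df = pd.DataFrame({'order_id': [1, 2, 3], 'revenue': [10.0, 7.5, 5.0]})

# Validation as assertions: fail fast when assumptions break
assert len(df) > 0, "expected at least one row"
assert df['order_id'].is_unique, "order_id must be a unique key"

# Typed columnar output (requires pyarrow or fastparquet)
df.to_parquet('orders_clean.parquet', index=False)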

Quick test

This quick test is available to everyone; only logged-in users get saved progress.

Practice Exercises

3 exercises to complete

Instructions

Read the CSV from a string, standardize types, and compute revenue.

  1. Load CSV using StringIO.
  2. Strip item names and convert to lowercase.
  3. Convert price from '$12.5' to float.
  4. Convert qty to numeric; missing -> 0; make it int.
  5. Parse date; drop rows with invalid date.
  6. Add revenue = price * qty.
from io import StringIO
import pandas as pd
csv = """order_id,date,item,price,qty
2001,2023-02-01, Widget A ,$10.0,1
2002,02/03/2023,widget B,$7.5,2
2003,,Widget C,$5,
"""
raw = StringIO(csv)
# Your code below

Expected Output
A DataFrame with columns [order_id:int, date:datetime64, item:str lowercased, price:float, qty:int, revenue:float] containing 2 valid rows (order_id 2001 and 2002).

Data Wrangling With Pandas — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

