Why this matters
As a data scientist, you spend most of your time preparing data. pandas is the workhorse you use to:
- Load messy CSVs, spreadsheets, and JSON into tidy DataFrames.
- Fix types, parse dates, standardize text, and handle missing values.
- Merge data from multiple sources reliably.
- Aggregate, reshape, and create features for modeling.
- Build repeatable cleaning pipelines that other people (and future you) can trust.
Who this is for
Beginner-to-intermediate Python users who want a solid, job-ready workflow for preparing datasets for analysis and machine learning.
Prerequisites
- Basic Python (variables, functions, lists, dicts).
- Intro pandas (DataFrame, Series, selecting columns).
- Comfort reading simple error messages.
Learning path
- Learn the mental model of DataFrames and tidy data.
- Load and inspect data safely.
- Clean values and types; handle missing data.
- Join datasets without surprises.
- Aggregate and reshape for analysis or modeling.
- Package steps into clear, reproducible pipelines.
Concept explained simply
Think of a DataFrame like a spreadsheet with superpowers: every column has a type (number, text, datetime), and you can transform large datasets with a few vectorized operations.
Mental model: Data wrangling is moving from raw to reliable. You take raw tables, apply a series of small, named steps (clean, type, join, shape), and end with a tidy table ready for charts or models.
Key ideas to remember
- Tidy data: each variable is a column, each observation a row, each type of observational unit a table.
- Use method chaining (df.something().something_else()) to express a clear pipeline.
- Prefer vectorized operations over row-by-row loops for speed and clarity (see the tiny sketch below).
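A tiny sketch tying these ideas together — one chained, vectorized pipeline over a toy table (the names here are illustrative, not from the examples below):
import pandas as pd

tiny = pd.DataFrame({'price': ['$5', '$7'], 'qty': [2, 3]})
tidy = (
    tiny
    .assign(price=lambda d: d['price'].str.replace('$', '', regex=False).astype(float))
    .assign(revenue=lambda d: d['price'] * d['qty'])  # vectorized: no Python loop
)
print(tidy)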
Worked examples
Example 1 — Load and inspect safely
import pandas as pd
from io import StringIO
raw = StringIO(
"""order_id,date,item,price,qty
1001,2023-01-02, Widget A ,$12.5,2
1002,01/03/2023,Widget B,$8.00,3
1003,2023/01/04,widget a,$12.50,
1004,,Widget C,$5,1
"""
)
df = (
    pd.read_csv(raw)
    .assign(
        # format='mixed' (pandas >= 2.0) parses each date string individually,
        # so the three different date styles above all come through
        date=lambda d: pd.to_datetime(d['date'], format='mixed', errors='coerce'),
        item=lambda d: d['item'].str.strip().str.lower(),
        price=lambda d: d['price'].str.replace('$', '', regex=False).astype(float),
        qty=lambda d: pd.to_numeric(d['qty'], errors='coerce'),
    )
)
df.info()  # .info() prints directly and returns None, so no print() needed
print(df.head())

We parsed dates, cleaned text/currency, and made numeric columns numeric. errors='coerce' prevents crashes and marks bad values as NaT/NaN, and format='mixed' (pandas 2.0+) lets rows with different date styles parse individually.
Example 2 — Handle missing values
# Decide policy: quantity missing -> assume 0; date missing -> drop row
clean = (
    df
    .assign(qty=lambda d: d['qty'].fillna(0))
    .dropna(subset=['date'])
)
# Create revenue feature
clean = clean.assign(revenue=lambda d: d['price'] * d['qty'])
print(clean)

Define missing-data rules explicitly in code so they are auditable.
Example 3 — Merge datasets
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'country': ['US', 'DE', 'US']
})
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1005],
    'customer_id': [1, 2, 4]  # Note: customer 4 missing in customers
})
merged = orders.merge(customers, on='customer_id', how='left', indicator=True)
print(merged)
# Use _merge to audit join quality
print(merged['_merge'].value_counts())

Use indicator=True to see which rows matched. A left join keeps all orders and shows missing customer info clearly.
Example 4 — Aggregate and reshape
sales = pd.DataFrame({
    'country': ['US', 'US', 'DE', 'DE'],
    'item': ['widget a', 'widget b', 'widget a', 'widget b'],
    'revenue': [100, 60, 80, 50]
})
# Aggregate revenue per country
agg = sales.groupby('country', as_index=False)['revenue'].sum()
print(agg)
# Pivot to compare items by country
pivot = sales.pivot_table(index='country', columns='item', values='revenue', aggfunc='sum', fill_value=0)
print(pivot)

Groupby reduces rows; pivot reshapes long to wide for side-by-side comparison.
Example 5 — Build a readable pipeline
def clean_prices(s):
    return (s.str.replace('$', '', regex=False)
             .str.replace(',', '', regex=False)
             .astype(float))
pipeline = (
    pd.DataFrame({
        'date': ['2023-03-01', '2023-03-02', 'bad'],
        'price': ['$1,200.50', '$99.90', '$5'],
        'qty': [1, 2, None],
        'item': [' A ', 'b', 'A']
    })
    .assign(
        date=lambda d: pd.to_datetime(d['date'], errors='coerce'),
        price=lambda d: clean_prices(d['price']),
        qty=lambda d: d['qty'].fillna(0).astype(int),
        item=lambda d: d['item'].str.strip().str.lower().astype('category')
    )
    .dropna(subset=['date'])
)
print(pipeline.dtypes)
print(pipeline)

Small helper functions and assign keep your steps explicit and testable.
Exercises
These mirror the coding exercises below. Try them before opening solutions. Tip: Copy the sample data exactly.
Exercise 1 — Clean messy sales CSV
Goal: read a CSV, standardize types, and compute revenue.
- Read from a string buffer.
- Strip item names and lowercase.
- Convert price to float (remove currency symbol).
- Convert qty to numeric; missing -> 0.
- Parse date; drop rows with invalid date.
- Add revenue = price * qty.
csv = """order_id,date,item,price,qty
2001,2023-02-01, Widget A ,$10.0,1
2002,02/03/2023,widget B,$7.5,2
2003,,Widget C,$5,
"""
# Produce a DataFrame with columns: order_id(int), date(datetime64), item(str lower), price(float), qty(int), revenue(float)

Exercise 2 — Join and aggregate
Goal: merge customers and orders, then compute revenue per country.
- Left join orders to customers on customer_id.
- Flag unmatched rows.
- Aggregate revenue by country (missing country -> label as 'UNKNOWN').
orders_csv = """order_id,customer_id,revenue
3001,1,20
3002,2,15
3003,4,8
"""
customers_csv = """customer_id,country
1,US
2,FR
3,DE
"""Exercise 3 — Reshape wide to long (melt) and back (pivot)
Goal: tidy monthly sales.
- Start wide with columns Jan, Feb, Mar.
- Melt to long with columns: item, month, sales.
- Ensure month is ordered categorical.
- Pivot back to wide; fill missing with 0.
import pandas as pd
wide = pd.DataFrame({
    'item': ['a', 'b'],
    'Jan': [10, 3],
    'Feb': [5, 6],
    'Mar': [7, None]
})

Exercise checklist
- Data types are correct: numbers numeric, dates datetime, text standardized.
- No unexpected NaNs remain unless explicitly allowed.
- Joins audited: checked row counts and _merge indicator or value_counts after merge.
- Aggregations have intended grouping keys and fill_value decisions documented.
- Reshapes preserve totals (sanity-check sums before/after; see the sketch below).
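For that last check, a minimal sketch reusing the sales and pivot objects from Example 4:
# Sanity check: a reshape should not create or destroy revenue
before = sales['revenue'].sum()
after = pivot.to_numpy().sum()
assert before == after, f'reshape changed totals: {before} != {after}'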
Common mistakes and self-check
- Forgetting errors='coerce' when parsing numbers/dates. Self-check: inspect df.info() and df.isna().sum().
- Silent many-to-many merges causing duplication. Self-check: compare pre/post row counts; use validate='one_to_one' or 'one_to_many' (see the sketch after this list).
- Mixing .loc and .iloc. Rule: .loc uses labels; .iloc uses integer positions.
- Creating columns via chained indexing (SettingWithCopyWarning). Use df.loc[:, 'col'] = ... or df = df.assign(col=...).
- Pivoting without fill_value, leaving unexpected NaNs. Decide and document policies (e.g., fill_value=0).
- Not standardizing text (case/whitespace). Normalize early: .str.strip().str.lower().
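Two of these self-checks as a short sketch, reusing orders and customers from Example 3 (validate='many_to_one' fits this join direction):
# Audit a join: row count must be unchanged; validate= raises on duplicate keys
n_before = len(orders)
merged = orders.merge(customers, on='customer_id', how='left',
                      indicator=True, validate='many_to_one')
assert len(merged) == n_before, 'left join changed the row count'

# Safe column creation: explicit .loc avoids SettingWithCopyWarning
merged.loc[:, 'matched'] = merged['_merge'].eq('both')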
Practical projects
- Retail dashboard dataset: Clean orders, join products, compute daily revenue, top items per country, and return a pivot of revenue by week x country.
- Survey tidy-up: Normalize Likert-scale responses, melt multi-select columns to long, and compute respondent-level features.
- Feature store seed: From raw logs, parse timestamps, engineer session features, and export a model-ready parquet with typed columns.
Mini challenge
Given a DataFrame of transactions with columns: ts (string timestamps), user_id, amount (string like '$1,200'), and city:
- Parse ts to datetime (UTC), extract date and hour.
- Clean amount to float.
- Standardize city names (trim, title case).
- Output revenue per city per day as a pivot (rows: date, columns: city, values: revenue, missing -> 0).
Hint
Use assign + to_datetime + str methods + pivot_table with fill_value=0.
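One possible skeleton (a sketch, not the reference solution; it assumes the transactions sit in a DataFrame named tx):
result = (
    tx
    .assign(
        ts=lambda d: pd.to_datetime(d['ts'], errors='coerce', utc=True),
        date=lambda d: d['ts'].dt.date,
        hour=lambda d: d['ts'].dt.hour,
        amount=lambda d: (d['amount'].str.replace('$', '', regex=False)
                                     .str.replace(',', '', regex=False)
                                     .astype(float)),
        city=lambda d: d['city'].str.strip().str.title(),
    )
    # each transaction's amount is its revenue, so summing gives revenue per cell
    .pivot_table(index='date', columns='city', values='amount',
                 aggfunc='sum', fill_value=0)
)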
Next steps
- Practice method chaining on your own data; aim for 8–12 clear steps.
- Add data validation checks (row counts, unique keys) as assertions (sketch below).
- Learn parquet and feather for faster IO; keep types consistent across runs.
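For the validation bullet, a couple of assertion patterns (a sketch using the clean frame from Example 2; adapt the key names to your data):
# Fail loudly instead of passing bad data downstream
assert clean['order_id'].is_unique, 'duplicate order ids'
assert clean['date'].notna().all(), 'unparsed dates remain'
assert clean['price'].ge(0).all(), 'negative prices found'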