
Python pandas

Learn Python pandas for Data Analysts for free: roadmap, examples, subskills, and a skill exam.

Published: December 20, 2025 | Updated: December 20, 2025

Why pandas matters for Data Analysts

Pandas is the backbone of day-to-day analytics in Python. It lets you load messy real-world data, clean it quickly, explore patterns, and produce trustworthy tables, charts, and outputs for stakeholders. As a Data Analyst, pandas speeds up tasks like: importing CSV/Excel/SQL data, fixing missing values, reshaping tables for reporting, aggregating KPIs, and exporting clean results for dashboards.

  • Typical tasks it unlocks: quick EDA, KPI computation, cohort/retention tables, A/B summaries, funnel steps, time-series rollups, feature preparation for modeling, and reproducible data prep pipelines.
  • Time-to-value: minutes to load, clean, and summarize real data—ideal for ad-hoc questions.

Who this is for

  • Aspiring and junior Data Analysts who want a practical, job-ready Python data workflow.
  • Excel/SQL users transitioning to Python for automation and larger datasets.
  • Analysts who need reproducible, documented data prep pipelines.

Prerequisites

  • Basic Python (variables, functions, lists/dicts) and comfort with the terminal.
  • Familiarity with tabular data (rows/columns) and basic statistics (mean, median).
  • Nice-to-have: basic SQL and Excel; you'll map those habits to pandas quickly.

Learning path (pandas roadmap)

  1. Setup: Install Python, pandas, and a notebook environment (e.g., Jupyter). Learn how to create and activate a virtual environment.
  2. Data I/O: Read and write CSV, Excel, JSON, Parquet; connect to SQL.
  3. Core objects: Series and DataFrame basics; understand indexes.
  4. Select/filter: loc/iloc, boolean masks, sorting, ranking.
  5. Clean: Handle missing values, types, strings, datetimes, categories.
  6. Transform: Create columns, assign/where, apply/map/vectorize.
  7. Aggregate: groupby, multiple aggregations, transform/filter.
  8. Reshape/join: merge, concat, pivot, melt, MultiIndex.
  9. Performance: memory, profiling, chunking, speed-ups.
  10. Outputs: validation, tidy exports, reproducible pipelines, basic plots.

Milestone checklist

  • Load 3 formats (CSV, Excel, Parquet) and write outputs (see the I/O sketch after this checklist).
  • Use loc/iloc, boolean masks, sort_values, and rank.
  • Clean missing values with fillna/dropna and type conversions.
  • Do a groupby with multiple aggregations and a transform step.
  • Join two tables and reshape with pivot_table.
  • Validate expectations and export a final report.
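
To check off the first milestone, here is a minimal I/O sketch. The file names (orders.csv, budget.xlsx, events.parquet) are placeholders, and the Excel and Parquet readers assume the optional openpyxl and pyarrow (or fastparquet) packages are installed.

import pandas as pd

# Placeholder file names; swap in your own data.
orders = pd.read_csv("orders.csv")
budget = pd.read_excel("budget.xlsx", sheet_name=0)
events = pd.read_parquet("events.parquet")

# Quick sanity checks after loading
for name, df in {"orders": orders, "budget": budget, "events": events}.items():
    print(name, df.shape)

# Write a cleaned copy back out in each format
orders.to_csv("orders_clean.csv", index=False)
orders.to_excel("orders_clean.xlsx", index=False)
orders.to_parquet("orders_clean.parquet", index=False)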

Worked examples

1) Quick sales summary by region and month

import pandas as pd

sales = pd.DataFrame({
    "order_id": [1,2,3,4,5,6],
    "region": ["NA","EU","NA","APAC","EU","NA"],
    "order_date": pd.to_datetime(["2024-01-03","2024-01-18","2024-02-02","2024-02-12","2024-02-25","2024-03-01"]),
    "revenue": [120, 80, 60, 200, 90, 130]
})

sales["month"] = sales["order_date"].dt.to_period("M").dt.to_timestamp()
summary = (sales
    .groupby(["region","month"], as_index=False)
    .agg(revenue_sum=("revenue","sum"), orders=("order_id","count"))
)
print(summary)

What you get: monthly revenue and order counts per region for reporting.
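
The milestone checklist also asks for a transform step alongside aggregations. A small self-contained sketch of the difference: transform returns one value per original row, so it is the right tool for per-group shares or z-scores.

import pandas as pd

sales = pd.DataFrame({
    "region": ["NA", "EU", "NA", "EU"],
    "revenue": [120, 80, 60, 90],
})

# transform keeps the original row count (one value per row),
# unlike agg, which collapses each group to a single row.
sales["region_total"] = sales.groupby("region")["revenue"].transform("sum")
sales["revenue_share"] = sales["revenue"] / sales["region_total"]
print(sales)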

2) Cleaning missing values and fixing types

import pandas as pd
raw = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": ["25", None, " 31 ", "unknown"],
    "joined": ["2023-10-01", "2023-11-15", None, "2024/01/03"]
})

# Clean age
raw["age"] = (raw["age"].replace("unknown", pd.NA)
                           .str.strip()
                           .pipe(pd.to_numeric, errors="coerce"))

# Parse dates; the formats are mixed ("-" and "/"), so in pandas 2.x the
# non-matching "2024/01/03" may be coerced to NaT unless you pass format="mixed"
raw["joined"] = pd.to_datetime(raw["joined"], errors="coerce")

# Fill missing age with median, drop rows with no join date
raw["age"] = raw["age"].fillna(raw["age"].median())
clean = raw.dropna(subset=["joined"]) 
print(clean)

What you get: robust cleaning using to_numeric, to_datetime, fillna, and dropna.

3) Customer retention cohort

import pandas as pd

orders = pd.DataFrame({
    "user_id": [1,1,2,2,3,3,3,4],
    "order_date": pd.to_datetime(["2024-01-10","2024-02-05","2024-01-20","2024-03-01","2024-02-12","2024-03-02","2024-04-01","2024-02-25"]) 
})

orders["cohort"] = orders.groupby("user_id")["order_date"].transform("min").dt.to_period("M")
orders["month"]  = orders["order_date"].dt.to_period("M")

ret_table = (orders.assign(lag=(orders["month"] - orders["cohort"]).apply(lambda x: x.n))
    .groupby(["cohort","lag"], as_index=False)
    .agg(users=("user_id", pd.Series.nunique))
    .pivot_table(index="cohort", columns="lag", values="users", fill_value=0)
)
print(ret_table)

What you get: a cohort table showing user repeat behavior over months.

4) Joining product info and reshaping for a dashboard

import pandas as pd

orders = pd.DataFrame({"order_id":[1,2,3,4], "product_id":[10,10,11,12], "qty":[2,1,3,5]})
products = pd.DataFrame({"product_id":[10,11,12], "category":["A","B","A"], "price":[50,80,20]})

joined = orders.merge(products, on="product_id", how="left")
joined["revenue"] = joined["qty"] * joined["price"]

pivot = joined.pivot_table(index="category", values=["qty","revenue"], aggfunc="sum")
print(pivot)

What you get: tidy summaries for categories after a simple merge and pivot.
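
Merges are a common source of silent errors (see the debugging tips below). A hedged sketch of the same join with guardrails, using merge's validate and indicator options:

import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3, 4], "product_id": [10, 10, 11, 12], "qty": [2, 1, 3, 5]})
products = pd.DataFrame({"product_id": [10, 11, 12], "category": ["A", "B", "A"], "price": [50, 80, 20]})

# validate="many_to_one" raises if product_id is not unique in products;
# indicator=True adds a _merge column describing where each row matched.
checked = orders.merge(products, on="product_id", how="left",
                       validate="many_to_one", indicator=True)

assert len(checked) == len(orders)          # a left join should not add rows
assert (checked["_merge"] == "both").all()  # every order found a product
print(checked)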

5) Fast string cleaning at scale

import pandas as pd

names = pd.Series(["  Alice  ", "bob", None, "CHARLIE "])
clean = (names
    .fillna("")
    .str.strip()
    .str.title()
)
print(clean)

What you get: vectorized string operations for speed and clarity.

6) Time-series resampling

import pandas as pd
rng = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.DataFrame({"timestamp": rng, "value": [1,2,3,6,2,5]}).set_index("timestamp")
weekly = ts.resample("W").sum()
print(weekly)

What you get: cohesive weekly totals from daily data.
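
Resampling often runs into timezone questions (also covered in the debugging tips below). A minimal sketch of localizing a naive index and converting it before aggregating; the Europe/Berlin zone is just an example:

import pandas as pd

rng = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.DataFrame({"value": [1, 2, 3, 6, 2, 5]}, index=rng)

# Make the naive index timezone-aware, then convert it for local reporting.
ts_utc = ts.tz_localize("UTC")
ts_local = ts_utc.tz_convert("Europe/Berlin")
print(ts_local.resample("W").sum())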

Drills and exercises

  • Load a CSV, inspect head(), info(), describe(), and value_counts() on a key column.
  • Filter rows by multiple conditions (AND/OR), then sort by two columns.
  • Convert a messy numeric column from strings (with commas) to numeric and fill missing values.
  • Create two derived columns using assign and where for conditional logic.
  • Group by two keys and compute mean, median, and count in one go; add a transform-based z-score.
  • Merge two tables with a left join; validate the row count and check for unexpected nulls.
  • Reshape a long table to wide with pivot_table, then back to long with melt.
  • Profile memory usage, convert to category where appropriate, and measure improvement.
  • Process a 2M-row CSV in chunks; compute a running aggregation and write a clean output (see the chunking sketch below).
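
For the last drill, a minimal chunking sketch. It assumes a large orders.csv with region and revenue columns (placeholder names) and keeps a running total per region without loading the whole file:

import pandas as pd

# Stream the file in 100k-row chunks and accumulate revenue per region.
totals = {}
for chunk in pd.read_csv("orders.csv", chunksize=100_000):
    chunk_sums = chunk.groupby("region")["revenue"].sum()
    for region, value in chunk_sums.items():
        totals[region] = totals.get(region, 0) + value

result = pd.Series(totals, name="revenue").sort_values(ascending=False)
result.to_csv("revenue_by_region.csv")
print(result)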

Common mistakes and debugging tips

  • SettingWithCopyWarning: Happens when modifying a view. Use .loc and assign back: df.loc[mask, "col"] = value (see the sketch after this list).
  • Unexpected NaNs after merge: Keys don’t match exactly (types/whitespace). Normalize types and strip strings before joining.
  • Slow loops: Prefer vectorized ops or .map/.apply on Series; avoid row-wise loops where possible.
  • Timezone confusion: Make times timezone-aware early (dt.tz_localize/dt.tz_convert) and be consistent.
  • Wrong groupby results: Confirm group columns are the expected dtype (e.g., category) and check as_index parameter.
  • Memory blowups: Downcast numerics (to_numeric with downcast), convert to category, or read/process in chunks.
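
A short sketch of the first and last tips above, using a tiny made-up frame: a single .loc assignment instead of chained indexing, plus numeric downcasting and the category dtype to shrink memory.

import pandas as pd

df = pd.DataFrame({
    "status": ["active", "active", "churned", "active"],
    "price": ["120.5", "80", "60", "200"],
})

# Downcast the numeric column, then assign conditionally with one .loc call,
# which avoids the chained indexing that triggers SettingWithCopyWarning.
df["price"] = pd.to_numeric(df["price"], downcast="float")
df.loc[df["price"] > 100, "tier"] = "premium"

# Memory: category dtype pays off for low-cardinality string columns.
before = df.memory_usage(deep=True).sum()
df["status"] = df["status"].astype("category")
after = df.memory_usage(deep=True).sum()
print(before, after)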

Quick debugging mini-tasks

  • Print df.columns.tolist() to confirm exact column names after reading files.
  • Use df.sample(5, random_state=0) to inspect representative rows.
  • Validate invariants with assertions: assert df["price"].ge(0).all()
  • Use .pipe() to chain steps and print shapes after each transformation (see the sketch below).
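
The last mini-task might look like the sketch below; log_shape is a hypothetical helper, not a pandas function.

import pandas as pd

def log_shape(df, label):
    # Hypothetical helper: print the shape at each step and return df unchanged.
    print(f"{label}: {df.shape}")
    return df

raw = pd.DataFrame({"price": [10, -5, 30, None], "qty": [1, 2, 3, 4]})

clean = (raw
    .pipe(log_shape, "loaded")
    .dropna(subset=["price"])
    .pipe(log_shape, "after dropna")
    .query("price >= 0")
    .pipe(log_shape, "after filter")
)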

Mini project: Weekly revenue monitor

Goal: Build a reproducible pipeline that loads orders, enriches with product data, cleans fields, and outputs a weekly revenue and order count report.

Steps
  1. Read orders (CSV) and products (CSV/Excel); assert required columns exist.
  2. Clean: parse dates, fix numeric types; fill missing product prices with median (and flag).
  3. Join: left-join products; compute revenue and a discounted_revenue column (where if category == "A" apply 5% discount).
  4. Aggregate: resample weekly; compute orders, revenue, discounted_revenue; add WoW percent change.
  5. Validate: no negative revenues; weeks are non-empty; export to CSV and Parquet.
  6. Plot: basic line plot of weekly revenue using pandas .plot(); a short plotting sketch follows the pipeline code below.

import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"]) 
products = pd.read_csv("products.csv")

required_o = {"order_id","product_id","qty","order_date"}
required_p = {"product_id","price","category"}
assert required_o.issubset(orders.columns)
assert required_p.issubset(products.columns)

orders["qty"] = pd.to_numeric(orders["qty"], errors="coerce").fillna(0).astype(int)
products["price"] = pd.to_numeric(products["price"], errors="coerce")
products["price"] = products["price"].fillna(products["price"].median())

joined = orders.merge(products, on="product_id", how="left")
joined["revenue"] = joined["qty"] * joined["price"]
joined["discounted_revenue"] = joined["revenue"].where(joined["category"] != "A", joined["revenue"] * 0.95)

weekly = (joined.set_index("order_date")
    .resample("W")
    .agg(orders=("order_id","count"), revenue=("revenue","sum"), disc_revenue=("discounted_revenue","sum"))
)
weekly["wow_revenue_pct"] = weekly["revenue"].pct_change() * 100
assert (weekly[["revenue","disc_revenue"]] >= 0).all().all()

weekly.to_csv("weekly_report.csv", index=True)
weekly.to_parquet("weekly_report.parquet")

print(weekly.tail())
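
Step 6 (plotting) is not part of the pipeline above. A minimal sketch that continues from the weekly frame and assumes matplotlib is installed:

import matplotlib.pyplot as plt

# Basic line chart of weekly revenue from the pipeline above.
ax = weekly["revenue"].plot(title="Weekly revenue")
ax.set_xlabel("Week")
ax.set_ylabel("Revenue")
plt.tight_layout()
plt.savefig("weekly_revenue.png")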

Subskills

Explore focused subskills to master pandas efficiently. Open each, practice, and return for the exam.

  • Installing and Environment Setup: Create a virtual env, install pandas, set up Jupyter/VS Code; run a quick import test.
  • Reading and Writing Data (CSV, Excel, JSON, Parquet, SQL): Load multiple formats, handle encodings, and export cleaned results.
  • Understanding DataFrame and Series Basics: Navigate core objects, indexes, and fundamental methods.
  • Indexing and Selection (loc/iloc/at/iat): Select rows/cols safely and efficiently.
  • Boolean Filtering and query: Build clear, chainable row filters.
  • Sorting and Ranking: Order data and compute dense/rank methods.
  • Handling Missing Values (isna/fillna/dropna/interpolation): Diagnose and treat missingness confidently.
  • Data Types and Type Conversion (astype/to_numeric/to_datetime): Ensure correct dtypes for accurate calculations.
  • String Operations with .str accessor: Clean text at scale with vectorized functions.
  • Datetime Operations, Timezones, Resampling: Parse, localize/convert, and aggregate time-series.
  • Creating and Transforming Columns (assign/where): Build readable, testable transformation steps.
  • apply/map/applymap, Vectorization Basics: Choose the right tool for speed and clarity.
  • groupby Aggregations, Multiple Aggregations, transform, filter: Create KPIs and cohort-like metrics.
  • Window (rolling/expanding/ewm): Compute moving averages and trends.
  • Merge/Join, concat, append patterns: Combine datasets without surprises.
  • Reshaping (melt/pivot/pivot_table/stack/unstack): Switch between long and wide formats.
  • Working with MultiIndex: Manage hierarchical indexes for complex analyses.
  • Handling Categorical Data and category dtype: Optimize memory and enforce valid values.
  • Duplicates, unique, nunique: Detect and resolve duplication.
  • Binning with cut/qcut: Create buckets for segmentation.
  • value_counts, crosstab: Quick distributions and contingency tables.
  • Working with Lists and Dicts (explode, json_normalize): Tame nested/array data.
  • Working with Missing and Nullable dtypes: Use pandas' nullable types for consistency.
  • Memory Usage and Optimization: Fit more data into RAM safely.
  • Performance Profiling and Speedups: Measure and improve runtimes.
  • Chunking and Streaming Large Files: Process big data incrementally.
  • Data Validation with Assertions: Catch issues early with checks.
  • Exporting Clean Outputs and Reproducible Pipelines: Deliver reliable artifacts and pipelines.
  • Basic Plotting with pandas: Create quick visuals for reports.
  • Interop with NumPy and Matplotlib: Extend pandas with scientific Python.

Next steps

  • Work through the subskills above and complete the drills.
  • Build the mini project and save your pipeline.
  • Take the skill exam. Anyone can attempt; logged-in users get progress saving.

Python pandas — Skill Exam

Test your pandas skills for free: 10 questions, 70% to pass. Your progress and best score are saved if you are logged in; otherwise, results are local to this session.

  • Open-book: you may use your notes.
  • No time limit; complete in one or multiple sittings.
  • Code questions check your understanding of outputs and correct usage.
