Why pandas matters for Data Analysts
Pandas is the backbone of day-to-day analytics in Python. It lets you load messy real-world data, clean it quickly, explore patterns, and produce trustworthy tables, charts, and reports for stakeholders. As a Data Analyst, pandas speeds up everyday tasks: importing CSV/Excel/SQL data, fixing missing values, reshaping tables for reporting, aggregating KPIs, and exporting clean results for dashboards.
- Typical tasks it unlocks: quick EDA, KPI computation, cohort/retention tables, A/B summaries, funnel steps, time-series rollups, feature preparation for modeling, and reproducible data prep pipelines.
- Time-to-value: minutes to load, clean, and summarize real data—ideal for ad-hoc questions.
Who this is for
- Aspiring and junior Data Analysts who want a practical, job-ready Python data workflow.
- Excel/SQL users transitioning to Python for automation and larger datasets.
- Analysts who need reproducible, documented data prep pipelines.
Prerequisites
- Basic Python (variables, functions, lists/dicts) and comfort with the terminal.
- Familiarity with tabular data (rows/columns) and basic statistics (mean, median).
- Nice-to-have: basic SQL and Excel; you'll map those habits to pandas quickly.
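If you already think in SQL, most habits carry over directly. As a rough sketch (the orders data here is made up for illustration), a SELECT ... WHERE ... GROUP BY becomes a filter plus groupby:
import pandas as pd
# Hypothetical stand-in for a SQL table or Excel sheet
orders = pd.DataFrame({"region": ["NA", "EU", "NA", "EU"], "revenue": [120, 80, 60, 90]})
# SQL: SELECT region, SUM(revenue) AS total FROM orders WHERE revenue > 50 GROUP BY region
total = (orders[orders["revenue"] > 50]
    .groupby("region", as_index=False)
    .agg(total=("revenue", "sum")))
print(total)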
Learning path (pandas roadmap)
- Setup: Install Python, pandas, and a notebook environment (e.g., Jupyter). Learn how to create and activate a virtual environment.
- Data I/O: Read and write CSV, Excel, JSON, Parquet; connect to SQL (a minimal I/O sketch follows this list).
- Core objects: Series and DataFrame basics; understand indexes.
- Select/filter: loc/iloc, boolean masks, sorting, ranking.
- Clean: Handle missing values, types, strings, datetimes, categories.
- Transform: Create columns, assign/where, apply/map/vectorize.
- Aggregate: groupby, multiple aggregations, transform/filter.
- Reshape/join: merge, concat, pivot, melt, MultiIndex.
- Performance: memory, profiling, chunking, speed-ups.
- Outputs: validation, tidy exports, reproducible pipelines, basic plots.
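A minimal sketch of the Data I/O step, assuming hypothetical file names and that the optional engines are installed (openpyxl for Excel, pyarrow for Parquet):
import pandas as pd
# Read a CSV (hypothetical file), parsing dates on the way in
df = pd.read_csv("orders.csv", parse_dates=["order_date"])
# Write the same table back out in other formats
df.to_parquet("orders.parquet", index=False)  # requires pyarrow or fastparquet
df.to_excel("orders.xlsx", index=False)       # requires openpyxl
df.to_json("orders.json", orient="records")
# Parquet round-trips dtypes (including datetimes); CSV does not
print(pd.read_parquet("orders.parquet").dtypes)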
Milestone checklist
- Load 3 formats (CSV, Excel, Parquet) and write outputs.
- Use loc/iloc, boolean masks, sort_values, and rank.
- Clean missing values with fillna/dropna and type conversions.
- Do a groupby with multiple aggregations and a transform step.
- Join two tables and reshape with pivot_table.
- Validate expectations and export a final report.
Worked examples
1) Quick sales summary by region and month
import pandas as pd
sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "region": ["NA", "EU", "NA", "APAC", "EU", "NA"],
    "order_date": pd.to_datetime(["2024-01-03", "2024-01-18", "2024-02-02", "2024-02-12", "2024-02-25", "2024-03-01"]),
    "revenue": [120, 80, 60, 200, 90, 130]
})
sales["month"] = sales["order_date"].dt.to_period("M").dt.to_timestamp()
summary = (sales
    .groupby(["region", "month"], as_index=False)
    .agg(revenue_sum=("revenue", "sum"), orders=("order_id", "count"))
)
print(summary)
What you get: monthly revenue and order counts per region for reporting.
2) Cleaning missing values and fixing types
import pandas as pd
raw = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": ["25", None, " 31 ", "unknown"],
    "joined": ["2023-10-01", "2023-11-15", None, "2024/01/03"]
})
# Clean age
raw["age"] = (raw["age"]
    .replace("unknown", pd.NA)
    .str.strip()
    .pipe(pd.to_numeric, errors="coerce"))
# Parse dates
raw["joined"] = pd.to_datetime(raw["joined"], errors="coerce")
# Fill missing age with median, drop rows with no join date
raw["age"] = raw["age"].fillna(raw["age"].median())
clean = raw.dropna(subset=["joined"])
print(clean)
What you get: robust cleaning using to_numeric, to_datetime, fillna, and dropna.
3) Customer retention cohort
import pandas as pd
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3, 4],
    "order_date": pd.to_datetime(["2024-01-10", "2024-02-05", "2024-01-20", "2024-03-01", "2024-02-12", "2024-03-02", "2024-04-01", "2024-02-25"])
})
orders["cohort"] = orders.groupby("user_id")["order_date"].transform("min").dt.to_period("M")
orders["month"] = orders["order_date"].dt.to_period("M")
ret_table = (orders
    .assign(lag=(orders["month"] - orders["cohort"]).apply(lambda x: x.n))
    .groupby(["cohort", "lag"], as_index=False)
    .agg(users=("user_id", "nunique"))
    .pivot_table(index="cohort", columns="lag", values="users", fill_value=0)
)
print(ret_table)
What you get: a cohort table showing user repeat behavior over months.
4) Joining product info and reshaping for a dashboard
import pandas as pd
orders = pd.DataFrame({"order_id":[1,2,3,4], "product_id":[10,10,11,12], "qty":[2,1,3,5]})
products = pd.DataFrame({"product_id":[10,11,12], "category":["A","B","A"], "price":[50,80,20]})
joined = orders.merge(products, on="product_id", how="left")
joined["revenue"] = joined["qty"] * joined["price"]
pivot = joined.pivot_table(index="category", values=["qty","revenue"], aggfunc="sum")
print(pivot)
What you get: tidy summaries for categories after a simple merge and pivot.
5) Fast string cleaning at scale
import pandas as pd
names = pd.Series([" Alice ", "bob", None, "CHARLIE "])
clean = (names
    .fillna("")
    .str.strip()
    .str.title()
)
print(clean)
What you get: vectorized string operations for speed and clarity.
6) Time-series resampling
import pandas as pd
rng = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.DataFrame({"timestamp": rng, "value": [1,2,3,6,2,5]}).set_index("timestamp")
weekly = ts.resample("W").sum()
print(weekly)
What you get: weekly totals rolled up from daily data.
Drills and exercises
- Load a CSV, inspect head(), info(), describe(), and value_counts() on a key column.
- Filter rows by multiple conditions (AND/OR), then sort by two columns.
- Convert a messy numeric column from strings (with commas) to numeric and fill missing values.
- Create two derived columns using assign and where for conditional logic.
- Group by two keys and compute mean, median, and count in one go; add a transform-based z-score.
- Merge two tables with a left join; validate the row count and check for unexpected nulls.
- Reshape a long table to wide with pivot_table, then back to long with melt.
- Profile memory usage, convert to category where appropriate, and measure improvement.
- Process a 2M-row CSV in chunks; compute a running aggregation and write a clean output.
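For the chunking drill, a minimal sketch, assuming a hypothetical big_orders.csv with region and revenue columns:
import pandas as pd
# Accumulate per-region revenue without loading the whole file into memory
totals = {}
for chunk in pd.read_csv("big_orders.csv", chunksize=100_000):
    chunk_sum = chunk.groupby("region")["revenue"].sum()
    for region, value in chunk_sum.items():
        totals[region] = totals.get(region, 0) + value
result = pd.Series(totals, name="revenue").rename_axis("region").reset_index()
result.to_csv("revenue_by_region.csv", index=False)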
Common mistakes and debugging tips
- SettingWithCopyWarning: Happens when modifying a view. Use .loc and assign back: df.loc[mask, "col"] = value.
- Unexpected NaNs after merge: Keys don’t match exactly (types/whitespace). Normalize types and strip strings before joining, as in the sketch after this list.
- Slow loops: Prefer vectorized ops or .map/.apply on Series; avoid row-wise loops where possible.
- Timezone confusion: Make times timezone-aware early (dt.tz_localize/dt.tz_convert) and be consistent.
- Wrong groupby results: Confirm group columns have the expected dtype (grouping on category columns can surface unused levels) and check the as_index parameter.
- Memory blowups: Downcast numerics (to_numeric with downcast), convert to category, or read/process in chunks.
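A minimal sketch of the merge-key fix, using made-up data where one key column holds strings with stray whitespace and the other holds integers:
import pandas as pd
# Keys differ in dtype and whitespace, so a naive merge would produce NaNs
orders = pd.DataFrame({"product_id": ["10 ", "11", "12"], "qty": [2, 3, 5]})
products = pd.DataFrame({"product_id": [10, 11, 12], "price": [50, 80, 20]})
# Normalize both keys to the same dtype before joining
orders["product_id"] = orders["product_id"].str.strip().astype(int)
joined = orders.merge(products, on="product_id", how="left")
assert joined["price"].notna().all()  # every order matched a product
print(joined)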
Quick debugging mini-tasks
- Print df.columns.tolist() to confirm exact column names after reading files.
- Use df.sample(5, random_state=0) to inspect representative rows.
- Validate invariants with assertions: assert df["price"].ge(0).all()
- Use .pipe() to chain steps and print shapes after each transformation.
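One way to do this (log_shape is a hypothetical helper, not a pandas function):
import pandas as pd
# Print the shape at a named step, then return the frame unchanged
def log_shape(df, step):
    print(f"{step}: {df.shape}")
    return df

raw = pd.DataFrame({"region": ["NA", "EU", None], "revenue": [120, 80, 60]})
clean = (raw
    .pipe(log_shape, "loaded")
    .dropna(subset=["region"])
    .pipe(log_shape, "after dropna")
    .query("revenue > 70")
    .pipe(log_shape, "after filter")
)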
Mini project: Weekly revenue monitor
Goal: Build a reproducible pipeline that loads orders, enriches with product data, cleans fields, and outputs a weekly revenue and order count report.
Steps
- Read orders (CSV) and products (CSV/Excel); assert required columns exist.
- Clean: parse dates and fix numeric types; fill missing product prices with the median (optionally add a flag column to mark imputed values).
- Join: left-join products; compute revenue and a discounted_revenue column that applies a 5% discount where category == "A".
- Aggregate: resample to weekly totals; compute orders, revenue, and discounted_revenue; add a week-over-week (WoW) percent change.
- Validate: no negative revenues; weeks are non-empty; export to CSV and Parquet.
- Plot: basic line plot of weekly revenue using pandas .plot().
import pandas as pd
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
products = pd.read_csv("products.csv")
required_o = {"order_id","product_id","qty","order_date"}
required_p = {"product_id","price","category"}
assert required_o.issubset(orders.columns)
assert required_p.issubset(products.columns)
orders["qty"] = pd.to_numeric(orders["qty"], errors="coerce").fillna(0).astype(int)
products["price"] = pd.to_numeric(products["price"], errors="coerce")
products["price"] = products["price"].fillna(products["price"].median())
joined = orders.merge(products, on="product_id", how="left")
joined["revenue"] = joined["qty"] * joined["price"]
joined["discounted_revenue"] = joined["revenue"].where(joined["category"] != "A", joined["revenue"] * 0.95)
weekly = (joined
    .set_index("order_date")
    .resample("W")
    .agg({"order_id": "count", "revenue": "sum", "discounted_revenue": "sum"})
    .rename(columns={"order_id": "orders", "discounted_revenue": "disc_revenue"})
)
weekly["wow_revenue_pct"] = weekly["revenue"].pct_change() * 100
assert (weekly[["revenue","disc_revenue"]] >= 0).all().all()
weekly.to_csv("weekly_report.csv", index=True)
weekly.to_parquet("weekly_report.parquet")
print(weekly.tail())
Subskills
Explore focused subskills to master pandas efficiently. Open each, practice, and return for the exam.
- Installing and Environment Setup: Create a virtual env, install pandas, set up Jupyter/VS Code; run a quick import test.
- Reading and Writing Data (CSV, Excel, JSON, Parquet, SQL): Load multiple formats, handle encodings, and export cleaned results.
- Understanding DataFrame and Series Basics: Navigate core objects, indexes, and fundamental methods.
- Indexing and Selection (loc/iloc/at/iat): Select rows/cols safely and efficiently.
- Boolean Filtering and query: Build clear, chainable row filters.
- Sorting and Ranking: Order data with sort_values and compute ranks with rank (including method options like "dense").
- Handling Missing Values (isna/fillna/dropna/interpolation): Diagnose and treat missingness confidently.
- Data Types and Type Conversion (astype/to_numeric/to_datetime): Ensure correct dtypes for accurate calculations.
- String Operations with .str accessor: Clean text at scale with vectorized functions.
- Datetime Operations, Timezones, Resampling: Parse, localize/convert, and aggregate time-series.
- Creating and Transforming Columns (assign/where): Build readable, testable transformation steps.
- apply/map/applymap, Vectorization Basics: Choose the right tool for speed and clarity.
- groupby Aggregations, Multiple Aggregations, transform, filter: Create KPIs and cohort-like metrics.
- Window (rolling/expanding/ewm): Compute moving averages and trends (see the rolling sketch after this list).
- Merge/Join, concat, append patterns: Combine datasets without surprises.
- Reshaping (melt/pivot/pivot_table/stack/unstack): Switch between long and wide formats.
- Working with MultiIndex: Manage hierarchical indexes for complex analyses.
- Handling Categorical Data and category dtype: Optimize memory and enforce valid values.
- Duplicates, unique, nunique: Detect and resolve duplication.
- Binning with cut/qcut: Create buckets for segmentation.
- value_counts, crosstab: Quick distributions and contingency tables.
- Working with Lists and Dicts (explode, json_normalize): Tame nested/array data.
- Working with Missing and Nullable dtypes: Use pandas' nullable types for consistency.
- Memory Usage and Optimization: Fit more data into RAM safely.
- Performance Profiling and Speedups: Measure and improve runtimes.
- Chunking and Streaming Large Files: Process big data incrementally.
- Data Validation with Assertions: Catch issues early with checks.
- Exporting Clean Outputs and Reproducible Pipelines: Deliver reliable artifacts and pipelines.
- Basic Plotting with pandas: Create quick visuals for reports.
- Interop with NumPy and Matplotlib: Extend pandas with scientific Python.
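As a small sketch of the window subskill, using made-up daily values and a 3-day rolling mean:
import pandas as pd
# Hypothetical daily values; a 3-day rolling mean smooths short-term noise
ts = pd.Series([10, 12, 9, 15, 14, 20],
               index=pd.date_range("2024-01-01", periods=6, freq="D"))
print(ts.rolling(window=3, min_periods=1).mean())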
Next steps
- Work through the subskills below and complete the drills.
- Build the mini project and save your pipeline.
- Take the skill exam. Anyone can attempt; logged-in users get progress saving.