Why pandas matters for Data Analysts
Pandas is the backbone of day-to-day analytics in Python. It lets you load messy real-world data, clean it quickly, explore patterns, and produce trustworthy tables, charts, and reports for stakeholders. As a Data Analyst, pandas speeds up everyday tasks: importing CSV/Excel/SQL data, fixing missing values, reshaping tables for reporting, aggregating KPIs, and exporting clean results for dashboards.
- Typical tasks it unlocks: quick EDA, KPI computation, cohort/retention tables, A/B summaries, funnel steps, time-series rollups, feature preparation for modeling, and reproducible data prep pipelines.
- Time-to-value: minutes to load, clean, and summarize real data—ideal for ad-hoc questions.
Who this is for
- Aspiring and junior Data Analysts who want a practical, job-ready Python data workflow.
- Excel/SQL users transitioning to Python for automation and larger datasets.
- Analysts who need reproducible, documented data prep pipelines.
Prerequisites
- Basic Python (variables, functions, lists/dicts) and comfort with the terminal.
- Familiarity with tabular data (rows/columns) and basic statistics (mean, median).
- Nice-to-have: basic SQL and Excel; you'll map those habits to pandas quickly.
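If you already think in SQL, most habits carry over directly. As a rough sketch (the orders data here is made up for illustration), a SELECT ... WHERE ... GROUP BY becomes a filter plus groupby:
import pandas as pd
# Hypothetical stand-in for a SQL table or Excel sheet
orders = pd.DataFrame({"region": ["NA", "EU", "NA", "EU"], "revenue": [120, 80, 60, 90]})
# SQL: SELECT region, SUM(revenue) AS total FROM orders WHERE revenue > 50 GROUP BY region
total = (orders[orders["revenue"] > 50]
    .groupby("region", as_index=False)
    .agg(total=("revenue", "sum")))
print(total)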
Learning path (pandas roadmap)
- Setup: Install Python, pandas, and a notebook environment (e.g., Jupyter). Learn how to create and activate a virtual environment.
- Data I/O: Read and write CSV, Excel, JSON, Parquet; connect to SQL (a minimal I/O sketch follows this list).
- Core objects: Series and DataFrame basics; understand indexes.
- Select/filter: loc/iloc, boolean masks, sorting, ranking.
- Clean: Handle missing values, types, strings, datetimes, categories.
- Transform: Create columns, assign/where, apply/map/vectorize.
- Aggregate: groupby, multiple aggregations, transform/filter.
- Reshape/join: merge, concat, pivot, melt, MultiIndex.
- Performance: memory, profiling, chunking, speed-ups.
- Outputs: validation, tidy exports, reproducible pipelines, basic plots.
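A minimal sketch of the Data I/O step, assuming hypothetical file names and that the optional engines are installed (openpyxl for Excel, pyarrow for Parquet):
import pandas as pd
# Read a CSV (hypothetical file), parsing dates on the way in
df = pd.read_csv("orders.csv", parse_dates=["order_date"])
# Write the same table back out in other formats
df.to_parquet("orders.parquet", index=False)  # requires pyarrow or fastparquet
df.to_excel("orders.xlsx", index=False)       # requires openpyxl
df.to_json("orders.json", orient="records")
# Parquet round-trips dtypes (including datetimes); CSV does not
print(pd.read_parquet("orders.parquet").dtypes)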
Milestone checklist
- Load 3 formats (CSV, Excel, Parquet) and write outputs.
- Use loc/iloc, boolean masks, sort_values, and rank.
- Clean missing values with fillna/dropna and type conversions.
- Do a groupby with multiple aggregations and a transform step.
- Join two tables and reshape with pivot_table.
- Validate expectations and export a final report.
Worked examples
1) Quick sales summary by region and month
import pandas as pd
sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "region": ["NA", "EU", "NA", "APAC", "EU", "NA"],
    "order_date": pd.to_datetime(["2024-01-03", "2024-01-18", "2024-02-02", "2024-02-12", "2024-02-25", "2024-03-01"]),
    "revenue": [120, 80, 60, 200, 90, 130]
})
sales["month"] = sales["order_date"].dt.to_period("M").dt.to_timestamp()
summary = (sales
    .groupby(["region", "month"], as_index=False)
    .agg(revenue_sum=("revenue", "sum"), orders=("order_id", "count"))
)
print(summary)
What you get: monthly revenue and order counts per region for reporting.
2) Cleaning missing values and fixing types
import pandas as pd
raw = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": ["25", None, " 31 ", "unknown"],
    "joined": ["2023-10-01", "2023-11-15", None, "2024/01/03"]
})
# Clean age
raw["age"] = (raw["age"]
    .replace("unknown", pd.NA)
    .str.strip()
    .pipe(pd.to_numeric, errors="coerce"))
# Parse dates
raw["joined"] = pd.to_datetime(raw["joined"], errors="coerce")
# Fill missing age with median, drop rows with no join date
raw["age"] = raw["age"].fillna(raw["age"].median())
clean = raw.dropna(subset=["joined"])
print(clean)
What you get: robust cleaning using to_numeric, to_datetime, fillna, and dropna.
3) Customer retention cohort
import pandas as pd
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3, 4],
    "order_date": pd.to_datetime(["2024-01-10", "2024-02-05", "2024-01-20", "2024-03-01", "2024-02-12", "2024-03-02", "2024-04-01", "2024-02-25"])
})
orders["cohort"] = orders.groupby("user_id")["order_date"].transform("min").dt.to_period("M")
orders["month"] = orders["order_date"].dt.to_period("M")
ret_table = (orders
    .assign(lag=(orders["month"] - orders["cohort"]).apply(lambda x: x.n))
    .groupby(["cohort", "lag"], as_index=False)
    .agg(users=("user_id", "nunique"))
    .pivot_table(index="cohort", columns="lag", values="users", fill_value=0)
)
print(ret_table)
What you get: a cohort table showing user repeat behavior over months.
4) Joining product info and reshaping for a dashboard
import pandas as pd
orders = pd.DataFrame({"order_id":[1,2,3,4], "product_id":[10,10,11,12], "qty":[2,1,3,5]})
products = pd.DataFrame({"product_id":[10,11,12], "category":["A","B","A"], "price":[50,80,20]})
joined = orders.merge(products, on="product_id", how="left")
joined["revenue"] = joined["qty"] * joined["price"]
pivot = joined.pivot_table(index="category", values=["qty","revenue"], aggfunc="sum")
print(pivot)
What you get: tidy summaries for categories after a simple merge and pivot.
5) Fast string cleaning at scale
import pandas as pd
names = pd.Series([" Alice ", "bob", None, "CHARLIE "])
clean = (names
    .fillna("")
    .str.strip()
    .str.title()
)
print(clean)
What you get: vectorized string operations for speed and clarity.
6) Time-series resampling
import pandas as pd
rng = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.DataFrame({"timestamp": rng, "value": [1,2,3,6,2,5]}).set_index("timestamp")
weekly = ts.resample("W").sum()
print(weekly)
What you get: weekly totals rolled up from daily data.
Drills and exercises
- Load a CSV, inspect head(), info(), describe(), and value_counts() on a key column.
- Filter rows by multiple conditions (AND/OR), then sort by two columns.
- Convert a messy numeric column from strings (with commas) to numeric and fill missing values.
- Create two derived columns using assign and where for conditional logic.
- Group by two keys and compute mean, median, and count in one go; add a transform-based z-score.
- Merge two tables with a left join; validate the row count and check for unexpected nulls.
- Reshape a long table to wide with pivot_table, then back to long with melt.
- Profile memory usage, convert to category where appropriate, and measure improvement.
- Process a 2M-row CSV in chunks; compute a running aggregation and write a clean output.
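For the chunking drill, a minimal sketch, assuming a hypothetical big_orders.csv with region and revenue columns:
import pandas as pd
# Accumulate per-region revenue without loading the whole file into memory
totals = {}
for chunk in pd.read_csv("big_orders.csv", chunksize=100_000):
    chunk_sum = chunk.groupby("region")["revenue"].sum()
    for region, value in chunk_sum.items():
        totals[region] = totals.get(region, 0) + value
result = pd.Series(totals, name="revenue").rename_axis("region").reset_index()
result.to_csv("revenue_by_region.csv", index=False)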
Common mistakes and debugging tips
- SettingWithCopyWarning: Happens when modifying a view. Use .loc and assign back: df.loc[mask, "col"] = value.
- Unexpected NaNs after merge: Keys don’t match exactly (types/whitespace). Normalize types and strip strings before joining, as in the sketch after this list.
- Slow loops: Prefer vectorized ops or .map/.apply on Series; avoid row-wise loops where possible.
- Timezone confusion: Make times timezone-aware early (dt.tz_localize/dt.tz_convert) and be consistent.
- Wrong groupby results: Confirm group columns have the expected dtype (grouping on category columns can surface unused levels) and check the as_index parameter.
- Memory blowups: Downcast numerics (to_numeric with downcast), convert to category, or read/process in chunks.
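A minimal sketch of the merge-key fix, using made-up data where one key column holds strings with stray whitespace and the other holds integers:
import pandas as pd
# Keys differ in dtype and whitespace, so a naive merge would produce NaNs
orders = pd.DataFrame({"product_id": ["10 ", "11", "12"], "qty": [2, 3, 5]})
products = pd.DataFrame({"product_id": [10, 11, 12], "price": [50, 80, 20]})
# Normalize both keys to the same dtype before joining
orders["product_id"] = orders["product_id"].str.strip().astype(int)
joined = orders.merge(products, on="product_id", how="left")
assert joined["price"].notna().all()  # every order matched a product
print(joined)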
Quick debugging mini-tasks
- Print df.columns.tolist() to confirm exact column names after reading files.
- Use df.sample(5, random_state=0) to inspect representative rows.
- Validate invariants with assertions: assert df["price"].ge(0).all()
- Use .pipe() to chain steps and print shapes after each transformation.
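One way to do this (log_shape is a hypothetical helper, not a pandas function):
import pandas as pd
# Print the shape at a named step, then return the frame unchanged
def log_shape(df, step):
    print(f"{step}: {df.shape}")
    return df

raw = pd.DataFrame({"region": ["NA", "EU", None], "revenue": [120, 80, 60]})
clean = (raw
    .pipe(log_shape, "loaded")
    .dropna(subset=["region"])
    .pipe(log_shape, "after dropna")
    .query("revenue > 70")
    .pipe(log_shape, "after filter")
)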
Mini project: Weekly revenue monitor
Goal: Build a reproducible pipeline that loads orders, enriches with product data, cleans fields, and outputs a weekly revenue and order count report.
Steps
- Read orders (CSV) and products (CSV/Excel); assert required columns exist.
- Clean: parse dates and fix numeric types; fill missing product prices with the median (optionally add a flag column to mark imputed values).
- Join: left-join products; compute revenue and a discounted_revenue column that applies a 5% discount where category == "A".
- Aggregate: resample to weekly totals; compute orders, revenue, and discounted_revenue; add a week-over-week (WoW) percent change.
- Validate: no negative revenues; weeks are non-empty; export to CSV and Parquet.
- Plot: basic line plot of weekly revenue using pandas .plot().
import pandas as pd
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
products = pd.read_csv("products.csv")
required_o = {"order_id","product_id","qty","order_date"}
required_p = {"product_id","price","category"}
assert required_o.issubset(orders.columns)
assert required_p.issubset(products.columns)
orders["qty"] = pd.to_numeric(orders["qty"], errors="coerce").fillna(0).astype(int)
products["price"] = pd.to_numeric(products["price"], errors="coerce")
products["price"] = products["price"].fillna(products["price"].median())
joined = orders.merge(products, on="product_id", how="left")
joined["revenue"] = joined["qty"] * joined["price"]
joined["discounted_revenue"] = joined["revenue"].where(joined["category"] != "A", joined["revenue"] * 0.95)
weekly = (joined
    .set_index("order_date")
    .resample("W")
    .agg({"order_id": "count", "revenue": "sum", "discounted_revenue": "sum"})
    .rename(columns={"order_id": "orders", "discounted_revenue": "disc_revenue"})
)
weekly["wow_revenue_pct"] = weekly["revenue"].pct_change() * 100
assert (weekly[["revenue","disc_revenue"]] >= 0).all().all()
weekly.to_csv("weekly_report.csv", index=True)
weekly.to_parquet("weekly_report.parquet")
print(weekly.tail())
Subskills
Explore focused subskills to master pandas efficiently. Open each, practice, and return for the exam.
- Installing and Environment Setup: Create a virtual env, install pandas, set up Jupyter/VS Code; run a quick import test.
- Reading and Writing Data (CSV, Excel, JSON, Parquet, SQL): Load multiple formats, handle encodings, and export cleaned results.
- Understanding DataFrame and Series Basics: Navigate core objects, indexes, and fundamental methods.
- Indexing and Selection (loc/iloc/at/iat): Select rows/cols safely and efficiently.
- Boolean Filtering and query: Build clear, chainable row filters.
- Sorting and Ranking: Order data with sort_values and compute ranks with rank (including method options like "dense").
- Handling Missing Values (isna/fillna/dropna/interpolation): Diagnose and treat missingness confidently.
- Data Types and Type Conversion (astype/to_numeric/to_datetime): Ensure correct dtypes for accurate calculations.
- String Operations with .str accessor: Clean text at scale with vectorized functions.
- Datetime Operations, Timezones, Resampling: Parse, localize/convert, and aggregate time-series.
- Creating and Transforming Columns (assign/where): Build readable, testable transformation steps.
- apply/map/applymap, Vectorization Basics: Choose the right tool for speed and clarity.
- groupby Aggregations, Multiple Aggregations, transform, filter: Create KPIs and cohort-like metrics.
- Window (rolling/expanding/ewm): Compute moving averages and trends (see the rolling sketch after this list).
- Merge/Join, concat, append patterns: Combine datasets without surprises.
- Reshaping (melt/pivot/pivot_table/stack/unstack): Switch between long and wide formats.
- Working with MultiIndex: Manage hierarchical indexes for complex analyses.
- Handling Categorical Data and category dtype: Optimize memory and enforce valid values.
- Duplicates, unique, nunique: Detect and resolve duplication.
- Binning with cut/qcut: Create buckets for segmentation.
- value_counts, crosstab: Quick distributions and contingency tables.
- Working with Lists and Dicts (explode, json_normalize): Tame nested/array data.
- Working with Missing and Nullable dtypes: Use pandas' nullable types for consistency.
- Memory Usage and Optimization: Fit more data into RAM safely.
- Performance Profiling and Speedups: Measure and improve runtimes.
- Chunking and Streaming Large Files: Process big data incrementally.
- Data Validation with Assertions: Catch issues early with checks.
- Exporting Clean Outputs and Reproducible Pipelines: Deliver reliable artifacts and pipelines.
- Basic Plotting with pandas: Create quick visuals for reports.
- Interop with NumPy and Matplotlib: Extend pandas with scientific Python.
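As a small sketch of the window subskill, using made-up daily values and a 3-day rolling mean:
import pandas as pd
# Hypothetical daily values; a 3-day rolling mean smooths short-term noise
ts = pd.Series([10, 12, 9, 15, 14, 20],
               index=pd.date_range("2024-01-01", periods=6, freq="D"))
print(ts.rolling(window=3, min_periods=1).mean())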
Next steps
- Work through the subskills below and complete the drills.
- Build the mini project and save your pipeline.
- Take the skill exam. Anyone can attempt; logged-in users get progress saving.