
Python (pandas and numpy)

Learn Python (pandas and numpy) for Data Scientists for free: roadmap, examples, subskills, and a skill exam.

Published: January 1, 2026 | Updated: January 1, 2026

Why this skill matters for a Data Scientist

Pandas and NumPy are your daily drivers for turning raw data into reliable, model-ready features. They let you ingest files and databases quickly, clean and reshape data, engineer features, and perform vectorized numerical computation at scale. Mastering these tools makes you faster, more accurate, and more reproducible—key expectations for a Data Scientist.

  • Load: Read CSV/Parquet/SQL efficiently with correct dtypes.
  • Wrangle: Filter, join, pivot, group, and aggregate without loops.
  • Compute: Use NumPy broadcasting and vectorization for speed.
  • Time series: Parse dates, resample, and roll windows correctly.
  • Quality: Handle missing data, detect outliers, and avoid chained indexing bugs.
  • Reproducibility: Write reusable helpers and keep notebooks deterministic.

What you will be able to do

  • Load multi-GB datasets with the right dtypes and memory footprint.
  • Transform messy raw data into tidy, analyzable tables.
  • Engineer features for ML (encodings, aggregations, time-based features).
  • Diagnose issues quickly using .info(), .isna(), .value_counts(), and profiling patterns.
  • Write small, tested utility functions you can reuse across notebooks.

Practical roadmap

1) Set up and warm-up (0.5–1 hour)

  • Install Python 3, Jupyter, pandas, numpy. Create a new notebook.
  • Practice reading a small CSV and inspecting .head(), .shape, .dtypes.
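As a warm-up, a minimal sketch (assuming any small CSV saved as data.csv in the working directory):

import pandas as pd

df = pd.read_csv("data.csv")  # any small CSV works here
print(df.head())    # first five rows
print(df.shape)     # (rows, columns)
print(df.dtypes)    # inferred column types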

2) Data loading and typing (1–2 hours)

  • read_csv with parse_dates, dtype, usecols. Save to Parquet for speed.
  • Compare memory via df.memory_usage(deep=True).sum().
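As a sketch of the Parquet round-trip and memory check (assuming a transactions.csv like the one in the worked examples below, and pyarrow or fastparquet installed):

import pandas as pd

df = pd.read_csv("transactions.csv", parse_dates=["ts"])
print("CSV MB:", round(df.memory_usage(deep=True).sum() / 1e6, 2))

# Parquet stores dtypes with the data and re-reads much faster than CSV
df.to_parquet("transactions.parquet")
df2 = pd.read_parquet("transactions.parquet")
print("Dtypes preserved:", (df.dtypes == df2.dtypes).all())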

3) Wrangling and joins (2–3 hours)

  • Filter, sort, rename, assign, pipe. Avoid chained indexing.
  • Use merge, concat, groupby-agg, pivot_table, melt.
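A small self-contained sketch of merge, pivot_table, and melt (hypothetical users/orders frames, not the transactions data used later):

import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "country": ["DE", "FR"]})
orders = pd.DataFrame({
    "user_id": [1, 1, 2],
    "month": ["2024-01", "2024-02", "2024-01"],
    "amount": [10.0, 20.0, 5.0],
})

# Left join keeps every order row, even if the user is unknown
joined = orders.merge(users, on="user_id", how="left")

# Wide layout: one row per user, one column per month
wide = joined.pivot_table(index="user_id", columns="month",
                          values="amount", aggfunc="sum", fill_value=0)

# melt turns the wide table back into tidy long form
tidy = wide.reset_index().melt(id_vars="user_id", var_name="month", value_name="amount")
print(tidy)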

4) NumPy vectorization (1–2 hours)

  • Understand shapes, broadcasting, boolean masks, where, einsum basics.
  • Replace loops with vectorized expressions.
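A minimal broadcasting sketch, standardizing each column of a small array without any loop:

import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Column means/stds have shape (2,); X has shape (3, 2).
# Broadcasting stretches the (2,) vectors across all rows.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma
print(Z.mean(axis=0))  # approximately [0. 0.]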

5) Missing data and time series (1–2 hours)

  • isna, fillna, ffill/bfill per group, dropna pitfalls.
  • DatetimeIndex, resample, rolling, timezone awareness.

6) Features, performance, and reuse (2–3 hours)

  • Feature matrix construction with get_dummies and groupby features.
  • Optimize with category dtype, vectorization, and .loc.
  • Extract logic into functions with type hints and tests.
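As a sketch of a typed, reusable feature helper built on get_dummies and the category dtype (assuming a frame with the country column from the worked examples; the function name is illustrative):

import pandas as pd

def add_country_dummies(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with one-hot columns for the country column."""
    out = df.copy()
    out["country"] = out["country"].astype("category")  # smaller, faster groupbys
    dummies = pd.get_dummies(out["country"], prefix="country")
    return pd.concat([out, dummies], axis=1)

Keeping helpers like this in a small module makes them easy to test and reuse across notebooks.
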
Tip: When to switch from loops to vectorization

If a loop touches every row, there’s almost always a vectorized equivalent using masks, arithmetic, or groupby. Aim for one expression per transformation.
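For instance, a row loop that flags large transactions collapses into a single masked expression (a sketch assuming an amount column):

import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [5.0, 50.0, 500.0]})

# Loop version: visits every row in Python
flags = []
for value in df["amount"]:
    flags.append(1 if value > 100 else 0)

# Vectorized version: one expression, evaluated in compiled code
df["is_large"] = np.where(df["amount"] > 100, 1, 0)
print(df)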

Worked examples

1) Read CSV safely and efficiently

import pandas as pd

# Good: explicit dtypes and date parsing
schema = {
    "user_id": "int64",
    "country": "category",
    "amount": "float64"
}
df = pd.read_csv(
    "transactions.csv",
    usecols=["user_id", "country", "amount", "ts"],
    dtype=schema,
    parse_dates=["ts"]
)

print(df.dtypes)
print("MB:", round(df.memory_usage(deep=True).sum() / 1e6, 2))
Why this works

By setting dtypes and parsing dates, you avoid implicit type guessing, reduce memory, and get correct time semantics.

2) Vectorize a conditional calculation with NumPy

import numpy as np

x = np.array([10, 20, 30, 40])
mask = x >= 25
# If x>=25, add 5 else subtract 5
y = np.where(mask, x + 5, x - 5)
print(y)  # [ 5 15 35 45]
Key idea

np.where applies the condition to the whole array at once—no Python loops, much faster on large arrays.

3) Groupby with multiple aggregations

import pandas as pd

# Suppose df has columns: user_id, amount, ts (datetime)
by_user = (
    df.groupby("user_id")["amount"]
      .agg(txn_count="count", total_spend="sum", avg_spend="mean")
      .reset_index()
)
print(by_user.head())
Why reset_index()

Resetting gives a flat table, convenient for merges and model features.

4) Forward-fill missing values within each user

df = df.sort_values(["user_id", "ts"])  # critical before ffill
df["amount_ffill"] = (
    df.groupby("user_id")["amount"].ffill()
)
# Optionally fill remaining NaN (e.g., first row per group)
df["amount_ffill"] = df["amount_ffill"].fillna(0.0)
Common mistake

Forgetting to sort by the grouping key and time column before ffill leads to wrong fills.

5) Daily resample of event counts

# Ensure datetime is the index
s = df.set_index("ts").assign(count=1)["count"]
per_day = s.resample("D").sum().fillna(0)
print(per_day.tail())
When not to resample

Resample requires a DatetimeIndex. If you have per-user time series, resample after splitting or use groupby with pd.Grouper.
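A sketch of the pd.Grouper pattern (assuming the same df with user_id, amount, and a datetime ts column):

import pandas as pd

daily_per_user = (
    df.groupby(["user_id", pd.Grouper(key="ts", freq="D")])["amount"]
      .sum()
      .reset_index()
)
print(daily_per_user.head())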

Drills and micro-exercises

  • Load a CSV with 1M+ rows, set at least two columns to category, and report memory before/after.
  • Replace a Python for-loop transformation with a vectorized NumPy or pandas expression.
  • Implement a per-group rolling 7-day sum using groupby + rolling (see the sketch after this list).
  • Create at least 3 features with groupby-agg (count, mean, unique count) and merge back.
  • Prove you avoided chained indexing by making one .loc assignment that updates in place.
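For the rolling drill above, one common pattern is a time-based window per group (a sketch assuming user_id, ts, and amount columns, with ts as a datetime):

rolled = (
    df.sort_values(["user_id", "ts"])
      .set_index("ts")
      .groupby("user_id")["amount"]
      .rolling("7D")  # time-based window; requires a DatetimeIndex
      .sum()
      .reset_index()
      .rename(columns={"amount": "amount_7d_sum"})
)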

Common mistakes and debugging tips

  • Chained indexing: df[df.a > 0]["b"] = 1 silently fails. Use df.loc[df.a > 0, "b"] = 1.
  • Wrong dtypes: Objects for numbers slow everything. Fix with to_numeric and category for low-cardinality strings (see the sketch after this list).
  • Missing data leakage: Using fillna with global means can leak target information. Compute fills within train folds or by group.
  • Time chaos: Forgetting timezone conversion or not sorting before resample/ffill produces incorrect sequences.
  • Loop over rows: Avoid iterrows/apply row-wise for large data; prefer vectorization or groupby.
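For the dtype fix above, a minimal sketch (price and status are hypothetical columns standing in for a numeric-as-object column and a low-cardinality string column):

import pandas as pd

df["price"] = pd.to_numeric(df["price"], errors="coerce")  # unparseable strings become NaN
df["status"] = df["status"].astype("category")             # compact dtype for repeated strings
print(df.dtypes)
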
Debug like a pro
  • df.info(), df.head(), df.tail() for quick structure checks.
  • df.sample(5, random_state=0) to spot anomalies fast.
  • df.isna().mean() to see missingness by column.
  • df["col"].value_counts(dropna=False).head(20) for categorical distributions.

Mini-project: build a purchase-prediction feature matrix

Goal: Build a tidy feature matrix for predicting whether a user will purchase next week.

  1. Load transactions (user_id, ts, amount, category) with proper dtypes and parse ts.
  2. Create weekly aggregates per user: count, sum, mean, unique categories.
  3. Engineer recency (days since last purchase) and rolling 4-week spend (see the sketch after this list).
  4. Join user attributes (join on user_id). Ensure left joins preserve all users.
  5. Handle missing: ffill per user where sensible; otherwise fill with zeros.
  6. Construct X with numeric features and one-hot encodings for key categories.
  7. Export to Parquet (features.parquet) for fast reuse.
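For step 3, one possible sketch (assuming the transactions frame with user_id, ts, and amount; cutoff and the new column names are illustrative):

import pandas as pd

# Recency: days from each user's last purchase to a chosen cutoff date
cutoff = df["ts"].max()
last_purchase = df.groupby("user_id")["ts"].max()
recency = (
    (cutoff - last_purchase)
    .dt.days
    .rename("days_since_last_purchase")
    .reset_index()
)

# Rolling 4-week spend built on weekly per-user aggregates
weekly = (
    df.groupby(["user_id", pd.Grouper(key="ts", freq="W")])["amount"]
      .sum()
      .rename("weekly_spend")
      .reset_index()
)
weekly["spend_4w"] = (
    weekly.sort_values(["user_id", "ts"])
          .groupby("user_id")["weekly_spend"]
          .transform(lambda s: s.rolling(4, min_periods=1).sum())
)
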
Deliverables checklist
  • A notebook with reusable helper functions (typed, documented).
  • A features DataFrame with clear, snake_case columns.
  • A short note on data quality issues and how you handled them.

Who this is for

  • Aspiring or practicing Data Scientists who need to wrangle data quickly and reliably.
  • Analysts and ML Engineers transitioning into model development.
  • Anyone working with medium-to-large tabular or time series data.

Prerequisites

  • Basic Python (functions, lists, dicts, modules).
  • Comfort with Jupyter or a Python IDE.
  • Basic statistics (mean, median) helpful but not required.

Learning path

  1. Foundation: Data loading, dtypes, memory.
  2. Wrangling essentials: filtering, joins, pivots, groupby.
  3. NumPy vectorization: broadcasting, masks, reductions.
  4. Missing/time series: ffill/bfill, resample, rolling.
  5. Feature matrix: encodings, aggregations, leakage-aware splits.
  6. Performance and reuse: categories, .loc, functions, reproducible notebooks.

Practical projects to cement skills

  • Sales analytics: Build weekly store-level KPIs and a feature matrix for sales forecasting.
  • Web logs: Sessionize clickstream data and compute per-user engagement metrics.
  • Sensor time series: Resample, interpolate, and compute rolling anomaly scores.

Next steps

  • Practice on a real dataset (100k+ rows). Time your operations; optimize.
  • Refactor your notebook logic into a small utility module for reuse.
  • Take the skill exam below to validate mastery. Exam is free; logged-in users get saved progress.

Python (pandas and numpy) — Skill Exam

15 questions, ~25 minutes. Passing score: 70%. You can take this exam for free. If you are logged in, your progress and score will be saved; otherwise, you can still complete the exam, but progress won't be stored.

Tips: Prefer vectorized answers, watch for time series index requirements, and avoid chained indexing.

