
Python pandas

Learn Python pandas for Product Analysts for free: roadmap, examples, subskills, and a skill exam.

Published: December 22, 2025 | Updated: December 22, 2025

Why pandas matters for a Product Analyst

Pandas is the go-to Python library for turning raw product data into clear answers. As a Product Analyst, you will explore event logs, compute funnels and retention, analyze experiments, build cohort tables, and automate weekly dashboards. Pandas gives you fast data wrangling, flexible grouping, and powerful time handling on your laptop—no waiting for long queries or external tools.

What you'll be able to do

  • Clean and standardize product event logs (timestamps, user IDs, device/platform fields).
  • Aggregate at user, session, and cohort levels to compute metrics like DAU/WAU/MAU.
  • Build funnels, retention curves, and cohort tables fully in pandas.
  • Run quick experiment reads: group results, compute effects, and visualize distributions.
  • Automate recurring product reports with reusable functions and schedules.

Who this is for

  • Aspiring or current Product Analysts who need hands-on data work without heavy infrastructure.
  • Analysts comfortable with spreadsheets or SQL who want faster iteration and automation.
  • PMs and data-savvy teammates who want reproducible analysis and fewer manual steps.

Prerequisites

  • Basic Python (variables, lists/dicts, functions).
  • Comfort with CSVs/TSVs and data types.
  • Optional but useful: SQL basics and product metrics concepts (DAU, retention, funnels).

Learning path

  1. Load and inspect data — Use read_csv, dtypes, .head(), .info(), .describe().
    Tip: Start with a small sample using nrows=50_000, or use usecols to limit columns. Validate unique keys (user_id, event_id).
  2. Clean event logs — Parse timestamps, standardize strings, handle missing values, deduplicate.
    Tip: Use to_datetime with utc=True, and .str.lower().str.strip() for categorical normalization.
  3. Aggregate metrics — groupby for DAU/WAU/MAU, user-level features, and per-segment metrics.
    Tip: Use .dt.date or pd.Grouper(freq='D') for time-based grouping (see the short sketch after this list). Cache intermediate DataFrames.
  4. Funnels and retention — Order events, build steps, compute conversion and drop-offs; generate cohort tables.
    Tip: Index events by user and time before step logic. Use pivot_table for cohort matrices.
  5. Experiment analysis — Assign variants, check balance, compute lift with CIs.
    Tip: Use groupby on variant; for proportions, the standard error is sqrt(p(1-p)/n).
  6. Automate reports — Wrap logic in functions, parameterize dates, save outputs to CSV/XLSX, and schedule.
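
A minimal sketch of the pd.Grouper approach from step 3's tip, using a tiny made-up events frame (the worked examples below mostly use .dt.date and to_period; both approaches are fine):

import pandas as pd

# Hypothetical mini event log with a timezone-aware timestamp column
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "ts": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-02 10:00", "2024-01-01 12:00",
        "2024-01-08 08:00", "2024-01-09 18:00",
    ], utc=True),
})

# Daily and weekly unique users without adding helper date columns
daily_users = events.groupby(pd.Grouper(key="ts", freq="D"))["user_id"].nunique()
weekly_users = events.groupby(pd.Grouper(key="ts", freq="W"))["user_id"].nunique()

print(daily_users)
print(weekly_users)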

Worked examples

1) Load and clean a product event log
import pandas as pd

# Load only needed columns for speed
cols = ["event_id", "user_id", "event_name", "ts", "platform", "country"]
df = pd.read_csv("events.csv", usecols=cols)

# Parse timestamp and normalize categoricals
df["ts"] = pd.to_datetime(df["ts"], utc=True, errors="coerce")
df["event_name"] = df["event_name"].str.lower().str.strip()
df["platform"] = df["platform"].str.lower().str.strip().fillna("unknown")
df = df.dropna(subset=["ts", "user_id"])  # drop broken rows

# Deduplicate exact duplicates (rare but safer)
df = df.drop_duplicates(subset=["event_id"])  # assumes event_id is unique

df.info()  # info() prints its summary directly; no need to wrap it in print

Outcome: a consistent, time-aware DataFrame ready for grouping.

2) DAU/WAU and basic segmentation
# Daily active users (DAU) overall and by platform
df["date"] = df["ts"].dt.date

dau = df.groupby("date")["user_id"].nunique().rename("dau")

dau_by_platform = (
    df.groupby(["date", "platform"])["user_id"]
      .nunique()
      .reset_index(name="dau")
)

# Approximate WAU: rolling 7-day sum of daily uniques. Users active on several
# days in the window are counted more than once; for exact weekly uniques,
# group by week and take nunique instead.
wau = dau.rolling(7).sum().rename("wau")

print(dau.tail())
print(wau.tail())

Outcome: simple active metrics, often the first chart in a weekly review.

3) User-level features and D1 retention
# First seen date per user
first_seen = df.groupby("user_id")["ts"].min().dt.date.rename("signup_date")

# Day-level activity table
active = df.drop_duplicates(["user_id", "date"])[["user_id", "date"]]

# Join and compute D1 retained: active on signup_date + 1 day
users = first_seen.to_frame().reset_index()
active_users = active.merge(users, on="user_id", how="left")
active_users["d_from_signup"] = (
    pd.to_datetime(active_users["date"]) - pd.to_datetime(active_users["signup_date"]) 
).dt.days

ret_d1 = (
    active_users.assign(d1=active_users["d_from_signup"].eq(1))
    .groupby(["signup_date", "user_id"])["d1"].any()   # was this user active on day 1?
    .groupby(level="signup_date").mean()               # share of the cohort retained
    .rename("retention_d1")
)

print(ret_d1.head())

Outcome: user-level joins to compute D1 retention per signup cohort.

4) Funnel conversion with drop-offs
# Funnel steps in order
steps = ["view_item", "add_to_cart", "checkout", "purchase"]
good_events = df[df["event_name"].isin(steps)].copy()

# Keep first occurrence of each step per user
first_steps = (good_events.sort_values(["user_id", "ts"]) 
                         .drop_duplicates(["user_id", "event_name"]))

# First timestamp of each step per user (NaT if the user never reached it)
reach = first_steps.pivot_table(index="user_id",
                                columns="event_name",
                                values="ts", aggfunc="min")

# Users who reached each pair of adjacent steps (both steps present; strict
# chronological order between steps is not enforced here)
n_view = reach["view_item"].notna().sum()
n_add = (reach["view_item"].notna() & reach["add_to_cart"].notna()).sum()
n_chk = (reach["add_to_cart"].notna() & reach["checkout"].notna()).sum()
n_buy = (reach["checkout"].notna() & reach["purchase"].notna()).sum()

conv_view_add = n_add / n_view if n_view else 0
conv_add_chk = n_chk / n_add if n_add else 0
conv_chk_buy = n_buy / n_chk if n_chk else 0

print({
  "viewers": int(n_view), 
  "added": int(n_add), 
  "checkout": int(n_chk), 
  "purchased": int(n_buy),
  "step_conv": [conv_view_add, conv_add_chk, conv_chk_buy]
})

Outcome: a clear funnel with conversion rates and drop-offs at each step.

5) Cohort table (signup month x retention week)
# Cohort by signup month and active week offset
first_seen = df.groupby("user_id")["ts"].min().rename("signup_ts")
users = first_seen.to_frame().reset_index()
users["cohort_month"] = users["signup_ts"].dt.to_period("M").dt.to_timestamp()

# Active week for each activity record
active = df.drop_duplicates(["user_id", "date"]).copy()
active = active.merge(users[["user_id", "signup_ts", "cohort_month"]], on="user_id", how="left")
active["week_active"] = active["ts"].dt.to_period("W").dt.start_time
active["week_signup"] = active["signup_ts"].dt.to_period("W").dt.start_time
active["week_index"] = ((active["week_active"] - active["week_signup"]) / pd.Timedelta(weeks=1)).astype(int)

cohort_pivot = (active.groupby(["cohort_month", "week_index"]) 
                        ["user_id"].nunique()
                        .unstack(fill_value=0))

# Normalize by cohort size
cohort_size = cohort_pivot[0]
retention = cohort_pivot.divide(cohort_size, axis=0)
print(retention.iloc[:, :6].round(3))  # first 6 weeks

Outcome: a standard retention matrix per signup cohort.

6) A/B experiment quick read
# Assume we have one row per user with assignment and outcome
exp = pd.read_csv("experiment_users.csv")  # columns: user_id, variant, converted

summary = exp.groupby("variant").agg(
    n=("user_id", "count"), 
    conv_rate=("converted", "mean")
)

# Approximate standard error and 95% CI for proportions
summary["se"] = (summary["conv_rate"] * (1 - summary["conv_rate"]) / summary["n"]) ** 0.5
summary["ci95_low"] = summary["conv_rate"] - 1.96 * summary["se"]
summary["ci95_high"] = summary["conv_rate"] + 1.96 * summary["se"]

# Lift vs control
control = summary.loc["control", "conv_rate"]
summary["lift_pct"] = (summary["conv_rate"] / control - 1) * 100

print(summary.round(4))

Outcome: quick lift and confidence intervals to inform a decision.
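
Before trusting the lift, run a quick experiment-health check for sample ratio mismatch (SRM), as the drills below suggest. A minimal sketch, assuming a 50/50 intended split between "control" and a "treatment" variant and that scipy is installed; adjust the expected shares to your actual allocation:

from scipy.stats import chisquare

# Compare the observed variant counts against the designed split
counts = exp["variant"].value_counts()
observed = [counts["control"], counts["treatment"]]      # "treatment" is an assumed variant name
expected = [counts.sum() * 0.5, counts.sum() * 0.5]      # assumed 50/50 allocation

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print({"srm_chi2": round(float(stat), 2), "srm_p_value": round(float(p_value), 4)})
# A very small p-value (e.g. < 0.001) means the observed split deviates from the
# design; investigate assignment or logging before reading the lift.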

Drills and exercises

  • Load a 1M-row CSV with only needed columns and correct dtypes.
  • Convert all timestamps to UTC; extract date and week start.
  • Compute DAU/WAU by platform and top 5 countries.
  • Build a 4-step funnel and report step conversions and drop-offs.
  • Create D1/D7 retention for last 8 signup cohorts.
  • Run a quick A/B read with lift and 95% CIs; check sample ratio mismatch.
  • Export final tables to CSV and an Excel workbook with multiple sheets.

Common mistakes and debugging tips

  • Timezone drift: Mixing naive and timezone-aware datetimes. Fix by using to_datetime(..., utc=True) and converting to local only for display.
  • Double counting users: Counting events instead of unique users for DAU. Use nunique() on user_id.
  • Funnel order issues: Not sorting by timestamp before deduping steps. Always sort by [user_id, ts].
  • Cohort leakage: Using calendar week instead of weeks since signup. Compute a relative week_index from first activity.
  • Wrong experiment metrics: Comparing means with outliers for heavy-tailed revenue. Consider median, trimmed mean, or winsorization; report variance.
  • Memory errors: Loading all columns or using object dtype everywhere. Use usecols, dtype mappings, and category for low-cardinality strings (see the sketch after the checklist below).
Quick debugging checklist
  • Print df.head(), df.tail(), and df.sample(5) after major transforms.
  • Check df.isna().mean() by column to spot missing data early.
  • Validate groupby results by cross-checking small user samples.
  • Write assertion guards, e.g., assert df["user_id"].notna().all().
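
A minimal sketch combining the memory tips and the checklist guards above (column names follow the events.csv schema used in the worked examples):

import pandas as pd

# Declare dtypes up front: category for low-cardinality strings, string for IDs
dtypes = {
    "event_id": "string",
    "user_id": "string",
    "event_name": "category",
    "platform": "category",
    "country": "category",
}
df = pd.read_csv("events.csv", usecols=list(dtypes) + ["ts"], dtype=dtypes)
df["ts"] = pd.to_datetime(df["ts"], utc=True, errors="coerce")

# Spot missing data early and fail fast instead of producing silently wrong numbers
print(df.isna().mean().round(3))
assert df["user_id"].notna().all(), "null user_id found"
assert df["event_id"].is_unique, "duplicate event_id found"

print(round(df.memory_usage(deep=True).sum() / 1e6, 1), "MB")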

Mini project: Weekly Product Metrics Report

Goal: Produce a weekly report that includes DAU/WAU, a key funnel, D1/D7 retention, and an ongoing experiment read. Save CSVs and a single Excel workbook.

  1. Write a loader that reads the last 90 days of events with usecols and proper dtypes.
  2. Clean and normalize event_name, platform, country; ensure unique event_id.
  3. Compute DAU/WAU, funnel steps, and retention cohorts.
  4. If experiment assignment exists, compute conversion, lift, and CI.
  5. Export outputs to ./out with timestamps in filenames and an Excel summary file (a minimal export sketch follows the deliverables list).
Deliverables
  • dau.csv, wau.csv
  • funnel.csv (counts and conversion per step)
  • retention.csv (cohort x week matrix)
  • experiment_summary.csv and a combined metrics.xlsx
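
A minimal sketch of the export step. It assumes the earlier steps produced the dau and wau Series and the retention matrix from the worked examples, plus hypothetical funnel and experiment_summary DataFrames; writing .xlsx requires openpyxl or xlsxwriter.

from pathlib import Path
import pandas as pd

out_dir = Path("out")
out_dir.mkdir(exist_ok=True)
stamp = pd.Timestamp.today().strftime("%Y%m%d")

# Individual CSVs with a date stamp in the filename
dau.to_csv(out_dir / f"dau_{stamp}.csv")
wau.to_csv(out_dir / f"wau_{stamp}.csv")
funnel.to_csv(out_dir / f"funnel_{stamp}.csv", index=False)             # hypothetical funnel table
retention.to_csv(out_dir / f"retention_{stamp}.csv")
experiment_summary.to_csv(out_dir / f"experiment_summary_{stamp}.csv")  # hypothetical experiment read

# One combined workbook, one sheet per table
with pd.ExcelWriter(out_dir / f"metrics_{stamp}.xlsx") as writer:
    dau.to_frame().to_excel(writer, sheet_name="dau")
    funnel.to_excel(writer, sheet_name="funnel", index=False)
    retention.to_excel(writer, sheet_name="retention")
    experiment_summary.to_excel(writer, sheet_name="experiment")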

Practical projects

  • Self-serve product KPI toolkit: a set of reusable functions for DAU/WAU/MAU, funnels, retention.
  • Experiment dashboard generator: read assignments and outcomes, export a one-pager per test.
  • Launch analysis: cohort tables and funnel changes before/after a feature rollout.

Subskills

  • Product Data Extraction — Efficiently load, subset, and type-cast raw product data for fast analysis.
  • Event Log Cleaning — Normalize timestamps and strings, deduplicate, and validate schema.
  • User Level Aggregations — Build user/session features and daily/weekly active metrics.
  • Funnel And Retention Computation — Create step funnels and D1/D7 retention curves reliably.
  • Cohort Tables In Pandas — Generate cohort matrices and normalize by cohort size.
  • Experiment Analysis Pipelines — Assign variants, compute lift, and check experiment health.
  • Metric Decomposition — Break changes into mix, rate, and volume effects to explain deltas (see the sketch after this list).
  • Automation Of Product Reports — Package logic into functions and scheduled exports.
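
Metric decomposition isn't covered by the worked examples above; here is a minimal sketch of a rate-versus-volume split for a conversion count, using made-up numbers. A per-segment version of the same arithmetic adds a mix term when segment shares shift.

# Hypothetical week-over-week change in purchases, split into
# volume effect (more users) and rate effect (different conversion rate)
users_prev, users_curr = 40_000, 44_000
rate_prev, rate_curr = 0.052, 0.049

delta = users_curr * rate_curr - users_prev * rate_prev

volume_effect = (users_curr - users_prev) * rate_prev  # extra traffic at the old rate
rate_effect = users_curr * (rate_curr - rate_prev)     # rate change at the new traffic

assert abs((volume_effect + rate_effect) - delta) < 1e-6  # the two effects sum to the delta
print({"delta": round(delta, 1),
       "volume_effect": round(volume_effect, 1),
       "rate_effect": round(rate_effect, 1)})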

Next steps

  • Work through the subskills listed above in order.
  • Complete the mini project and reuse its functions in future work.
  • Take the skill exam to validate your readiness. Anyone can take it; saved progress is for logged-in users.

Python pandas — Skill Exam

This exam checks practical pandas skills for Product Analysts: loading, cleaning, aggregations, funnels, retention, cohorts, experiments, and automation. You can take the exam for free; only logged-in users get saved progress and a record of results. Rules: closed-book except your own notes; 70% to pass; no time limit. Many questions are code-reading or multi-select. Choose the best answer(s).

15 questions · 70% to pass
