Why this matters
As a Product Analyst, you often need to deliver daily or weekly dashboards: signups, activation, conversion, retention, revenue by channel. Automating these with pandas saves hours, reduces manual errors, and makes results reproducible. Typical automations include:
- Daily metrics rollups (DAU, events, conversion).
- Weekly channel performance and CAC proxy tracking.
- Cohort summaries and retention tables for stakeholders.
- Automated anomaly flags to catch data issues early.
Concept explained simply
Think of automation as a small factory:
- Inputs: raw files or queries (CSV, Parquet, API, SQL extract).
- Transforms: clean, validate, aggregate in pandas.
- Outputs: CSV/Excel files or charts saved to a folder.
- Schedule: run daily/weekly with consistent parameters.
Mental model
Make the pipeline deterministic: the same input produces the same output. Parameterize the date and period, keep functions small, and add checks to fail fast when data looks wrong.
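For instance, a small pure function that takes the date as an explicit parameter and checks its input behaves the same on every rerun. A minimal sketch, assuming the event columns used in the worked examples below (user_id, event, ts):
import pandas as pd

def daily_signups(events: pd.DataFrame, run_date: str) -> int:
    # Deterministic: the date is a parameter, not "whatever today happens to be"
    required = {"user_id", "event", "ts"}
    missing = required - set(events.columns)
    if missing:
        raise ValueError(f"Input is missing columns: {missing}")  # fail fast
    ts = pd.to_datetime(events["ts"], errors="coerce")
    on_day = ts.dt.date == pd.Timestamp(run_date).date()
    return int(events.loc[on_day & (events["event"] == "signup"), "user_id"].nunique())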
A practical workflow
- Define KPIs and columns needed (inputs).
- Write a small loader that picks the correct files (by date or latest).
- Transform with pure pandas (groupby, pivot_table, joins).
- Validate (row counts, unique keys, non-null rates).
- Export tidy outputs with clear filenames that include the date.
- Schedule and log basic run info (start/end, counts); a minimal skeleton of this flow is sketched below.
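To make the workflow concrete, here is a minimal skeleton of that flow. Everything in it is an assumption borrowed from the worked examples below (the date-stamped data/events_DATE.csv layout, the user_id/event/ts columns, the out/ folder); adapt it to your own schema.
import sys
from datetime import date
from pathlib import Path

import pandas as pd

def load(run_date: str) -> pd.DataFrame:
    # Assumed layout: one date-stamped CSV per day with user_id, event, ts columns
    return pd.read_csv(f"data/events_{run_date}.csv")

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    raw = raw.copy()
    raw["ts"] = pd.to_datetime(raw["ts"], errors="coerce")
    raw = raw.dropna(subset=["user_id", "event", "ts"])
    raw["date"] = raw["ts"].dt.date
    return raw.groupby("date").agg(dau=("user_id", "nunique")).reset_index()

def validate(df: pd.DataFrame) -> None:
    assert not df.empty, "No rows left after cleaning"
    assert df["date"].is_unique, "Duplicate dates in the aggregate"

def export(df: pd.DataFrame, run_date: str) -> str:
    Path("out").mkdir(parents=True, exist_ok=True)
    out_path = f"out/daily_metrics_{run_date}.csv"
    df.to_csv(out_path, index=False)
    return out_path

def run(run_date: str) -> None:
    df = transform(load(run_date))
    validate(df)
    print("Wrote", export(df, run_date))

if __name__ == "__main__":
    # Pass a date as YYYY-MM-DD, or default to today
    run(sys.argv[1] if len(sys.argv) > 1 else date.today().isoformat())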
Worked examples
Example 1 — Daily metrics report (DAU, signups, conversion)
Goal: Generate a daily CSV/Excel with DAU, events, signups, and conversion.
import pandas as pd
from pathlib import Path
from datetime import date
# 1) Load latest events file
# Expect columns: user_id, event, ts (timestamp ISO)
files = sorted(Path("data").glob("events_*.csv"))
assert files, "No events files found in data/"
raw = pd.read_csv(files[-1])
# 2) Basic cleanup and typing
raw["ts"] = pd.to_datetime(raw["ts"], errors="coerce")
raw = raw.dropna(subset=["user_id", "event", "ts"])  # drop rows with missing keys or unparseable timestamps
raw["date"] = raw["ts"].dt.date
# 3) Aggregate
metrics = (
    raw.groupby("date").agg(
        dau=("user_id", "nunique"),
        events=("event", "count"),
        signups=("event", lambda s: (s == "signup").sum()),
    ).reset_index()
)
metrics["conv_rate"] = (metrics["signups"] / metrics["dau"]).fillna(0).round(4)
# 4) By-event shape (optional breakdown)
by_event = (
    raw.pivot_table(
        index="date",
        columns="event",
        values="user_id",
        aggfunc="nunique",
        fill_value=0,
    ).reset_index()
)
# 5) Validations
assert metrics["dau"].ge(0).all(), "DAU must be non-negative"
assert metrics["date"].is_unique, "Duplicate date aggregation detected"
# 6) Export
run_date = date.today().isoformat()
Path("out").mkdir(parents=True, exist_ok=True)  # ensure the output folder exists
metrics.to_csv(f"out/daily_metrics_{run_date}.csv", index=False)
by_event.to_csv(f"out/daily_by_event_{run_date}.csv", index=False)
print("Saved daily reports for", run_date)
Result: two files in out/ with today’s date. Idempotent: re-running the same day rewrites the same files.
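Since the goal mentions CSV/Excel, you can also bundle both tables into one workbook. A sketch that continues the script above, assuming an Excel engine such as openpyxl is installed:
# Optional: one Excel workbook with both tables as separate sheets
with pd.ExcelWriter(f"out/daily_report_{run_date}.xlsx") as writer:
    metrics.to_excel(writer, sheet_name="daily_metrics", index=False)
    by_event.to_excel(writer, sheet_name="by_event", index=False)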
Example 2 — Weekly channel performance
Goal: Combine sessions with campaign UTM tags to summarize weekly users and signup rate by channel.
import pandas as pd
from pathlib import Path
# Load (examples; adapt paths)
sessions = pd.read_csv("data/sessions.csv") # user_id, started_at, utm_source
users = pd.read_csv("data/users.csv") # user_id, signup_at
# Types
sessions["started_at"] = pd.to_datetime(sessions["started_at"], errors="coerce")
users["signup_at"] = pd.to_datetime(users["signup_at"], errors="coerce")
# Join to know who signed up
df = sessions.merge(users[["user_id", "signup_at"]], on="user_id", how="left")
df["week"] = df["started_at"].dt.to_period("W-SUN").dt.start_time
# Weekly by channel: count each user once, and count a user as a signup only if
# they have a signup_at, so the rate stays within [0, 1]
df["signup_user"] = df["user_id"].where(df["signup_at"].notna())
weekly = (
    df.groupby(["week", "utm_source"]).agg(
        unique_users=("user_id", "nunique"),
        signups=("signup_user", "nunique"),
    ).reset_index()
)
weekly["signup_rate"] = (weekly["signups"] / weekly["unique_users"]).fillna(0).round(4)
# Basic checks
assert set(["week", "utm_source", "unique_users", "signups", "signup_rate"]).issubset(weekly.columns)
Path("out").mkdir(parents=True, exist_ok=True)  # ensure the output folder exists
weekly.to_csv("out/weekly_channel.csv", index=False)
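Stakeholders often prefer one column per channel. The tidy table above can be reshaped into a wide week-by-channel grid; a small sketch continuing the script above (the output filename is arbitrary):
# Optional wide view: one row per week, one column per channel
wide_rate = weekly.pivot(index="week", columns="utm_source", values="signup_rate").fillna(0)
wide_rate.to_csv("out/weekly_signup_rate_wide.csv")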
Example 3 — Simple anomaly flags (day-over-day)
Goal: Flag suspicious drops or spikes in DAU beyond a threshold.
import pandas as pd
metrics = pd.read_csv("out/daily_metrics_2025-01-01.csv") # or the latest produced
metrics["date"] = pd.to_datetime(metrics["date"])
metrics = metrics.sort_values("date")
metrics["dau_dod_pct"] = metrics["dau"].pct_change().round(4)
threshold = 0.3 # 30% change
metrics["flag_anomaly"] = metrics["dau_dod_pct"].abs().gt(threshold)
metrics.to_csv("out/daily_metrics_with_flags.csv", index=False)
This adds a boolean column flagging days to review.
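Day-over-day changes are also noisy around weekends and holidays. A common alternative, sketched below as a continuation of the script above, compares each day to its trailing 7-day average; the window and threshold are arbitrary choices, not rules:
# Alternative flag: compare each day to its trailing 7-day average (excluding the day itself)
baseline = metrics["dau"].rolling(window=7, min_periods=3).mean().shift(1)
metrics["dau_vs_7d_pct"] = ((metrics["dau"] - baseline) / baseline).round(4)
metrics["flag_vs_baseline"] = metrics["dau_vs_7d_pct"].abs().gt(threshold)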
Reusable snippets
- Load the latest file in a folder: use Path.glob and pick the last file after sorting (see the helper sketch below).
- Parameterize dates: accept a date string or default to today/yesterday.
- Consistent exports: include date in filenames and keep tidy column names.
- Validations: assert expected columns exist; assert unique keys; check row counts.
from datetime import date, timedelta

def default_run_date():
    return (date.today() - timedelta(days=1)).isoformat()
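And for the "load latest file" snippet, a minimal sketch that assumes date-stamped filenames which sort chronologically when sorted lexicographically:
from pathlib import Path

def latest_file(folder: str = "data", pattern: str = "events_*.csv") -> Path:
    files = sorted(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern} in {folder}/")
    return files[-1]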
Quality checks that save you
- Schema check: expected_cols ⊆ df.columns; if missing, stop.
- Key uniqueness: per-day aggregates should have unique dates.
- Range checks: metrics not negative; rates in [0,1].
- Row-volume sanity: a sudden 90% drop is more likely a data issue than real behavior.
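A small helper that bundles these checks might look like the sketch below; the column names (date, dau, conv_rate) follow Example 1, and the minimum row count is a placeholder:
import pandas as pd

def validate_metrics(df: pd.DataFrame, expected_cols: set, min_rows: int = 1) -> None:
    missing = expected_cols - set(df.columns)
    assert not missing, f"Missing columns: {missing}"                  # schema check
    assert df["date"].is_unique, "Duplicate dates"                     # key uniqueness
    assert df["dau"].ge(0).all(), "Negative DAU"                       # range check
    assert df["conv_rate"].between(0, 1).all(), "Rate outside [0, 1]"  # range check
    assert len(df) >= min_rows, "Suspiciously few rows"                # row-volume sanity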
Scheduling options (lightweight)
- Run the script daily/weekly with your OS scheduler (cron on Linux/macOS, Task Scheduler on Windows).
- Log start/end time and written filenames to a simple text log.
- Keep scripts idempotent: same input, same output filenames.
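For the text log, a minimal sketch that appends one line per run; the log path, format, and the example filename are arbitrary:
from datetime import datetime

def log_run(logfile: str, started: datetime, written_files: list) -> None:
    finished = datetime.now()
    line = f"{started.isoformat()} -> {finished.isoformat()} wrote: {', '.join(written_files)}\n"
    with open(logfile, "a") as f:
        f.write(line)

# Usage: capture started = datetime.now() before the pipeline runs, then
# log_run("out/run_log.txt", started, ["daily_metrics_2025-01-01.csv"])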
Exercises
Complete these tasks to practice. The same tasks appear below in the Exercises section with expected outputs and solutions.
- Exercise 1: Build a daily metrics automation that produces two CSVs: overall daily metrics and by-event breakdown, with clear date-stamped filenames.
- Exercise 2: Turn your code into a reusable function with input validations and a date parameter (default to yesterday). Write one CSV named using the provided date.
- Checklist:
  - Uses pandas groupby or pivot_table.
  - Includes at least two validation checks.
  - Outputs have deterministic, date-stamped filenames.
  - Script is re-runnable without manual edits.
Common mistakes and self-check
- Mixing local time and UTC: parse and convert timestamps once, consistently (see the sketch after this list).
- Appending day after day to one file without dedupe: prefer one file per date or a deduplicated append.
- Silent schema drift: always assert expected columns.
- Non-idempotent filenames like report_final.csv: include the run date instead.
- Forgetting null handling before groupby: drop or fill sensibly.
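For the first point, one consistent approach is to parse everything as UTC once and convert to a single reporting timezone before taking calendar dates. A minimal sketch (the timezone and sample values are only illustrations):
import pandas as pd

raw = pd.DataFrame({"ts": ["2025-01-01T23:30:00Z", "2025-01-02T00:15:00+01:00"]})
raw["ts"] = pd.to_datetime(raw["ts"], errors="coerce", utc=True)   # everything becomes UTC
raw["date"] = raw["ts"].dt.tz_convert("Europe/Berlin").dt.date     # one reporting timezone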
Self-check: If you run your script twice with the same input, do you get identical outputs and counts? If not, fix parameters or filenames.
Practical projects
- Automated daily activation report (signup to first key action within 7 days), exported to CSV.
- Weekly channel dashboard with CAC proxy (spend CSV + attributed signups) and rates.
- Retention snapshot: cohort by signup month, week 1/2/4 retention table with basic checks.
Mini challenge
Create a single command that, given a date (YYYY-MM-DD), produces three outputs: daily_metrics_DATE.csv, daily_by_event_DATE.csv, and daily_metrics_with_flags_DATE.csv. Include at least three validations. Keep it idempotent.
Hint path
- Write a function run(date_str) and call it under if __name__ == "__main__".
- Separate load, transform, validate, and export into small functions.
- Use try/except to print a friendly message and exit non-zero on validation failure.
Who this is for
- Product Analysts who need reliable, repeatable reporting.
- Data-savvy PMs or analysts transitioning from spreadsheets to code.
Prerequisites
- Comfort with Python basics (functions, files, dates).
- Pandas essentials (DataFrame, groupby, merge, pivot_table).
- Basic understanding of your product’s events and KPIs.
Learning path
- Parameterize and validate: make scripts deterministic.
- Aggregate and reshape: groupby and pivot_table fluency.
- Export and schedule: produce tidy outputs on a cadence.
- Add anomaly flags and basic run logs.
Next steps
- Add more validations (duplicate user_id-date checks, min/max thresholds).
- Bundle scripts into a small CLI (argparse) for teammates to use.
- Later, consider orchestration tools once your logic stabilizes.