Why this matters
As a Product Analyst, you will repeatedly evaluate A/B and multivariate experiments. A solid pandas-based pipeline lets you go from raw logs to trustworthy decisions quickly and consistently.
- Answer product questions: Did the new onboarding increase activation?
- Guardrail checks: Did error rates or latency regress?
- Consistent decisions: Reuse a vetted pipeline instead of one-off notebooks.
Concept explained simply
An experiment analysis pipeline is a repeatable set of steps that transforms raw exposure and event data into clean, comparable metrics with statistical uncertainty. Think: same recipe, different ingredients.
Mental model
Picture a conveyor belt:
- Specify: hypothesis, unit of randomization, success metrics, guardrails.
- Ingest: exposures (who saw what, when) + events (what they did, when).
- Clean: de-duplicate users, enforce exposure windows, filter bots/tests.
- Join: align events to each user’s first exposure.
- Aggregate: compute per-user metrics, then group-level summaries.
- Validate: SRM and sanity checks.
- Infer: effect sizes, confidence intervals, variance reduction if needed.
- Report: readable tables and a one-line recommendation.
The standard pipeline (step-by-step)
- Define the unit and exposure event (e.g., user-level; exposure when assignment cookie set).
- Load data into pandas: exposures and behavioral events.
- Keep the first valid exposure per user; exclude test traffic and bots if flagged.
- Join outcomes within a fixed window (e.g., 7 days after exposure_ts).
- Create per-user features: converted flag, revenue, sessions, errors.
- Aggregate by variant: sample size, conversion rate, revenue per user, guardrails.
- Run validity checks: SRM (sample imbalance), assignment leakage, date overlaps.
- Estimate uncertainty: bootstrap or analytical CIs; optionally apply CUPED.
- Summarize: lift vs control, CI, practical interpretation.
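Taken together, a minimal sketch of this pipeline as a single pandas function. The function name, column names, and the 7-day default are illustrative and mirror the worked example below, not a fixed API:
import pandas as pd

def run_ab_pipeline(exposures: pd.DataFrame, events: pd.DataFrame, window_days: int = 7) -> pd.DataFrame:
    # Keep the first exposure per user (the unit of randomization here is the user)
    first = (exposures.sort_values(["user_id", "exposure_ts"])
                      .drop_duplicates("user_id", keep="first"))
    # Attach outcomes, keeping only events inside the post-exposure window
    joined = events.merge(first, on="user_id", how="inner")
    in_win = joined[(joined["event_ts"] >= joined["exposure_ts"]) &
                    (joined["event_ts"] <= joined["exposure_ts"] + pd.Timedelta(days=window_days))]
    # Per-user metrics; users with no in-window events come back as zeros
    per_user = in_win.groupby(["user_id", "variant"], as_index=False).agg(
        revenue=("amount", "sum"),
        converted=("event", lambda s: int((s == "purchase").any()))
    )
    per_user = first[["user_id", "variant"]].merge(per_user, on=["user_id", "variant"], how="left")
    per_user[["revenue", "converted"]] = per_user[["revenue", "converted"]].fillna(0)
    # Variant-level summary; SRM checks and CIs are computed on top of this
    return per_user.groupby("variant").agg(
        users=("user_id", "nunique"),
        cr=("converted", "mean"),
        rpu=("revenue", "mean")
    ).reset_index()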
Worked examples
Example 1: Build an exposure→outcome dataset
# Sample exposures
# user_id,variant,exposure_ts
# 1,control,2023-01-01 10:00:00
# 2,control,2023-01-01 10:05:00
# 3,variant,2023-01-01 11:00:00
# 4,variant,2023-01-02 09:00:00
# 5,control,2023-01-02 12:00:00
# 3,control,2023-01-03 12:00:00 # duplicate user, keep first exposure only
# Sample events
# user_id,event,amount,event_ts
# 1,view,0,2023-01-01 10:01:00
# 1,purchase,30,2023-01-02 08:00:00
# 2,view,0,2023-01-01 10:06:00
# 3,view,0,2023-01-01 11:05:00
# 3,purchase,20,2023-01-08 12:00:00 # outside 7-day window by ~1 hour
# 4,view,0,2023-01-02 09:10:00
# 4,purchase,50,2023-01-02 10:00:00
# 5,view,0,2023-01-10 12:00:00
import pandas as pd
from io import StringIO
exposures_csv = StringIO("""user_id,variant,exposure_ts
1,control,2023-01-01 10:00:00
2,control,2023-01-01 10:05:00
3,variant,2023-01-01 11:00:00
4,variant,2023-01-02 09:00:00
5,control,2023-01-02 12:00:00
3,control,2023-01-03 12:00:00
""")
events_csv = StringIO("""user_id,event,amount,event_ts
1,view,0,2023-01-01 10:01:00
1,purchase,30,2023-01-02 08:00:00
2,view,0,2023-01-01 10:06:00
3,view,0,2023-01-01 11:05:00
3,purchase,20,2023-01-08 12:00:00
4,view,0,2023-01-02 09:10:00
4,purchase,50,2023-01-02 10:00:00
5,view,0,2023-01-10 12:00:00
""")
exp = pd.read_csv(exposures_csv, parse_dates=["exposure_ts"]).sort_values(["user_id","exposure_ts"])
exp_first = exp.drop_duplicates("user_id", keep="first")
evt = pd.read_csv(events_csv, parse_dates=["event_ts"])
# Join events to each user's exposure and keep only events inside a 7-day post-exposure window
merged = evt.merge(exp_first, on="user_id", how="inner")
within_win = merged[
    (merged["event_ts"] >= merged["exposure_ts"]) &
    (merged["event_ts"] <= merged["exposure_ts"] + pd.Timedelta(days=7))
]
# Per-user features
user_metrics = within_win.groupby(["user_id","variant"], as_index=False).agg(
    revenue=("amount", "sum"),
    conversions=("event", lambda s: int((s == "purchase").any()))
)
# Users with no events in window should still appear (0 outcomes)
all_users = exp_first[["user_id","variant"]].merge(user_metrics, on=["user_id","variant"], how="left")
all_users[["revenue"]] = all_users[["revenue"]].fillna(0)
all_users[["conversions"]] = all_users[["conversions"]].fillna(0).astype(int)
summary = all_users.groupby("variant").agg(
    users=("user_id", "nunique"),
    conversions=("conversions", "sum"),
    cr=("conversions", "mean"),
    rpu=("revenue", "mean")
).reset_index()
print(summary)
Expected conversion rate control ≈ 0.3333 (1/3), variant ≈ 0.5 (1/2); revenue per user control ≈ 10.0, variant ≈ 25.0.
Example 2: Bootstrap a 95% CI for RPU lift
import numpy as np
np.random.seed(7)
# Reuse all_users from Example 1
ctrl = all_users[all_users.variant=="control"]["revenue"].to_numpy()
varn = all_users[all_users.variant=="variant"]["revenue"].to_numpy()
B = 5000
diffs = []
for _ in range(B):
    diff = (np.mean(np.random.choice(varn, size=varn.size, replace=True))
            - np.mean(np.random.choice(ctrl, size=ctrl.size, replace=True)))
    diffs.append(diff)
ci = (np.percentile(diffs, 2.5), np.percentile(diffs, 97.5))
print("Diff in RPU (variant - control) 95% CI:", ci)
On the toy data the point estimate is positive (about +15 in revenue per user), but with only five users the bootstrap interval is very wide and spans zero; exact endpoints depend on the seed. Real experiments with thousands of users give far tighter intervals.
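The explicit loop keeps the resampling logic readable; for larger B, the same bootstrap can be vectorized with NumPy's Generator API, which is typically much faster:
rng = np.random.default_rng(7)
# Each row is one bootstrap resample; mean over axis=1 gives B resampled means per arm
boot_var = rng.choice(varn, size=(B, varn.size), replace=True).mean(axis=1)
boot_ctl = rng.choice(ctrl, size=(B, ctrl.size), replace=True).mean(axis=1)
diffs_vec = boot_var - boot_ctl
print("Vectorized 95% CI:", np.percentile(diffs_vec, [2.5, 97.5]))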
Example 3: SRM and guardrails
from scipy.stats import chisquare
counts = summary.set_index("variant")["users"]
# Expected equal split between 2 arms
observed = counts.to_numpy()
expected = np.repeat(observed.sum()/len(observed), len(observed))
chi2, p = chisquare(observed, f_exp=expected)
print({"chi2": chi2, "p_value": p})
# Guardrail example: error rate per user
# Suppose we have per-user error flags in a column 'had_error' (0/1)
# error_summary = all_users.groupby('variant')['had_error'].mean()
# Compare variant vs control and ensure it stays within agreed threshold.
A very small p-value (e.g., < 0.01) suggests SRM — re-check randomization, data collection, or filters.
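To make the guardrail concrete: assuming a 0/1 had_error column has been added to all_users (it is not in the toy data, so this is a sketch only), the check could look like:
# Hypothetical guardrail: per-user error rate must not rise by more than 1 percentage point
error_rate = all_users.groupby("variant")["had_error"].mean()
delta = error_rate.get("variant", 0.0) - error_rate.get("control", 0.0)
if delta > 0.01:
    print(f"Guardrail breached: error rate up {delta:+.2%}")
else:
    print(f"Guardrail OK: error rate change {delta:+.2%}")
The 1-percentage-point threshold is illustrative; agree on guardrail thresholds with stakeholders before the experiment starts.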
Exercises you will complete here
These mirror the tasks in the Exercises section below. Do them in order.
- Exercise ex1: Build a reproducible A/B pipeline in pandas with exposure→outcome join, metrics, SRM, and bootstrap CI.
- Exercise ex2: Add CUPED variance reduction and a guardrail check to the pipeline (a CUPED sketch follows the checklist below).
- [ ] I used the first exposure per user and enforced a fixed window.
- [ ] My per-user dataset has one row per user with metric columns.
- [ ] I computed variant-level aggregates and validated SRM.
- [ ] I produced at least one CI for a key metric.
- [ ] I documented assumptions and any exclusions.
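For the CUPED part of Exercise ex2, the adjustment removes the variance explained by a pre-experiment covariate. A minimal sketch, assuming a hypothetical pre_revenue column (each user's revenue from before exposure, which is not in the toy data):
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    # y: in-experiment metric per user; x: pre-experiment covariate per user
    # theta = cov(x, y) / var(x); subtracting theta * (x - mean(x)) keeps the
    # mean of y but removes the variance explained by the covariate
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Usage sketch (pre_revenue is hypothetical):
# all_users["revenue_cuped"] = cuped_adjust(
#     all_users["revenue"].to_numpy(), all_users["pre_revenue"].to_numpy()
# )
# Then bootstrap the CI on revenue_cuped exactly as in Example 2.
Estimate theta on the pooled data from both arms, and only use covariates measured before exposure so the treatment cannot affect them.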
Common mistakes and self-checks
- Mistake: Counting events before exposure. Fix: Filter events where event_ts ≥ exposure_ts.
- Mistake: Multiple assignments per user. Fix: Keep first valid exposure per user.
- Mistake: Using event-level averages when metric is user-level. Fix: Aggregate per user first.
- Mistake: Ignoring SRM. Fix: Always test assignment counts with a chi-square.
- Mistake: Window drift. Fix: Use a consistent, explicit window (e.g., 7 days) for all users.
- Mistake: Outlier domination. Fix: Winsorize extreme values, or report the median and a rank-based sensitivity check (e.g., Mann–Whitney) alongside the mean; a winsorization sketch follows this list.
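A minimal winsorization sketch on the per-user revenue from Example 1 (the pooled 99th-percentile cap is an illustrative choice, not a standard):
# Cap per-user revenue at the pooled 99th percentile before computing RPU
cap = all_users["revenue"].quantile(0.99)
all_users["revenue_w"] = all_users["revenue"].clip(upper=cap)
print(all_users.groupby("variant")["revenue_w"].mean())
Report both the capped and uncapped means so readers can see how much the outliers move the result.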
Practical projects
- Reusable Experiment Analyzer: a single notebook that loads CSVs, runs the pipeline, and outputs a clean report.
- Metric Registry: a small Python module with functions to compute standard metrics safely (conversion, ARPU, guardrails); one possible shape is sketched after this list.
- SRM Monitor: a short script that prints SRM diagnostics for any experiment date range.
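A sketch of what such a registry might look like, reusing the per-user frame produced by the pipeline (function names are illustrative):
import pandas as pd

def conversion_rate(per_user: pd.DataFrame) -> pd.Series:
    # Share of users with at least one conversion, per variant
    return per_user.groupby("variant")["conversions"].mean()

def arpu(per_user: pd.DataFrame) -> pd.Series:
    # Average revenue per exposed user (zeros included), per variant
    return per_user.groupby("variant")["revenue"].mean()

def guardrail_delta(per_user: pd.DataFrame, col: str) -> float:
    # Difference in a 0/1 guardrail metric, variant minus control
    rates = per_user.groupby("variant")[col].mean()
    return float(rates.get("variant", 0.0) - rates.get("control", 0.0))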
Mini challenge: Debug an analysis
You see a 20% lift in ARPU, but the SRM p-value is 0.0005. What should you do first?
Suggested approach
- Pause decision-making. Investigate assignment logs and filters.
- Check if one arm has more blocked/test traffic or a date gap.
- Re-run with corrected filters; only interpret metrics once SRM looks healthy.
Learning path
- Master pandas joins, groupby, and time windows.
- Learn bootstrap CIs and ratio metrics best practices.
- Add SRM checks and guardrails to every analysis.
- Introduce CUPED for high-variance revenue metrics.
- Create a one-click template for new experiments.
Who this is for
- Product Analysts and Data Analysts running A/B tests.
- PMs learning to interpret experiment results.
- Engineers validating experiment quality.
Prerequisites
- Comfort with Python and pandas (DataFrame operations, groupby, merge).
- Basic statistics (mean, proportion, confidence intervals).
- Understanding of A/B testing concepts (control vs variant, exposure, outcome).
Next steps
- Extend the pipeline to multi-metric reporting with clear guardrails.
- Add a variance-reduction switch (CUPED on/off) with a single parameter.
- Create a short decision summary template: effect, CI, risk, recommendation.
Take the quick test