
Experiment Analysis Pipelines

Learn Experiment Analysis Pipelines for free with explanations, exercises, and a quick test (for Product Analysts).

Published: December 22, 2025 | Updated: December 22, 2025

Why this matters

As a Product Analyst, you will repeatedly evaluate A/B and multivariate experiments. A solid pandas-based pipeline lets you go from raw logs to trustworthy decisions quickly and consistently.

  • Answer product questions: Did the new onboarding increase activation?
  • Guardrail checks: Did error rates or latency regress?
  • Consistent decisions: Reuse a vetted pipeline instead of one-off notebooks.

Concept explained simply

An experiment analysis pipeline is a repeatable set of steps that transforms raw exposure and event data into clean, comparable metrics with statistical uncertainty. Think: same recipe, different ingredients.

Mental model

Picture a conveyor belt:

  1. Specify: hypothesis, unit of randomization, success metrics, guardrails.
  2. Ingest: exposures (who saw what, when) + events (what they did, when).
  3. Clean: de-duplicate users, enforce exposure windows, filter bots/tests.
  4. Join: align events to each user’s first exposure.
  5. Aggregate: compute per-user metrics, then group-level summaries.
  6. Validate: SRM and sanity checks.
  7. Infer: effect sizes, confidence intervals, variance reduction if needed.
  8. Report: readable tables and a one-line recommendation.

The standard pipeline (step-by-step)

  1. Define the unit and exposure event (e.g., user-level; exposure when assignment cookie set).
  2. Load data into pandas: exposures and behavioral events.
  3. Keep the first valid exposure per user; exclude test traffic and bots if flagged.
  4. Join outcomes within a fixed window (e.g., 7 days after exposure_ts).
  5. Create per-user features: converted flag, revenue, sessions, errors.
  6. Aggregate by variant: sample size, conversion rate, revenue per user, guardrails.
  7. Run validity checks: SRM (sample imbalance), assignment leakage, date overlaps.
  8. Estimate uncertainty: bootstrap or analytical CIs; optionally apply CUPED.
  9. Summarize: lift vs control, CI, practical interpretation.
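
In pandas, these steps fit in one small function. Below is a minimal sketch using the same column names as the worked examples that follow; the function name, signature, and 7-day default are illustrative, not a fixed API.

import pandas as pd

def run_ab_pipeline(exposures: pd.DataFrame, events: pd.DataFrame,
                    window_days: int = 7) -> pd.DataFrame:
    """Sketch: first exposure -> windowed join -> per-user metrics -> variant summary."""
    # Keep the first exposure per user
    first = (exposures.sort_values(["user_id", "exposure_ts"])
                      .drop_duplicates("user_id", keep="first"))

    # Join events to exposures and keep only events inside the window
    joined = events.merge(first, on="user_id", how="inner")
    in_window = joined[
        (joined["event_ts"] >= joined["exposure_ts"]) &
        (joined["event_ts"] <= joined["exposure_ts"] + pd.Timedelta(days=window_days))
    ]

    # Per-user metrics, with one row per exposed user (zeros if no events in window)
    per_user = in_window.groupby(["user_id", "variant"], as_index=False).agg(
        revenue=("amount", "sum"),
        converted=("event", lambda s: int((s == "purchase").any())),
    )
    per_user = first[["user_id", "variant"]].merge(per_user, on=["user_id", "variant"], how="left")
    per_user[["revenue", "converted"]] = per_user[["revenue", "converted"]].fillna(0)

    # Variant-level summary
    return per_user.groupby("variant", as_index=False).agg(
        users=("user_id", "nunique"),
        conversions=("converted", "sum"),
        cr=("converted", "mean"),
        rpu=("revenue", "mean"),
    )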

Worked examples

Example 1: Build an exposure→outcome dataset

# Sample exposures
# user_id,variant,exposure_ts
# 1,control,2023-01-01 10:00:00
# 2,control,2023-01-01 10:05:00
# 3,variant,2023-01-01 11:00:00
# 4,variant,2023-01-02 09:00:00
# 5,control,2023-01-02 12:00:00
# 3,control,2023-01-03 12:00:00  # duplicate user, keep first exposure only

# Sample events
# user_id,event,amount,event_ts
# 1,view,0,2023-01-01 10:01:00
# 1,purchase,30,2023-01-02 08:00:00
# 2,view,0,2023-01-01 10:06:00
# 3,view,0,2023-01-01 11:05:00
# 3,purchase,20,2023-01-08 12:00:00  # outside 7-day window by ~1 hour
# 4,view,0,2023-01-02 09:10:00
# 4,purchase,50,2023-01-02 10:00:00
# 5,view,0,2023-01-10 12:00:00

import pandas as pd
from io import StringIO

exposures_csv = StringIO("""user_id,variant,exposure_ts
1,control,2023-01-01 10:00:00
2,control,2023-01-01 10:05:00
3,variant,2023-01-01 11:00:00
4,variant,2023-01-02 09:00:00
5,control,2023-01-02 12:00:00
3,control,2023-01-03 12:00:00
""")

events_csv = StringIO("""user_id,event,amount,event_ts
1,view,0,2023-01-01 10:01:00
1,purchase,30,2023-01-02 08:00:00
2,view,0,2023-01-01 10:06:00
3,view,0,2023-01-01 11:05:00
3,purchase,20,2023-01-08 12:00:00
4,view,0,2023-01-02 09:10:00
4,purchase,50,2023-01-02 10:00:00
5,view,0,2023-01-10 12:00:00
""")

exp = pd.read_csv(exposures_csv, parse_dates=["exposure_ts"]).sort_values(["user_id","exposure_ts"])
exp_first = exp.drop_duplicates("user_id", keep="first")

evt = pd.read_csv(events_csv, parse_dates=["event_ts"]) 

# Join events to exposures within a 7-day window after exposure
merged = evt.merge(exp_first, on="user_id", how="inner")
within_win = merged[(merged["event_ts"] >= merged["exposure_ts"]) &
                    (merged["event_ts"] <= merged["exposure_ts"] + pd.Timedelta(days=7))]

# Per-user features
user_metrics = within_win.groupby(["user_id","variant"], as_index=False).agg(
    revenue=("amount","sum"),
    conversions=("event", lambda s: int((s=="purchase").any()))
)

# Users with no events in window should still appear (0 outcomes)
all_users = exp_first[["user_id","variant"]].merge(user_metrics, on=["user_id","variant"], how="left")
all_users[["revenue"]] = all_users[["revenue"]].fillna(0)
all_users[["conversions"]] = all_users[["conversions"]].fillna(0).astype(int)

summary = all_users.groupby("variant").agg(
    users=("user_id","nunique"),
    conversions=("conversions","sum"),
    cr=("conversions", lambda x: x.mean()),
    rpu=("revenue","mean")
).reset_index()

print(summary)

Expected conversion rate control ≈ 0.3333 (1/3), variant ≈ 0.5 (1/2); revenue per user control ≈ 10.0, variant ≈ 25.0.

Example 2: Bootstrap a 95% CI for RPU lift

import numpy as np
np.random.seed(7)

# Reuse all_users from Example 1
ctrl = all_users[all_users.variant=="control"]["revenue"].to_numpy()
varn = all_users[all_users.variant=="variant"]["revenue"].to_numpy()

B = 5000
diffs = []
for _ in range(B):
    # Resample each arm with replacement and take the difference in mean revenue per user
    diff = np.mean(np.random.choice(varn, size=varn.size, replace=True)) - \
           np.mean(np.random.choice(ctrl, size=ctrl.size, replace=True))
    diffs.append(diff)
ci = (np.percentile(diffs, 2.5), np.percentile(diffs, 97.5))
print("Diff in RPU (variant - control) 95% CI:", ci)

On this toy data the point estimate is positive (about +15 in RPU), but with only five users the bootstrap CI is very wide and will typically include zero; exact bounds depend on the seed and library versions.
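
The step list above also mentions analytical CIs. For the conversion-rate difference, a normal-approximation interval is a common choice; here is a minimal sketch reusing summary from Example 1 (with only five users the approximation is not trustworthy; it only shows the mechanics).

import numpy as np

# Conversion rates and sample sizes by variant (from Example 1's summary table)
s = summary.set_index("variant")
p_c, n_c = s.loc["control", "cr"], s.loc["control", "users"]
p_v, n_v = s.loc["variant", "cr"], s.loc["variant", "users"]

# Normal-approximation 95% CI for the difference in proportions
diff = p_v - p_c
se = np.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
print("Diff in conversion rate (variant - control), 95% CI:",
      (diff - 1.96 * se, diff + 1.96 * se))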

Example 3: SRM and guardrails

from scipy.stats import chisquare

counts = summary.set_index("variant")["users"]
# Expected equal split between 2 arms
observed = counts.to_numpy()
expected = np.repeat(observed.sum()/len(observed), len(observed))
chi2, p = chisquare(observed, f_exp=expected)
print({"chi2": chi2, "p_value": p})

# Guardrail example: error rate per user
# Suppose we have per-user error flags in a column 'had_error' (0/1)
# error_summary = all_users.groupby('variant')['had_error'].mean()
# Compare variant vs control and ensure it stays within agreed threshold.

A very small p-value (e.g., < 0.01) suggests SRM — re-check randomization, data collection, or filters.
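
On the guardrail side, the sample data has no error events, so the sketch below simulates a 0/1 had_error flag per user purely to show the comparison; in a real analysis this flag would be computed from logged errors during the per-user aggregation step.

rng = np.random.default_rng(0)
demo = all_users.copy()
demo["had_error"] = rng.binomial(1, 0.05, size=len(demo))  # synthetic flag for illustration only

err = demo.groupby("variant")["had_error"].agg(rate="mean", n="size")
err_diff = err.loc["variant", "rate"] - err.loc["control", "rate"]

# Normal-approximation 95% CI for the difference in error rates
se = np.sqrt(sum(err.loc[v, "rate"] * (1 - err.loc[v, "rate"]) / err.loc[v, "n"]
                 for v in ["control", "variant"]))
print("Error-rate diff (variant - control):", err_diff,
      "95% CI:", (err_diff - 1.96 * se, err_diff + 1.96 * se))
# Flag the guardrail if the upper bound exceeds the agreed regression threshold.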

Exercises you will complete here

These mirror the tasks in the Exercises section below. Do them in order.

  1. Exercise ex1: Build a reproducible A/B pipeline in pandas with exposure→outcome join, metrics, SRM, and bootstrap CI.
  2. Exercise ex2: Add CUPED variance reduction and a guardrail check to the pipeline.
Self-check as you work:
  • [ ] I used the first exposure per user and enforced a fixed window.
  • [ ] My per-user dataset has one row per user with metric columns.
  • [ ] I computed variant-level aggregates and validated SRM.
  • [ ] I produced at least one CI for a key metric.
  • [ ] I documented assumptions and any exclusions.

Common mistakes and self-checks

  • Mistake: Counting events before exposure. Fix: Filter events where event_ts ≥ exposure_ts.
  • Mistake: Multiple assignments per user. Fix: Keep first valid exposure per user.
  • Mistake: Using event-level averages when metric is user-level. Fix: Aggregate per user first.
  • Mistake: Ignoring SRM. Fix: Always test assignment counts with a chi-square.
  • Mistake: Window drift. Fix: Use a consistent, explicit window (e.g., 7 days) for all users.
  • Mistake: Outlier domination. Fix: Winsorize or report median/signed-rank sensitivity in addition to mean.
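
For the outlier point above, a minimal winsorization sketch on the per-user table from Example 1 (the 99th-percentile cap is an arbitrary illustrative choice; pick and pre-register a cap that fits your metric):

# Cap per-user revenue at the 99th percentile, then recompute RPU as a sensitivity check
cap = all_users["revenue"].quantile(0.99)
all_users["revenue_w"] = all_users["revenue"].clip(upper=cap)
print(all_users.groupby("variant")["revenue_w"].mean())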

Practical projects

  • Reusable Experiment Analyzer: a single notebook that loads CSVs, runs the pipeline, and outputs a clean report.
  • Metric Registry: a small Python module with functions to compute standard metrics safely (conversion, ARPU, guardrails).
  • SRM Monitor: a short script that prints SRM diagnostics for any experiment date range.
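
As a starting point for the SRM Monitor idea, here is a hedged sketch reusing the exposure schema from the examples; the function name, the date-range arguments, and the equal-split assumption are all illustrative:

from scipy.stats import chisquare

def srm_report(exposures: pd.DataFrame, start: str, end: str) -> dict:
    """Chi-square SRM check on first exposures within [start, end], assuming an equal expected split."""
    first = (exposures.sort_values(["user_id", "exposure_ts"])
                      .drop_duplicates("user_id", keep="first"))
    in_range = first[(first["exposure_ts"] >= start) & (first["exposure_ts"] <= end)]
    counts = in_range["variant"].value_counts().sort_index()
    expected = [counts.sum() / len(counts)] * len(counts)
    chi2, p = chisquare(counts.to_numpy(), f_exp=expected)
    return {"counts": counts.to_dict(), "chi2": float(chi2), "p_value": float(p)}

# Example: srm_report(exp, "2023-01-01", "2023-01-07")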

Mini challenge: Debug an analysis

You see a 20% lift in ARPU, but the SRM p-value is 0.0005. What should you do first?

Suggested approach
  • Pause decision-making. Investigate assignment logs and filters.
  • Check if one arm has more blocked/test traffic or a date gap.
  • Re-run with corrected filters; only interpret metrics once SRM looks healthy.

Learning path

  1. Master pandas joins, groupby, and time windows.
  2. Learn bootstrap CIs and ratio metrics best practices.
  3. Add SRM checks and guardrails to every analysis.
  4. Introduce CUPED for high-variance revenue metrics.
  5. Create a one-click template for new experiments.

Who this is for

  • Product Analysts and Data Analysts running A/B tests.
  • PMs learning to interpret experiment results.
  • Engineers validating experiment quality.

Prerequisites

  • Comfort with Python and pandas (DataFrame operations, groupby, merge).
  • Basic statistics (mean, proportion, confidence intervals).
  • Understanding of A/B testing concepts (control vs variant, exposure, outcome).

Next steps

  • Extend the pipeline to multi-metric reporting with clear guardrails.
  • Add a variance-reduction switch (CUPED on/off) with a single parameter.
  • Create a short decision summary template: effect, CI, risk, recommendation.
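
The CUPED switch mentioned above can be a small helper. A hedged sketch, assuming each user has a pre-experiment covariate (called pre_revenue here; it is not in the sample data and would come from historical logs):

def cuped_adjust(df: pd.DataFrame, metric: str = "revenue",
                 covariate: str = "pre_revenue") -> pd.Series:
    """CUPED adjustment: y_adj = y - theta * (x - mean(x)), theta = cov(x, y) / var(x).
    Assumes the covariate is measured before exposure and has nonzero variance."""
    x = df[covariate].to_numpy(dtype=float)
    y = df[metric].to_numpy(dtype=float)
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return df[metric] - theta * (df[covariate] - x.mean())

# Usage (hypothetical): all_users["revenue_cuped"] = cuped_adjust(all_users)
# then aggregate and bootstrap on 'revenue_cuped' exactly as before.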

Take the quick test

The quick test below is available to everyone. If you’re logged in, your progress will be saved.

Practice Exercises

2 exercises to complete

Instructions

Using the sample CSVs below, build a pipeline that:

  1. Loads exposures and events.
  2. Keeps the first exposure per user.
  3. Joins outcomes within a 7-day window after exposure.
  4. Creates per-user metrics: converted (0/1), revenue (sum of purchase amounts).
  5. Aggregates by variant: users, conversions, conversion rate, revenue per user.
  6. Runs an SRM chi-square test on sample sizes.
  7. Bootstraps a 95% CI for the difference in revenue per user (variant - control).
Exposures CSV
user_id,variant,exposure_ts
1,control,2023-01-01 10:00:00
2,control,2023-01-01 10:05:00
3,variant,2023-01-01 11:00:00
4,variant,2023-01-02 09:00:00
5,control,2023-01-02 12:00:00
3,control,2023-01-03 12:00:00
Events CSV
user_id,event,amount,event_ts
1,view,0,2023-01-01 10:01:00
1,purchase,30,2023-01-02 08:00:00
2,view,0,2023-01-01 10:06:00
3,view,0,2023-01-01 11:05:00
3,purchase,20,2023-01-08 12:00:00
4,view,0,2023-01-02 09:10:00
4,purchase,50,2023-01-02 10:00:00
5,view,0,2023-01-10 12:00:00
Expected Output
Variant-level summary similar to: control users=3, conversions=1, cr≈0.3333, rpu≈10.0; variant users=2, conversions=1, cr=0.5, rpu=25.0. The SRM p-value should not flag an issue on this toy data. The 95% bootstrap CI for the RPU difference is centered on a positive estimate (≈15), but it will be wide and may include zero on a sample this small (exact numbers vary).

Experiment Analysis Pipelines — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

