Who this is for
Product Analysts and data-curious PMs who run experiments and need to understand how different user groups respond to changes. If you already know A/B test basics, this lesson helps you find where the impact truly lives.
Prerequisites
- Comfort with conversion rate, lift, confidence intervals, and p-values
- Basic understanding of experiment design (control vs. variant, randomization)
- Ability to compute metrics per group (e.g., SQL or spreadsheet skills)
Why this matters
In real products, user behavior varies by device, country, lifecycle stage, acquisition channel, and more. An experiment may be neutral overall but highly positive for a key segment—or vice versa. Segment-level analysis helps you:
- Find where the lift is coming from (heterogeneous treatment effects)
- Detect adverse effects hidden in averages (avoid shipping regressions)
- Decide targeted rollouts (e.g., ship to iOS first)
- Diagnose data issues like SRM (Sample Ratio Mismatch) within segments
Concept explained simply
Segment-level analysis means splitting users into meaningful groups and measuring the experiment effect within each group. You still follow the same steps as overall analysis—compute the metric for control and variant, estimate lift and uncertainty—but you do it repeatedly per segment.
Mental model
Think of your overall result as a weighted average of segment results. Large segments with small lift can outweigh small segments with big lift. If segment behavior differs a lot, the average can mislead you (Simpson’s paradox).
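To make that concrete, here is a tiny sketch with made-up segment numbers (all counts below are hypothetical): two segments that move in opposite directions can blend into a flat overall result.

```python
# All numbers below are hypothetical. Two segments move in opposite
# directions, and the overall result is their size-weighted blend.
segments = {
    # name: (control_users, control_conversions, variant_users, variant_conversions)
    "mobile":  (10_000, 1_000, 10_000, 1_150),   # 10.0% -> 11.5%
    "desktop": (30_000, 4_500, 30_000, 4_350),   # 15.0% -> 14.5%
}

for name, (cu, cc, vu, vc) in segments.items():
    lift = (vc / vu) / (cc / cu) - 1
    print(f"{name}: control {cc / cu:.1%}, variant {vc / vu:.1%}, lift {lift:+.1%}")

# Pool the segments to get the overall numbers.
cu_all, cc_all, vu_all, vc_all = (sum(vals) for vals in zip(*segments.values()))
overall_lift = (vc_all / vu_all) / (cc_all / cu_all) - 1
print(f"overall lift: {overall_lift:+.1%}")  # flat overall despite a +15% mobile win
```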
Good segments to consider
- Platform: iOS, Android, Web
- Lifecycle: New vs. Returning users, Tenure buckets
- Acquisition: Paid vs. Organic, Channel
- Geography: Country, Region, Language
- User attributes: Price tier, Device class, App version
- Context: Time-of-day, Day-of-week, Seasonality
When to segment (and how much)
- Pre-specify a short list of primary segments tied to your hypothesis (e.g., “This navigation change likely helps mobile more”).
- Limit to a manageable number to avoid power loss and false positives.
- Use segments that existed before the experiment (no post-hoc cherry-picking).
- If exploring many segments, treat findings as directional and confirm with follow-up tests.
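A rough back-of-envelope on why long segment lists inflate false positives: if each segment comparison independently uses a 5% significance level, the chance of at least one spurious "win" grows quickly with the number of segments (the independence assumption is a simplification).

```python
# Rough illustration: probability of at least one false positive
# across k independent segment comparisons at alpha = 0.05.
alpha = 0.05
for k in (1, 3, 5, 10, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>2} segments: P(at least one false positive) ~ {p_any:.0%}")
# 1 -> 5%, 5 -> ~23%, 10 -> ~40%, 20 -> ~64%
```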
Workflow: step-by-step
- Define segments and metrics upfront (primary vs. exploratory).
- Check data health per segment: sample balance and SRM.
- Compute per-segment metrics for control and variant.
- Estimate lift and uncertainty per segment (CI or p-value); see the sketch after this list.
- Look for consistent patterns (e.g., mobile segments all positive).
- Make a decision: ship globally, ship to a segment, iterate, or stop.
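A minimal sketch of steps 3 and 4 for a binary conversion metric, assuming a simple normal-approximation confidence interval on the difference in rates; the function name and layout are illustrative choices, not a prescribed method. It is run here on the iOS numbers from Example 1 below.

```python
import math

def segment_readout(c_users, c_conv, v_users, v_conv, z=1.96):
    """Per-segment conversion, absolute difference with a ~95% CI, and relative lift.

    Normal-approximation CI on the difference in proportions: reasonable for
    large segments, too optimistic for tiny ones.
    """
    p_c = c_conv / c_users
    p_v = v_conv / v_users
    diff = p_v - p_c
    se = math.sqrt(p_c * (1 - p_c) / c_users + p_v * (1 - p_v) / v_users)
    ci = (diff - z * se, diff + z * se)
    rel_lift = diff / p_c
    return p_c, p_v, diff, ci, rel_lift

# iOS numbers from Example 1: 10,000 / 1,000 in control vs. 10,500 / 1,190 in variant
p_c, p_v, diff, ci, rel = segment_readout(10_000, 1_000, 10_500, 1_190)
print(f"control {p_c:.2%}, variant {p_v:.2%}, diff {diff:+.2%} "
      f"(95% CI {ci[0]:+.2%} to {ci[1]:+.2%}), relative lift {rel:+.1%}")
```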
Tip: Guardrails per segment
Track critical guardrails (e.g., crash rate, latency) by segment to catch regressions even if primary metrics look fine.
Worked examples
Example 1: Conversion by device
Metric: Signup conversion. Hypothesis: New layout helps small screens more.
- iOS: Control 10,000 users, 1,000 signups (10.0%); Variant 10,500 users, 1,190 signups (11.33%) → Lift ≈ +13.3%
- Android: Control 12,000 users, 1,560 signups (13.0%); Variant 11,800 users, 1,416 signups (12.0%) → Lift ≈ −7.7%
- Overall: Control 22,000 users, 2,560 signups (11.64%); Variant 22,300 users, 2,606 signups (11.69%) → Overall lift ≈ +0.4%
Insight: Overall looks flat, but iOS is clearly up and Android down. Consider shipping to iOS only while iterating on Android UX.
Example 2: Revenue per user by acquisition channel
Metric: Revenue/user (RPU). Hypothesis: Paywall tweak helps Organic more than Paid.
- Organic: Control $1.80; Variant $2.00 → +11.1%
- Paid: Control $2.40; Variant $2.36 → −1.7%
- Overall depends on mix: at a 70/30 Organic/Paid split, blended RPU goes from $1.98 (control) to about $2.11 (variant), roughly +6.5%; if Paid dominates the mix, the effect shrinks toward zero.
Decision: If Paid users are sensitive to pricing, roll out to Organic first; run a tailored variant for Paid.
Example 3: Retention by lifecycle stage
Metric: D7 retention. Hypothesis: New onboarding helps new users more.
- New users: Control 24%; Variant 27% → +3pp
- Returning users: Control 44%; Variant 43% → −1pp
The weighted overall effect will depend on the share of new vs. returning users; for example, if new users are 40% of the experiment population, the blended effect is roughly 0.4 × (+3pp) + 0.6 × (−1pp) ≈ +0.6pp. If your growth strategy focuses on acquisition, the +3pp for new users may justify shipping with a guardrail for returning users.
Quality checks before trusting segment results
- SRM by segment: Are control/variant allocations close to expected (e.g., 50/50)? Large deviations can indicate tracking or assignment issues.
- Power: Tiny segments may show noisy swings. Prefer confidence intervals and directionality over just p-values in small segments (see the sketch after this list).
- Multiple looks: If you peek at many segments, treat results as exploratory and confirm.
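To make the power point concrete, here is a rough sketch of the smallest absolute lift a segment of a given size could reliably detect, using the standard two-proportion approximation at 95% confidence and 80% power (the 10% baseline and the segment sizes are hypothetical).

```python
import math

# Rough minimum detectable (absolute) effect for a conversion metric,
# 95% confidence / 80% power, equal users per arm. Illustrative only.
def approx_mde(baseline_rate, users_per_arm, z_alpha=1.96, z_power=0.84):
    se = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_arm)
    return (z_alpha + z_power) * se

baseline = 0.10  # hypothetical 10% baseline conversion
for n in (1_000, 5_000, 20_000, 100_000):
    mde = approx_mde(baseline, n)
    print(f"{n:>7} users/arm: can reliably detect ~{mde:.2%} absolute "
          f"({mde / baseline:.0%} relative)")
```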
Quick SRM self-check
Compare the share of each segment in control vs. variant. If the difference is >2 percentage points for a big segment, investigate randomization and event logging.
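A more formal version of the per-segment allocation check (control vs. variant counts against the intended 50/50 split) is a chi-square goodness-of-fit test. A minimal sketch, assuming SciPy is available; the counts and the strict p < 0.001 threshold are illustrative.

```python
from scipy.stats import chisquare  # assumes SciPy is installed

# SRM check within one segment: did we get the allocation we expected?
# Hypothetical counts for a segment with an intended 50/50 split.
control_users, variant_users = 3_500, 3_100
total = control_users + variant_users
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare([control_users, variant_users], f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.001:  # a common, deliberately strict SRM threshold
    print("Likely SRM: investigate assignment and event logging before trusting results.")
```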
Common mistakes and how to self-check
- Over-segmentation: Splitting too thinly reduces power. Self-check: Do your segments each have enough users to stabilize metrics?
- Post-hoc storytelling: Only highlighting favorable segments. Self-check: Which segments were pre-registered vs. exploratory?
- Ignoring guardrails: Shipping wins that hurt stability. Self-check: Review error/crash/latency metrics per segment.
- Simpson’s paradox: Trusting the overall average. Self-check: Compare segment effects to overall; if mixed signs, never ship blindly.
- Mix shift: Comparing segments across time windows where the traffic mix changed. Self-check: Use the same test window and consider weighting.
Exercises
Try the dataset below. Compute per-segment conversion, lift, SRM check, and a rollout recommendation.
- Segments: iOS_New, iOS_Returning, Android_New, Android_Returning
Data:
- iOS_New: Control 4,000 users, 360 signups; Variant 4,200 users, 504 signups
- iOS_Returning: Control 3,000 users, 420 signups; Variant 3,100 users, 403 signups
- Android_New: Control 3,500 users, 350 signups; Variant 3,100 users, 279 signups
- Android_Returning: Control 2,500 users, 325 signups; Variant 2,800 users, 336 signups
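If you would rather work in a notebook than a spreadsheet, here is the same dataset as a starter structure; the layout and variable names are just one convenient option, and the lift, overall, and SRM steps are left for you to fill in.

```python
# Exercise dataset: (control_users, control_signups, variant_users, variant_signups)
data = {
    "iOS_New":           (4_000, 360, 4_200, 504),
    "iOS_Returning":     (3_000, 420, 3_100, 403),
    "Android_New":       (3_500, 350, 3_100, 279),
    "Android_Returning": (2_500, 325, 2_800, 336),
}

# Starting point: conversion rates per segment (compute lift, overall, and SRM yourself).
for segment, (cu, cs, vu, vs) in data.items():
    print(f"{segment}: control {cs / cu:.1%}, variant {vs / vu:.1%}")
```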
- Compute conversion rates and lifts per segment.
- Compute overall conversion and lift.
- SRM check: Compare each segment’s share of users in control vs. variant. Flag any segment with >2pp difference.
- Recommend: Global ship, segment-only ship, or iterate. Justify with numbers.
- Checklist:
- Per-segment conversion and lift calculated
- Overall numbers calculated
- SRM check done per segment
- Decision noted with reasoning
Compare your work with the solution in the exercise section below.
Practical projects
- Build a reusable notebook or spreadsheet that takes segment-level inputs and outputs lift, CIs, and SRM checks per segment.
- Create a segment-aware dashboard for your growth or monetization experiments, including guardrails.
- Run a follow-up targeted experiment on the best-performing segment to validate and size the opportunity.
Mini challenge
Your test shows +0.5% overall revenue/user, but Paid traffic is −3% and Organic is +2%. Paid makes up 30% of traffic and its customers have higher ARPU. What do you do?
A good answer:
Ship to Organic only; run a tailored variant for Paid. Monitor guardrails for Paid closely. Consider pricing sensitivity or creative changes for Paid users, then re-test before global rollout.
Learning path
- Before this: A/B test fundamentals (metrics, power, bias)
- This lesson: Finding heterogeneous effects across segments
- Next: Multiple testing control and confirmatory follow-ups
Next steps
- Apply this workflow to your last experiment’s raw data.
- Pre-register 3–5 meaningful segments for your next test.
- Take the quick test below to cement your understanding.