Why this matters
Cohort tables show how groups of users (acquired in the same period) behave over time. As a Product Analyst, you use cohort analysis to answer questions like:
- How well do users we acquired in January retain in the next 3 months?
- Which campaign delivers higher 2nd-month repeat purchase rate?
- What is the cumulative revenue per cohort after 90 days?
These insights inform product changes, lifecycle messaging, and budget allocation.
Who this is for
- Product Analysts and Data Analysts who report on retention, revenue, and repeat behavior.
- PMs who need quick, reliable retention tables without BI tool overhead.
Prerequisites
- Comfort with pandas DataFrames: grouping, merging, pivot_table
- Basic date operations in pandas (dt.to_period, dt.to_timestamp)
- Know how to interpret retention and ARPU
Concept explained simply
A cohort is a group of users who share a starting moment (e.g., first order month). A cohort table shows how that group behaves in subsequent periods (month 1, 2, 3...).
Mental model
Think of each cohort as a classroom that starts in Month 0. "Period 1" is their first month after starting, "Period 2" is their second, and so on. Each period answers: how many classmates are still active (retention), how much they spent (revenue), or how many repeated (repeat rate).
Data you need
- user_id
- order_date (or event_date for activity)
- amount (optional, for revenue cohorts)
We derive per-user cohort start (first order month) and then compare each order's month to that start to compute a cohort_index (period number).
Step-by-step build
- Normalize dates to month buckets.
- Assign a cohort month per user (their first order month).
- Compute cohort_index = months since cohort month + 1.
- Aggregate with groupby to counts (users) or sums (revenue).
- pivot_table to wide format and format as rates or currency.
Helper code you will reuse
import pandas as pd
def month_floor(s):
return s.dt.to_period('M').dt.to_timestamp()
def add_cohort_columns(df, user_col='user_id', date_col='order_date'):
df = df.copy()
df['OrderMonth'] = month_floor(df[date_col])
first_order = (
df.groupby(user_col)[date_col]
.min()
.rename('CohortMonth')
.to_frame()
)
first_order['CohortMonth'] = month_floor(first_order['CohortMonth'])
df = df.merge(first_order, on=user_col, how='left')
# cohort index is 1-based number of months since cohort start
df['cohort_index'] = (
(df['OrderMonth'].dt.year - df['CohortMonth'].dt.year) * 12
+ (df['OrderMonth'].dt.month - df['CohortMonth'].dt.month)
+ 1
)
return dfWorked examples
Example 1 — Acquisition and retention counts
Data:
import pandas as pd
from io import StringIO
csv = StringIO('''user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50
''')
df = pd.read_csv(csv, parse_dates=['signup_date','order_date'])
df = add_cohort_columns(df, 'user_id', 'order_date')
# cohort sizes (unique users with period 1 activity)
cohort_counts = (df.groupby(['CohortMonth','cohort_index'])['user_id']
.nunique()
.reset_index())
cohort_sizes = (cohort_counts[cohort_counts['cohort_index']==1]
.set_index('CohortMonth')['user_id'])
retention = (cohort_counts
.pivot(index='CohortMonth', columns='cohort_index', values='user_id')
.divide(cohort_sizes, axis=0)
.round(2)
.fillna(0.0))
print(retention)Expected retention (rate):
cohort_index 1 2 3
CohortMonth
2023-01-01 1.00 1.00 0.50
2023-02-01 1.00 0.50 0.00
2023-03-01 1.00 0.00 0.00Example 2 — Revenue by cohort and period (ARPU)
revenue = (df.groupby(['CohortMonth','cohort_index'])['amount']
.sum()
.reset_index())
rev_table = revenue.pivot(index='CohortMonth', columns='cohort_index', values='amount').fillna(0)
cohort_sizes = (df.groupby(['CohortMonth'])['user_id'].nunique())
arpu = (rev_table.divide(cohort_sizes, axis=0).round(2))
print(rev_table)
print(arpu)Expected totals and ARPU:
# revenue totals
cohort_index 1 2 3
CohortMonth
2023-01-01 50 25 10
2023-02-01 65 25 0
2023-03-01 50 0 0
# ARPU per period
cohort_index 1 2 3
CohortMonth
2023-01-01 25.00 12.50 5.00
2023-02-01 32.50 12.50 0.00
2023-03-01 50.00 0.00 0.00Example 3 — Percentage retention with formatting
retention_pct = (retention * 100).round(0).astype(int).astype(str) + '%'
print(retention_pct)
# Optionally style for Jupyter
# retention.style.format('{:.0%}').background_gradient(cmap='Blues')Expected:
cohort_index 1 2 3
CohortMonth
2023-01-01 100 100 50
2023-02-01 100 50 0
2023-03-01 100 0 0How to read a cohort table
- Row = a cohort (users who started in that month).
- Column = months since start (period 1, 2, 3...).
- Retention rate = period active users / cohort size.
- Revenue cohort = how much that cohort spent in each period; ARPU = revenue / cohort size.
Practice & exercises
Everyone can do the exercises and quick test. Note: only logged-in users have progress auto-saved.
- Do Exercise 1 (retention table)
- Do Exercise 2 (revenue cohort & ARPU)
- Use the checklist below to self-verify
Exercise 1 — Build a monthly retention table
Goal: Create a retention table with cohort_index 1–3 using the dataset below. Show rates (0–1) with 2 decimals.
Dataset (CSV):
user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50Expected output (retention rates):
cohort_index 1 2 3
CohortMonth
2023-01-01 1.00 1.00 0.50
2023-02-01 1.00 0.50 0.00
2023-03-01 1.00 0.00 0.00- Use add_cohort_columns helper.
- Compute cohort_sizes from users active in period 1.
- Divide pivoted counts by cohort_sizes row-wise.
Hints
- Normalize to month with dt.to_period('M').dt.to_timestamp().
- cohort_index must be 1-based.
- Use nunique on user_id to avoid double-counting users within a period.
Show solution
import pandas as pd
from io import StringIO
csv = StringIO('''user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50
''')
def month_floor(s):
return s.dt.to_period('M').dt.to_timestamp()
def add_cohort_columns(df, user_col='user_id', date_col='order_date'):
df = df.copy()
df['OrderMonth'] = month_floor(df[date_col])
first = (df.groupby(user_col)[date_col].min().rename('CohortMonth'))
first = month_floor(first.to_frame()['CohortMonth']).to_frame()
df = df.merge(first, left_on=user_col, right_index=True, how='left')
df['cohort_index'] = (
(df['OrderMonth'].dt.year - df['CohortMonth'].dt.year) * 12
+ (df['OrderMonth'].dt.month - df['CohortMonth'].dt.month)
+ 1
)
return df
df = pd.read_csv(csv, parse_dates=['signup_date','order_date'])
df = add_cohort_columns(df, 'user_id', 'order_date')
cohort_counts = (df.groupby(['CohortMonth','cohort_index'])['user_id']
.nunique()
.reset_index())
cohort_sizes = (cohort_counts[cohort_counts['cohort_index']==1]
.set_index('CohortMonth')['user_id'])
retention = (cohort_counts
.pivot(index='CohortMonth', columns='cohort_index', values='user_id')
.divide(cohort_sizes, axis=0)
.round(2)
.fillna(0.0))
print(retention)Exercise 2 — Revenue cohorts and ARPU
Goal: Using the same dataset, compute revenue per cohort and period, then ARPU per period.
Expected revenue totals:
cohort_index 1 2 3
CohortMonth
2023-01-01 50 25 10
2023-02-01 65 25 0
2023-03-01 50 0 0Expected ARPU per period:
cohort_index 1 2 3
CohortMonth
2023-01-01 25.00 12.50 5.00
2023-02-01 32.50 12.50 0.00
2023-03-01 50.00 0.00 0.00Hints
- Group by ['CohortMonth','cohort_index'] and sum amount.
- Divide revenue pivot by cohort_sizes to get ARPU.
- Keep two decimals.
Show solution
# Reuse df from Exercise 1
a = (df.groupby(['CohortMonth','cohort_index'])['amount']
.sum()
.reset_index())
rev_table = a.pivot(index='CohortMonth', columns='cohort_index', values='amount').fillna(0)
cohort_sizes = df.groupby('CohortMonth')['user_id'].nunique()
arpu = rev_table.divide(cohort_sizes, axis=0).round(2)
print(rev_table)
print(arpu)Exercise checklist
- Dates are month-normalized before grouping
- CohortMonth is per-user first order month
- cohort_index is 1-based
- Counts use nunique(user_id)
- Retention table divides by cohort size row-wise
Common mistakes
- Using count instead of nunique: double-counts repeat orders within a period.
- 0-based cohort index: shifts all periods; always start at 1.
- Mismatched month granularity: grouping by full date instead of month yields sparse, misleading tables.
- Dividing by total users overall instead of cohort row size: makes rates incomparable across cohorts.
- Mixing acquisition and sign-up definitions: be explicit whether CohortMonth is first sign-up or first order.
How to self-check
- Sum of period-1 retention should be exactly 1.00 (100%) for each row.
- Period N retention cannot exceed period N-1 for the same row (monotonic non-increasing by activity definition).
- ARPU period 1 Ă— cohort size should equal period 1 revenue total.
Practical projects
- Build a retention dashboard notebook: functions for cohort tables, retention %, ARPU, and cohort-size sparkline per row.
- Campaign comparison: split users by acquisition source, then produce separate cohort tables to compare 2nd-month retention.
- Revenue expansion: compute cumulative revenue per cohort and highlight when cohorts break even on CAC (use your own CAC inputs).
Learning path
- Before: Data cleaning in pandas, date handling, groupby and pivot_table
- Now: Cohort tables (retention and revenue)
- Next: Rolling retention vs. classic period retention, survival curves, and funnel-to-retention links
Next steps
- Parameterize your cohort code into reusable functions.
- Add filters by product line or acquisition channel.
- Format output for stakeholders: round rates, highlight key periods, add comments.
Mini challenge
Extend the retention code to compute repeat-purchase rate: number of users with at least 2 orders by period / cohort size. Show a table for periods 1–3.