How to learn Cohort Tables In Pandas for Python pandas in Product Analyst for free

Why this matters

Cohort tables show how groups of users (acquired in the same period) behave over time. As a Product Analyst, you use cohort analysis to answer questions like:

How well do users we acquired in January retain in the next 3 months?
Which campaign delivers higher 2nd-month repeat purchase rate?
What is the cumulative revenue per cohort after 90 days?

These insights inform product changes, lifecycle messaging, and budget allocation.

Who this is for

Product Analysts and Data Analysts who report on retention, revenue, and repeat behavior.
PMs who need quick, reliable retention tables without BI tool overhead.

Prerequisites

Comfort with pandas DataFrames: grouping, merging, pivot_table
Basic date operations in pandas (dt.to_period, dt.to_timestamp)
Know how to interpret retention and ARPU

Concept explained simply

A cohort is a group of users who share a starting moment (e.g., first order month). A cohort table shows how that group behaves in subsequent periods (month 1, 2, 3...).

Mental model

Think of each cohort as a classroom that starts in Month 0. "Period 1" is their first month after starting, "Period 2" is their second, and so on. Each period answers: how many classmates are still active (retention), how much they spent (revenue), or how many repeated (repeat rate).

Data you need

user_id
order_date (or event_date for activity)
amount (optional, for revenue cohorts)

We derive per-user cohort start (first order month) and then compare each order's month to that start to compute a cohort_index (period number).

Step-by-step build

Normalize dates to month buckets.
Assign a cohort month per user (their first order month).
Compute cohort_index = months since cohort month + 1.
Aggregate with groupby to counts (users) or sums (revenue).
pivot_table to wide format and format as rates or currency.

Helper code you will reuse

import pandas as pd

def month_floor(s):
    return s.dt.to_period('M').dt.to_timestamp()

def add_cohort_columns(df, user_col='user_id', date_col='order_date'):
    df = df.copy()
    df['OrderMonth'] = month_floor(df[date_col])
    first_order = (
        df.groupby(user_col)[date_col]
          .min()
          .rename('CohortMonth')
          .to_frame()
    )
    first_order['CohortMonth'] = month_floor(first_order['CohortMonth'])
    df = df.merge(first_order, on=user_col, how='left')
    # cohort index is 1-based number of months since cohort start
    df['cohort_index'] = (
        (df['OrderMonth'].dt.year - df['CohortMonth'].dt.year) * 12
        + (df['OrderMonth'].dt.month - df['CohortMonth'].dt.month)
        + 1
    )
    return df

Worked examples

Example 1 — Acquisition and retention counts

Data:

import pandas as pd
from io import StringIO
csv = StringIO('''user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50
''')
df = pd.read_csv(csv, parse_dates=['signup_date','order_date'])
df = add_cohort_columns(df, 'user_id', 'order_date')

# cohort sizes (unique users with period 1 activity)
cohort_counts = (df.groupby(['CohortMonth','cohort_index'])['user_id']
                   .nunique()
                   .reset_index())
cohort_sizes = (cohort_counts[cohort_counts['cohort_index']==1]
                .set_index('CohortMonth')['user_id'])

retention = (cohort_counts
    .pivot(index='CohortMonth', columns='cohort_index', values='user_id')
    .divide(cohort_sizes, axis=0)
    .round(2)
    .fillna(0.0))
print(retention)

Expected retention (rate):

cohort_index   1     2     3
CohortMonth                  
2023-01-01   1.00  1.00  0.50
2023-02-01   1.00  0.50  0.00
2023-03-01   1.00  0.00  0.00

Example 2 — Revenue by cohort and period (ARPU)

revenue = (df.groupby(['CohortMonth','cohort_index'])['amount']
             .sum()
             .reset_index())
rev_table = revenue.pivot(index='CohortMonth', columns='cohort_index', values='amount').fillna(0)
cohort_sizes = (df.groupby(['CohortMonth'])['user_id'].nunique())
arpu = (rev_table.divide(cohort_sizes, axis=0).round(2))
print(rev_table)
print(arpu)

Expected totals and ARPU:

# revenue totals
cohort_index     1     2     3
CohortMonth                    
2023-01-01     50    25    10
2023-02-01     65    25     0
2023-03-01     50     0     0

# ARPU per period
cohort_index      1      2     3
CohortMonth                       
2023-01-01    25.00  12.50  5.00
2023-02-01    32.50  12.50  0.00
2023-03-01    50.00   0.00  0.00

Example 3 — Percentage retention with formatting

retention_pct = (retention * 100).round(0).astype(int).astype(str) + '%'
print(retention_pct)
# Optionally style for Jupyter
# retention.style.format('{:.0%}').background_gradient(cmap='Blues')

Expected:

cohort_index   1    2    3
CohortMonth                
2023-01-01   100  100   50
2023-02-01   100   50    0
2023-03-01   100    0    0

How to read a cohort table

Row = a cohort (users who started in that month).
Column = months since start (period 1, 2, 3...).
Retention rate = period active users / cohort size.
Revenue cohort = how much that cohort spent in each period; ARPU = revenue / cohort size.

Practice & exercises

Everyone can do the exercises and quick test. Note: only logged-in users have progress auto-saved.

Do Exercise 1 (retention table)
Do Exercise 2 (revenue cohort & ARPU)
Use the checklist below to self-verify

Exercise 1 — Build a monthly retention table

Goal: Create a retention table with cohort_index 1–3 using the dataset below. Show rates (0–1) with 2 decimals.

Dataset (CSV):

user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50

Expected output (retention rates):

cohort_index   1     2     3
CohortMonth                  
2023-01-01   1.00  1.00  0.50
2023-02-01   1.00  0.50  0.00
2023-03-01   1.00  0.00  0.00

Use add_cohort_columns helper.
Compute cohort_sizes from users active in period 1.
Divide pivoted counts by cohort_sizes row-wise.

Hints

Normalize to month with dt.to_period('M').dt.to_timestamp().
cohort_index must be 1-based.
Use nunique on user_id to avoid double-counting users within a period.

Show solution

import pandas as pd
from io import StringIO

csv = StringIO('''user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50
''')

def month_floor(s):
    return s.dt.to_period('M').dt.to_timestamp()

def add_cohort_columns(df, user_col='user_id', date_col='order_date'):
    df = df.copy()
    df['OrderMonth'] = month_floor(df[date_col])
    first = (df.groupby(user_col)[date_col].min().rename('CohortMonth'))
    first = month_floor(first.to_frame()['CohortMonth']).to_frame()
    df = df.merge(first, left_on=user_col, right_index=True, how='left')
    df['cohort_index'] = (
        (df['OrderMonth'].dt.year - df['CohortMonth'].dt.year) * 12
        + (df['OrderMonth'].dt.month - df['CohortMonth'].dt.month)
        + 1
    )
    return df

df = pd.read_csv(csv, parse_dates=['signup_date','order_date'])
df = add_cohort_columns(df, 'user_id', 'order_date')
cohort_counts = (df.groupby(['CohortMonth','cohort_index'])['user_id']
                   .nunique()
                   .reset_index())
cohort_sizes = (cohort_counts[cohort_counts['cohort_index']==1]
                .set_index('CohortMonth')['user_id'])
retention = (cohort_counts
    .pivot(index='CohortMonth', columns='cohort_index', values='user_id')
    .divide(cohort_sizes, axis=0)
    .round(2)
    .fillna(0.0))
print(retention)

Exercise 2 — Revenue cohorts and ARPU

Goal: Using the same dataset, compute revenue per cohort and period, then ARPU per period.

Expected revenue totals:

cohort_index     1     2     3
CohortMonth                    
2023-01-01     50    25    10
2023-02-01     65    25     0
2023-03-01     50     0     0

Expected ARPU per period:

cohort_index      1      2     3
CohortMonth                       
2023-01-01    25.00  12.50  5.00
2023-02-01    32.50  12.50  0.00
2023-03-01    50.00   0.00  0.00

Hints

Group by ['CohortMonth','cohort_index'] and sum amount.
Divide revenue pivot by cohort_sizes to get ARPU.
Keep two decimals.

Show solution

# Reuse df from Exercise 1
a = (df.groupby(['CohortMonth','cohort_index'])['amount']
       .sum()
       .reset_index())
rev_table = a.pivot(index='CohortMonth', columns='cohort_index', values='amount').fillna(0)
cohort_sizes = df.groupby('CohortMonth')['user_id'].nunique()
arpu = rev_table.divide(cohort_sizes, axis=0).round(2)
print(rev_table)
print(arpu)

Exercise checklist

Dates are month-normalized before grouping
CohortMonth is per-user first order month
cohort_index is 1-based
Counts use nunique(user_id)
Retention table divides by cohort size row-wise

Common mistakes

Using count instead of nunique: double-counts repeat orders within a period.
0-based cohort index: shifts all periods; always start at 1.
Mismatched month granularity: grouping by full date instead of month yields sparse, misleading tables.
Dividing by total users overall instead of cohort row size: makes rates incomparable across cohorts.
Mixing acquisition and sign-up definitions: be explicit whether CohortMonth is first sign-up or first order.

How to self-check

Sum of period-1 retention should be exactly 1.00 (100%) for each row.
Period N retention cannot exceed period N-1 for the same row (monotonic non-increasing by activity definition).
ARPU period 1 × cohort size should equal period 1 revenue total.

Practical projects

Build a retention dashboard notebook: functions for cohort tables, retention %, ARPU, and cohort-size sparkline per row.
Campaign comparison: split users by acquisition source, then produce separate cohort tables to compare 2nd-month retention.
Revenue expansion: compute cumulative revenue per cohort and highlight when cohorts break even on CAC (use your own CAC inputs).

Learning path

Before: Data cleaning in pandas, date handling, groupby and pivot_table
Now: Cohort tables (retention and revenue)
Next: Rolling retention vs. classic period retention, survival curves, and funnel-to-retention links

Next steps

Parameterize your cohort code into reusable functions.
Add filters by product line or acquisition channel.
Format output for stakeholders: round rates, highlight key periods, add comments.

Mini challenge

Extend the retention code to compute repeat-purchase rate: number of users with at least 2 orders by period / cohort size. Show a table for periods 1–3.

Menu

Cohort Tables In Pandas

Table of Contents

Why this matters

Who this is for

Prerequisites

Concept explained simply

Data you need

Step-by-step build

Worked examples

How to read a cohort table

Practice & exercises

Exercise checklist

Common mistakes

Practical projects

Learning path

Next steps

Mini challenge

Practice Exercises

Build a monthly retention table

Instructions

Expected Output

Revenue cohorts and ARPU

Have questions about Cohort Tables In Pandas?

AI Assistant