luvv to helpDiscover the Best Free Online Tools
Topic 8 of 8

Cohort Tables In Pandas

Learn Cohort Tables In Pandas for free with explanations, exercises, and a quick test (for Product Analyst).

Published: December 22, 2025 | Updated: December 22, 2025

Why this matters

Cohort tables show how groups of users (acquired in the same period) behave over time. As a Product Analyst, you use cohort analysis to answer questions like:

  • How well do users we acquired in January retain in the next 3 months?
  • Which campaign delivers higher 2nd-month repeat purchase rate?
  • What is the cumulative revenue per cohort after 90 days?

These insights inform product changes, lifecycle messaging, and budget allocation.

Who this is for

  • Product Analysts and Data Analysts who report on retention, revenue, and repeat behavior.
  • PMs who need quick, reliable retention tables without BI tool overhead.

Prerequisites

  • Comfort with pandas DataFrames: grouping, merging, pivot_table
  • Basic date operations in pandas (dt.to_period, dt.to_timestamp)
  • Know how to interpret retention and ARPU

Concept explained simply

A cohort is a group of users who share a starting moment (e.g., first order month). A cohort table shows how that group behaves in subsequent periods (month 1, 2, 3...).

Mental model

Think of each cohort as a classroom that starts in Month 0. "Period 1" is their first month after starting, "Period 2" is their second, and so on. Each period answers: how many classmates are still active (retention), how much they spent (revenue), or how many repeated (repeat rate).

Data you need

  • user_id
  • order_date (or event_date for activity)
  • amount (optional, for revenue cohorts)

We derive per-user cohort start (first order month) and then compare each order's month to that start to compute a cohort_index (period number).

Step-by-step build

  1. Normalize dates to month buckets.
  2. Assign a cohort month per user (their first order month).
  3. Compute cohort_index = months since cohort month + 1.
  4. Aggregate with groupby to counts (users) or sums (revenue).
  5. pivot_table to wide format and format as rates or currency.
Helper code you will reuse
import pandas as pd

def month_floor(s):
    return s.dt.to_period('M').dt.to_timestamp()

def add_cohort_columns(df, user_col='user_id', date_col='order_date'):
    df = df.copy()
    df['OrderMonth'] = month_floor(df[date_col])
    first_order = (
        df.groupby(user_col)[date_col]
          .min()
          .rename('CohortMonth')
          .to_frame()
    )
    first_order['CohortMonth'] = month_floor(first_order['CohortMonth'])
    df = df.merge(first_order, on=user_col, how='left')
    # cohort index is 1-based number of months since cohort start
    df['cohort_index'] = (
        (df['OrderMonth'].dt.year - df['CohortMonth'].dt.year) * 12
        + (df['OrderMonth'].dt.month - df['CohortMonth'].dt.month)
        + 1
    )
    return df

Worked examples

Example 1 — Acquisition and retention counts

Data:

import pandas as pd
from io import StringIO
csv = StringIO('''user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50
''')
df = pd.read_csv(csv, parse_dates=['signup_date','order_date'])
df = add_cohort_columns(df, 'user_id', 'order_date')

# cohort sizes (unique users with period 1 activity)
cohort_counts = (df.groupby(['CohortMonth','cohort_index'])['user_id']
                   .nunique()
                   .reset_index())
cohort_sizes = (cohort_counts[cohort_counts['cohort_index']==1]
                .set_index('CohortMonth')['user_id'])

retention = (cohort_counts
    .pivot(index='CohortMonth', columns='cohort_index', values='user_id')
    .divide(cohort_sizes, axis=0)
    .round(2)
    .fillna(0.0))
print(retention)

Expected retention (rate):

cohort_index   1     2     3
CohortMonth                  
2023-01-01   1.00  1.00  0.50
2023-02-01   1.00  0.50  0.00
2023-03-01   1.00  0.00  0.00
Example 2 — Revenue by cohort and period (ARPU)
revenue = (df.groupby(['CohortMonth','cohort_index'])['amount']
             .sum()
             .reset_index())
rev_table = revenue.pivot(index='CohortMonth', columns='cohort_index', values='amount').fillna(0)
cohort_sizes = (df.groupby(['CohortMonth'])['user_id'].nunique())
arpu = (rev_table.divide(cohort_sizes, axis=0).round(2))
print(rev_table)
print(arpu)

Expected totals and ARPU:

# revenue totals
cohort_index     1     2     3
CohortMonth                    
2023-01-01     50    25    10
2023-02-01     65    25     0
2023-03-01     50     0     0

# ARPU per period
cohort_index      1      2     3
CohortMonth                       
2023-01-01    25.00  12.50  5.00
2023-02-01    32.50  12.50  0.00
2023-03-01    50.00   0.00  0.00
Example 3 — Percentage retention with formatting
retention_pct = (retention * 100).round(0).astype(int).astype(str) + '%'
print(retention_pct)
# Optionally style for Jupyter
# retention.style.format('{:.0%}').background_gradient(cmap='Blues')

Expected:

cohort_index   1    2    3
CohortMonth                
2023-01-01   100  100   50
2023-02-01   100   50    0
2023-03-01   100    0    0

How to read a cohort table

  • Row = a cohort (users who started in that month).
  • Column = months since start (period 1, 2, 3...).
  • Retention rate = period active users / cohort size.
  • Revenue cohort = how much that cohort spent in each period; ARPU = revenue / cohort size.

Practice & exercises

Everyone can do the exercises and quick test. Note: only logged-in users have progress auto-saved.

  • Do Exercise 1 (retention table)
  • Do Exercise 2 (revenue cohort & ARPU)
  • Use the checklist below to self-verify
Exercise 1 — Build a monthly retention table

Goal: Create a retention table with cohort_index 1–3 using the dataset below. Show rates (0–1) with 2 decimals.

Dataset (CSV):

user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50

Expected output (retention rates):

cohort_index   1     2     3
CohortMonth                  
2023-01-01   1.00  1.00  0.50
2023-02-01   1.00  0.50  0.00
2023-03-01   1.00  0.00  0.00
  • Use add_cohort_columns helper.
  • Compute cohort_sizes from users active in period 1.
  • Divide pivoted counts by cohort_sizes row-wise.
Hints
  • Normalize to month with dt.to_period('M').dt.to_timestamp().
  • cohort_index must be 1-based.
  • Use nunique on user_id to avoid double-counting users within a period.
Show solution
import pandas as pd
from io import StringIO

csv = StringIO('''user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50
''')

def month_floor(s):
    return s.dt.to_period('M').dt.to_timestamp()

def add_cohort_columns(df, user_col='user_id', date_col='order_date'):
    df = df.copy()
    df['OrderMonth'] = month_floor(df[date_col])
    first = (df.groupby(user_col)[date_col].min().rename('CohortMonth'))
    first = month_floor(first.to_frame()['CohortMonth']).to_frame()
    df = df.merge(first, left_on=user_col, right_index=True, how='left')
    df['cohort_index'] = (
        (df['OrderMonth'].dt.year - df['CohortMonth'].dt.year) * 12
        + (df['OrderMonth'].dt.month - df['CohortMonth'].dt.month)
        + 1
    )
    return df

df = pd.read_csv(csv, parse_dates=['signup_date','order_date'])
df = add_cohort_columns(df, 'user_id', 'order_date')
cohort_counts = (df.groupby(['CohortMonth','cohort_index'])['user_id']
                   .nunique()
                   .reset_index())
cohort_sizes = (cohort_counts[cohort_counts['cohort_index']==1]
                .set_index('CohortMonth')['user_id'])
retention = (cohort_counts
    .pivot(index='CohortMonth', columns='cohort_index', values='user_id')
    .divide(cohort_sizes, axis=0)
    .round(2)
    .fillna(0.0))
print(retention)
Exercise 2 — Revenue cohorts and ARPU

Goal: Using the same dataset, compute revenue per cohort and period, then ARPU per period.

Expected revenue totals:

cohort_index     1     2     3
CohortMonth                    
2023-01-01     50    25    10
2023-02-01     65    25     0
2023-03-01     50     0     0

Expected ARPU per period:

cohort_index      1      2     3
CohortMonth                       
2023-01-01    25.00  12.50  5.00
2023-02-01    32.50  12.50  0.00
2023-03-01    50.00   0.00  0.00
Hints
  • Group by ['CohortMonth','cohort_index'] and sum amount.
  • Divide revenue pivot by cohort_sizes to get ARPU.
  • Keep two decimals.
Show solution
# Reuse df from Exercise 1
a = (df.groupby(['CohortMonth','cohort_index'])['amount']
       .sum()
       .reset_index())
rev_table = a.pivot(index='CohortMonth', columns='cohort_index', values='amount').fillna(0)
cohort_sizes = df.groupby('CohortMonth')['user_id'].nunique()
arpu = rev_table.divide(cohort_sizes, axis=0).round(2)
print(rev_table)
print(arpu)

Exercise checklist

  • Dates are month-normalized before grouping
  • CohortMonth is per-user first order month
  • cohort_index is 1-based
  • Counts use nunique(user_id)
  • Retention table divides by cohort size row-wise

Common mistakes

  • Using count instead of nunique: double-counts repeat orders within a period.
  • 0-based cohort index: shifts all periods; always start at 1.
  • Mismatched month granularity: grouping by full date instead of month yields sparse, misleading tables.
  • Dividing by total users overall instead of cohort row size: makes rates incomparable across cohorts.
  • Mixing acquisition and sign-up definitions: be explicit whether CohortMonth is first sign-up or first order.
How to self-check
  • Sum of period-1 retention should be exactly 1.00 (100%) for each row.
  • Period N retention cannot exceed period N-1 for the same row (monotonic non-increasing by activity definition).
  • ARPU period 1 Ă— cohort size should equal period 1 revenue total.

Practical projects

  • Build a retention dashboard notebook: functions for cohort tables, retention %, ARPU, and cohort-size sparkline per row.
  • Campaign comparison: split users by acquisition source, then produce separate cohort tables to compare 2nd-month retention.
  • Revenue expansion: compute cumulative revenue per cohort and highlight when cohorts break even on CAC (use your own CAC inputs).

Learning path

  • Before: Data cleaning in pandas, date handling, groupby and pivot_table
  • Now: Cohort tables (retention and revenue)
  • Next: Rolling retention vs. classic period retention, survival curves, and funnel-to-retention links

Next steps

  • Parameterize your cohort code into reusable functions.
  • Add filters by product line or acquisition channel.
  • Format output for stakeholders: round rates, highlight key periods, add comments.

Mini challenge

Extend the retention code to compute repeat-purchase rate: number of users with at least 2 orders by period / cohort size. Show a table for periods 1–3.

Practice Exercises

2 exercises to complete

Instructions

Create a retention table (periods 1–3) from the dataset below. Show rates from 0 to 1 with 2 decimals.

user_id,signup_date,order_date,amount
1,2023-01-10,2023-01-10,20
1,2023-01-10,2023-02-05,15
1,2023-01-10,2023-03-12,10
2,2023-01-15,2023-01-20,30
2,2023-01-15,2023-02-20,10
3,2023-02-02,2023-02-10,25
3,2023-02-02,2023-03-01,25
4,2023-02-18,2023-02-18,40
5,2023-03-05,2023-03-08,50
  • Normalize to month, compute CohortMonth and 1-based cohort_index.
  • Use nunique on user_id for counts, pivot, then divide by cohort sizes.
Expected Output
Retention rates table: cohort_index 1 2 3 CohortMonth 2023-01-01 1.00 1.00 0.50 2023-02-01 1.00 0.50 0.00 2023-03-01 1.00 0.00 0.00

Have questions about Cohort Tables In Pandas?

AI Assistant

Ask questions about this tool