How to learn Cohort Exploration for Exploratory Analysis in Data Analyst for free

Who this is for

Junior and mid-level Data Analysts who need to uncover retention, repeat behavior, and lifecycle patterns.
Product and Marketing analysts validating experiments or campaign quality over time.
Anyone moving from simple aggregates to time-relative, lifecycle-aware analysis.

Why this matters

Cohort exploration groups users (or accounts, orders) by a shared starting event (for example, sign-up month) and follows them over equal time intervals (Week 0, Week 1, etc.). This lets you answer questions such as:

Are new users from January activating and staying longer than those from December?
Do customers from campaign A repurchase more by Day 30 than campaign B?
Did our onboarding change improve Week 4 retention?

These are core tasks in Data Analyst roles and directly inform product, marketing, and revenue decisions.

Concept explained simply

Think of each cohort as a graduating class. Everyone in the same class starts together, and you check how they are doing at the 1st, 2nd, 3rd weeks after starting. You then compare classes to see whether newer classes are doing better.

Mental model

Rows: Cohorts (e.g., 2025-01 sign-ups).
Columns: Time since start (e.g., Week 0, Week 1, Week 2...).
Cells: A metric for that cohort at that period (e.g., % of users active that week, or average revenue per user).
Read across a row to see lifecycle decay/improvement; read down a column to compare cohorts at the same lifecycle moment; read diagonals to spot seasonality or data issues.

Data you need

A unique entity ID (user_id or account_id).
A cohort start event date (e.g., signup_date, first_purchase_date, first_active_date).
Activity events with timestamps (e.g., login, session, purchase) to measure outcomes over time.
Optional value fields (e.g., revenue) if you track monetary metrics.

Minimum columns

user_id, signup_date, event_date, event_type, amount(optional)

Steps to build a cohort view

Define your cohort key: Choose the start event and the grain (by day, week, or month). Example: cohort_month = first day of signup month.
Define the period index: For each event, compute periods since start. Example: week_index = floor((event_date - signup_date) / 7).
Aggregate: For each cohort and period index, compute the metric (e.g., distinct active users, repeat purchasers, revenue).
Normalize (optional but common): Divide by the cohort size to get rates (e.g., retention %).
Visualize: Matrix with cohorts as rows and period indexes as columns. Shade or color in your BI tool, or keep as a table for exploration.

SQL-style outline

-- 1) Identify cohort for each user
WITH users AS (
  SELECT user_id,
         DATE_TRUNC('month', MIN(signup_date)) AS cohort_month,
         MIN(signup_date) AS cohort_start
  FROM user_signups
  GROUP BY 1
),
-- 2) Map events to periods since cohort start
user_events AS (
  SELECT e.user_id,
         u.cohort_month,
         DATE_DIFF('day', u.cohort_start, e.event_date)/7 AS week_index_raw,
         FLOOR(DATE_DIFF('day', u.cohort_start, e.event_date)/7) AS week_index
  FROM events e
  JOIN users u USING (user_id)
  WHERE e.event_date >= u.cohort_start
),
-- 3) Aggregate active users per cohort-week
active_cohort AS (
  SELECT cohort_month,
         week_index,
         COUNT(DISTINCT user_id) AS active_users
  FROM user_events
  WHERE week_index BETWEEN 0 AND 12
  GROUP BY 1,2
),
-- 4) Cohort sizes
sizes AS (
  SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
  FROM users
  GROUP BY 1
)
SELECT a.cohort_month,
       a.week_index,
       a.active_users,
       s.cohort_size,
       ROUND(100.0 * a.active_users / s.cohort_size, 1) AS retention_pct
FROM active_cohort a
JOIN sizes s USING (cohort_month)
ORDER BY cohort_month, week_index;

Spreadsheet-style outline

Create a table with user_id and signup_date. Add cohort_month = first day of signup month.
List events with event_date. Add week_index = INT((event_date - signup_date)/7).
Pivot: Rows = cohort_month, Columns = week_index, Values = count of distinct user_id (or count if one row per user-week).
Add another pivot for cohort size (count of distinct user_id in each cohort_month).
Compute retention % = active_users / cohort_size for each cell.

Worked examples

Example 1: User retention after signup

Goal: Measure percentage of users active in Week 0–4 after signup.

Cohort: signup_month
Metric: active users (distinct) in each week_index
Normalize by cohort size

Interpretation: If 2025-01 shows 100%, 45%, 35%, 28%, 24% across Weeks 0–4, your activation cliff is Week 1. If 2025-02 improves Week 1 to 52%, onboarding likely helped.

Example 2: Repeat purchase rate

Goal: Of customers who made their first purchase in a month, what share make another purchase by Week 4?

Cohort: first_purchase_month
Metric: distinct purchasers in each week_index (excluding the first purchase event)
Rate: repeat_purchasers / cohort_size

Interpretation: If Week 2 improves for newer cohorts, your cross-sell may be working.

Example 3: Account expansion (B2B)

Goal: Track Net Revenue Retention by quarter since go-live.

Cohort: go_live_quarter (accounts)
Metric: revenue in each quarter_index
Rate: revenue_in_period / revenue_in_period0 (or NRR conventions)

Interpretation: If Quarter 2+ rises above 100% for recent cohorts, your expansion motions are improving.

Common mistakes and self-checks

Mixing calendar time with cohort time: Ensure columns are time since cohort start, not absolute weeks.
Wrong cohort anchor: Be explicit (signup vs first purchase vs first activation). Recompute if your question changes.
Inconsistent denominators: Retention % should divide by the original cohort size, not the prior week’s size unless you want survival.
Event leakage before cohort start: Filter events earlier than start.
Sparse cohorts: Very small cohorts produce noisy rates. Aggregate by month instead of day if needed.

Self-check quick list

Do row 0 values equal 100% when normalized?
Does cohort size remain constant across columns?
Do diagonals reveal seasonality rather than lifecycle?
Are metric definitions documented in one sentence?

Practical projects

Retention matrix: Build a 12-week user retention matrix for two signup months; annotate Week 1 drop-off and propose a quick onboarding experiment.
Repeat purchase: Measure 8-week repeat rates by acquisition channel; identify the top decile cohort and hypothesize why.
Revenue cohorts: Compute 4-quarter revenue-per-account since go-live; flag cohorts with negative trend and list likely causes.

Learning path

Before this: Basic SQL/spreadsheet pivots; date functions; distinct counts.
Now: Cohort definitions, period indexing, normalization, interpretation.
Next: Survival analysis, lifecycle segmentation, retention drivers, causal inference basics.

Prerequisites

Date arithmetic (e.g., DATEDIFF, DATE_TRUNC, integer division).
Aggregation and distinct counts.
Pivot tables or group-by with CASE.

Mini challenge

Choose one metric and rebuild it with a different cohort anchor. Example: change from signup_month to first_activation_week. Do your conclusions change? Write 3 bullets on why.

Exercises

These mirror the exercises below. Do them in SQL or spreadsheets.

Exercise ex1 — Build a 5-week retention matrix

Dataset (copy into your tool):

user_id,signup_date,event_date,event_type
U1,2025-01-05,2025-01-05,login
U1,2025-01-05,2025-01-11,login
U1,2025-01-05,2025-01-26,login
U2,2025-01-20,2025-01-20,login
U2,2025-01-20,2025-01-27,login
U2,2025-01-20,2025-02-10,login
U3,2025-02-02,2025-02-02,login
U3,2025-02-02,2025-02-16,login
U4,2025-02-10,2025-02-10,login
U4,2025-02-10,2025-03-02,login

Cohort: signup_month
Period: week_index = INT((event_date - signup_date)/7)
Metric: count distinct user_id active in week_index 0..4
Output: retention % matrix by cohort row and week_index columns

Hints

Compute cohort_month = first day of signup month.
Compute week_index using integer division.
Get cohort_size = distinct users per cohort_month.
Pivot active users per cohort_month and week_index, then divide by cohort_size.

Expected output shape

cohort_month | W0   W1   W2   W3   W4
2025-01      |100%  67%  33%  33%   0%
2025-02      |100%  50%  50%  25%   0%

Next steps

Document your cohort definitions and period logic in a short metrics sheet.
Automate a weekly refresh and watch new cohorts vs. the last 3 months.
Layer segmentation (channel, plan, region) to see which cohorts drive change.

Quick Test

Available to everyone; only logged-in users get saved progress.

Instructions

Using the dataset below, compute a 5-week retention matrix by signup_month. Define week_index = INT((event_date - signup_date)/7). For each cohort_month and week_index (0..4), count distinct active users and divide by the cohort size to get retention %.

user_id,signup_date,event_date,event_type
U1,2025-01-05,2025-01-05,login
U1,2025-01-05,2025-01-11,login
U1,2025-01-05,2025-01-26,login
U2,2025-01-20,2025-01-20,login
U2,2025-01-20,2025-01-27,login
U2,2025-01-20,2025-02-10,login
U3,2025-02-02,2025-02-02,login
U3,2025-02-02,2025-02-16,login
U4,2025-02-10,2025-02-10,login
U4,2025-02-10,2025-03-02,login

Compute cohort_month from signup_date (first day of month).
Compute week_index for each event relative to signup_date.
Aggregate active users per cohort_month and week_index.
Compute retention % using cohort size.

Menu

Cohort Exploration

Table of Contents