Why this matters
As a Data Analyst, you often need to summarize large datasets by groups: customers, regions, channels, cohorts. pandas groupby + agg/transform/filter lets you compute KPIs, normalize values within groups, and keep only groups that meet business rules.
- Marketing: Calculate conversion rate by campaign and country, then keep campaigns with enough impressions.
- Product: Identify features with high average rating and at least N reviews.
- Finance: Report revenue, orders, and unique customers by month and segment.
Concept explained simply
Think of groupby as arranging rows into labeled bins. Then you choose what to do with each bin:
- agg: collapse each bin to a smaller summary (count, sum, mean, custom).
- transform: compute a per-group value but return the original shape, aligned to each row (good for percentages, z-scores).
- filter: keep or drop entire groups based on a condition.
Mental model: split → apply → combine. groupby creates the split, your function is apply, and pandas recombines results.
Key patterns you’ll use
- Multiple aggregations per column with clear names using named aggregations.
- Different aggregations for different columns in one pass.
- transform to compute per-group totals/means that you can use per row.
- filter to keep groups that meet business thresholds (volume, quality).
Syntax quick reference
import pandas as pd
# Named aggregations
(df.groupby(['colA','colB'])
.agg(total=('value','sum'), avg=('value','mean'), n=('id','count'))
.reset_index())
# Multiple aggs on same column
(df.groupby('group')['value']
.agg(total='sum', avg='mean', q90=lambda s: s.quantile(0.9)))
# transform: returns same length as original
s = df.groupby('group')['value'].transform('sum')
# filter: keep whole groups that satisfy condition
filtered = df.groupby('group').filter(lambda g: g['value'].sum() > 100)
Worked examples
Example 1: Business summary with multiple aggregations
Goal: By region and category, compute total revenue, average order value, number of orders, and unique customers.
Show code and output
import pandas as pd
df = pd.DataFrame({
'region': ['North','North','South','South','South'],
'category': ['A','A','A','B','B'],
'amount': [100,200,50,300,150],
'order_id': [1,2,3,4,5],
'customer_id': [10,10,20,30,30]
})
summary = (df.groupby(['region','category'])
.agg(total_amount=('amount','sum'),
avg_amount=('amount','mean'),
orders=('order_id','count'),
unique_customers=('customer_id','nunique'))
.reset_index()
.sort_values(['region','category']))
print(summary)
Expected shape: one row per region-category with the four metrics.
Example 2: Within-group normalization using transform
Goal: Compute percent of region total and a per-user z-score for order amount.
Show code and output
# Percent of region total
region_total = df.groupby('region')['amount'].transform('sum')
df['pct_region'] = df['amount'] / region_total
# Z-score by customer (ddof=0 for population std)
df['amount_z_by_customer'] = df.groupby('customer_id')['amount'] \
.transform(lambda s: (s - s.mean()) / s.std(ddof=0))
print(df[['region','customer_id','amount','pct_region','amount_z_by_customer']])
Expected: pct_region sums to 1 within each region; z-scores are centered around 0 per customer.
Example 3: Keep only strong products (group filter)
Goal: Keep products with at least 3 ratings and average rating ≥ 4.0.
Show code and output
df2 = pd.DataFrame({
'product': ['P1','P1','P1','P2','P2','P3'],
'rating': [4.5, 4.2, 3.9, 4.8, 4.9, 3.0]
})
strong = df2.groupby('product').filter(
lambda g: (len(g) >= 3) and (g['rating'].mean() >= 4.0)
)
print(strong)
Expected: Only rows for products that meet both conditions remain.
Exercises you can run
Mirror of the exercises below. Run them in your own environment and compare with the expected outputs.
- Exercise 1: Multi-aggregation with named outputs.
- Exercise 2: transform for within-group metrics.
- Exercise 3: filter to enforce business thresholds.
Common mistakes and self-check
- Mistake: Using agg when you need per-row aligned values. Self-check: If you need the same number of rows as input, use transform.
- Mistake: Forgetting reset_index() after groupby.agg, then being confused by index. Self-check: Do you need grouping keys as columns? Use as_index=False or reset_index().
- Mistake: filter condition returns a Series. Self-check: Your function must return a single True/False per group.
- Mistake: Mixing row-wise operations inside agg. Self-check: agg reduces; avoid referencing outside the grouped columns in reductions.
- Mistake: Dividing by the global total instead of group total. Self-check: Compute totals via transform per group.
Who this is for
- Data Analysts and BI professionals summarizing datasets by segments.
- Anyone preparing dashboards, cohort analyses, or A/B test reports.
Prerequisites
- Basic pandas: DataFrame creation, column selection, filtering.
- Python basics: functions and lambdas.
- Comfort with descriptive statistics (mean, count, sum).
Learning path
- Review groupby basics: what forms the group keys and how indexes behave.
- Practice named aggregations with multiple metrics per group.
- Use transform to compute per-group totals/means and create ratio columns.
- Apply filter to enforce volume or quality thresholds on groups.
- Combine: filter → agg → transform for robust reporting.
Practical projects
- E-commerce KPI board: By country and device, compute revenue, AOV, orders, unique buyers; add each order’s percent of country total.
- Product quality screen: Keep products with ≥ 20 reviews and avg rating ≥ 4.2; rank by revenue.
- Sales team report: For each rep, calculate monthly totals and a z-score of deal size within rep; highlight outliers.
Mini challenge
Given orders(user_id, region, amount, channel), create:
- A summary per region-channel with total_amount, avg_amount, orders, unique_users.
- A column on the original DataFrame with percent of channel total within the same region.
- Filter to keep only region-channel pairs with ≥ 100 orders and total_amount ≥ 50,000.
Tip: Use a multi-key groupby for both agg and transform.
Next steps
- Rerun your last team report using named aggregations for clarity and fewer post-processing steps.
- Add at least one transform-based metric (e.g., percent-of-group) to your next dashboard.
- Use filter to enforce a minimum volume for any KPI you present.
Quick Test
Take the quick test below to check your understanding. Available to everyone; only logged-in users get saved progress.