Why this matters
Aggregations by entity turn raw events into stable, predictive signals. In real-world data science work you will:
- Score customers for churn or CLV by summarizing their past activity.
- Detect fraud by aggregating card, device, or IP behavior over short time windows.
- Forecast demand by aggregating product-level sales across days/weeks.
- Improve ranking/recommendations using per-user and per-item interaction stats.
Real tasks you might get
- "Build customer-level features from the transactions table for a churn model (no leakage)."
- "Compute device- and merchant-level rolling stats for fraud rules within 24h/7d."
- "Create per-seller conversion and cancellation rates to monitor marketplace quality."
Concept explained simply
Aggregation features by entity are statistics computed over groups (entities) such as customer_id, product_id, merchant_id, session_id, or device_id. You group rows by the entity and summarize with functions like count, sum, mean, std, min, max, nunique, ratios, and rolling-window versions over time.
Common entity keys: user/customer, item/product, merchant/seller, card/device/IP, session/order. Common windows: lifetime-to-date, last N days, last K events, or anchored windows (up to a prediction cutoff time).
Mental model
- Define the entity: the unit you will score (user, card, product).
- Define the horizon: what is “past” relative to prediction time (avoid leakage).
- Summarize behavior: use simple stats first; add diversity, rates, and ratios next.
- Join back: merge aggregated features to the modeling table keyed by entity (and time if needed).
Useful aggregation types
- Volume: count, sum
- Central tendency and spread: mean, median, std
- Diversity: nunique, entropy-like proxies (e.g., nunique / count)
- Rates and ratios: percent_of_total, share_by_category, conversion_rate
- Time-aware: rolling counts/sums, recency (days since last event), frequency (see the sketch after this list)
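A minimal pandas sketch of the diversity and time-aware types. The table name txns, the columns customer_id, category_id, timestamp, and the cutoff timestamp T are illustrative assumptions, not a fixed schema:
# Sketch (pandas-style): diversity, recency, and frequency per customer
import pandas as pd

past = txns[txns['timestamp'] < T]                 # strictly before the prediction time
g = past.groupby('customer_id')
agg = g.agg(
    txn_count=('timestamp', 'count'),
    cat_nunique=('category_id', 'nunique'),
    first_txn=('timestamp', 'min'),
    last_txn=('timestamp', 'max'),
)
agg['cat_diversity'] = agg['cat_nunique'] / agg['txn_count']      # entropy-like proxy
agg['recency_days'] = (T - agg['last_txn']).dt.days               # days since last event
active_days = (agg['last_txn'] - agg['first_txn']).dt.days.clip(lower=1)
agg['txns_per_day'] = agg['txn_count'] / active_days              # frequency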
Worked examples
Example 1: E-commerce customer churn
Goal: Predict churn on date T. Build customer-level features from transactions strictly before T.
# Pandas example: customer-level features from transactions strictly before T
import pandas as pd

past = txns[txns['timestamp'] < T]        # only events before the prediction date
by_user = past.groupby('customer_id')
features = by_user.agg(
    txn_count=('amount', 'count'),
    amount_sum=('amount', 'sum'),
    amount_mean=('amount', 'mean'),
    amount_std=('amount', 'std'),
)
# Diversity and behavior
cat_diversity = by_user['category_id'].nunique().rename('cat_nunique')
features = features.join(cat_diversity)
# Ratios (amount_sum / txn_count would just repeat amount_mean, so use a ratio that adds information)
features['cat_diversity_rate'] = features['cat_nunique'] / features['txn_count']
Why it helps: churners often have lower recent frequency, spend, and diversity.
Example 2: Fraud – per card rolling windows
Goal: For each transaction, compute features from the card's past behavior.
For each row at time t, compute statistics over the card's strictly prior transactions (for example, the count of prior transactions within 24h and the mean amount over 7d), always excluding the current row so only past data is used. Useful features (a code sketch follows this list):
- card_txn_count_24h: unusual bursts may indicate fraud.
- card_amount_mean_7d: sudden deviation can be suspicious.
- distinct_merchants_24h: too many unique merchants in a short time is risky.
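A minimal pandas sketch of these per-transaction features, assuming a txns table with card_id, timestamp, amount, and merchant_id (names are illustrative); the positional assignment at the end relies on sorting by card and time first:
# Sketch (pandas-style): features from strictly prior card activity
import pandas as pd

txns = txns.sort_values(['card_id', 'timestamp']).copy()
txns['merchant_code'] = txns['merchant_id'].astype('category').cat.codes  # numeric for rolling
g = txns.set_index('timestamp').groupby('card_id')

# closed='left' keeps only rows strictly before the current transaction
txns['card_txn_count_24h'] = (
    g['amount'].rolling('24h', closed='left').count().fillna(0).to_numpy()
)
txns['card_amount_mean_7d'] = (
    g['amount'].rolling('7d', closed='left').mean().to_numpy()    # NaN if no prior history
)
txns['card_uniq_merchants_24h'] = (
    g['merchant_code'].rolling('24h', closed='left')
     .apply(lambda s: s.nunique(), raw=False).fillna(0).to_numpy()
)
# Positional assignment works because txns is sorted by card_id then timestamp,
# which matches the order of the grouped rolling output.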
Example 3: Marketplace – per merchant quality
Goal: Monitor sellers and predict bad outcomes.
- merchant_conversion_rate_30d = purchases_30d / clicks_30d
- merchant_cancel_rate_30d = cancels_30d / orders_30d
- repeat_customer_rate_90d = repeat_buyers_90d / buyers_90d
These rates stabilize raw counts and support ranking, monitoring, and ML features.
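A minimal sketch of the first two rates, assuming a single events table with merchant_id, timestamp, and an event_type column containing 'click', 'purchase', 'order', and 'cancel' (all names and values are illustrative):
# Sketch (pandas-style): merchant-level 30-day rates as of cutoff T
import pandas as pd

last_30d = events[(events['timestamp'] >= T - pd.Timedelta(days=30)) & (events['timestamp'] < T)]
counts = last_30d.pivot_table(index='merchant_id', columns='event_type',
                              values='timestamp', aggfunc='count', fill_value=0)
rates = pd.DataFrame(index=counts.index)
rates['merchant_conversion_rate_30d'] = counts['purchase'] / counts['click'].clip(lower=1)
rates['merchant_cancel_rate_30d'] = counts['cancel'] / counts['order'].clip(lower=1)
# clip(lower=1) avoids division by zero for merchants with no clicks/orders in the window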
Step-by-step: Build aggregation features by entity
- Choose the prediction unit and time reference
  - Example: Predict churn per customer at date T; predict fraud per transaction at time t.
- Define the entity keys
  - Examples: customer_id, card_id, product_id, merchant_id.
- Select windows and filters
  - Lifetime-to-date, last 7/30/90 days, last K events.
  - Always use data strictly before the prediction time.
- Pick aggregations
  - Start simple: count, sum, mean, std, nunique.
  - Add ratios/rates for stability (e.g., conversion_rate).
- Compute aggregates
  - Group by entity (and time window if rolling) and summarize.
- Join to the modeling table
  - Merge on entity key (and time if needed). Validate row counts and nulls (see the sketch after this list).
- Validate leakage and stability
  - Spot-check timestamps and confirm no future info is used.
  - Check distribution drift across folds/time.
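A minimal sketch of the join and the basic validations, assuming a modeling table model_df with one row per customer_id and the features table built earlier (both names are illustrative):
# Sketch (pandas-style): join features to the modeling table and validate
n_before = len(model_df)
model_df = model_df.merge(
    features.reset_index(), on='customer_id', how='left', validate='one_to_one'
)
assert len(model_df) == n_before, 'join changed the row count'

# Entities with no history come back as NaN: fill counts/sums with 0, handle means separately
count_cols = ['txn_count', 'amount_sum', 'cat_nunique']
model_df[count_cols] = model_df[count_cols].fillna(0)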
Time window tips
- Short windows (24h/7d) capture recent spikes and bursts.
- Longer windows (30–180d) capture habit and seasonality.
- Use multiple windows to cover multiple time scales (see the sketch after this list).
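A minimal sketch of computing the same aggregates over several anchored windows before a cutoff T; txns, customer_id, amount, and timestamp are assumed names:
# Sketch (pandas-style): 7-day and 30-day anchored windows for the same entity
import pandas as pd

def window_feats(txns, T, days):
    w = txns[(txns['timestamp'] >= T - pd.Timedelta(days=days)) & (txns['timestamp'] < T)]
    return w.groupby('customer_id')['amount'].agg(
        **{f'txn_count_{days}d': 'count', f'amount_sum_{days}d': 'sum'}
    )

multi_window = window_feats(txns, T, 7).join(window_feats(txns, T, 30), how='outer').fillna(0)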
Ratios and rates that work well
- entity_category_share = amount_in_cat / total_amount
- conversion_rate = buys / visits (with smoothing if sparse; see the sketch after this list)
- repeat_rate = repeat_users / users
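When the denominator is small, raw rates jump around; pseudo-count smoothing pulls them toward a prior. A minimal sketch, assuming buys and visits are per-entity Series and that alpha/beta are illustrative prior pseudo-counts you would tune:
# Sketch: smoothed rate via pseudo-counts
alpha, beta = 1.0, 20.0                                       # prior "successes" and prior "trials"
conversion_rate_smoothed = (buys + alpha) / (visits + beta)   # low-traffic entities shrink toward alpha/beta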
Who this is for
- Data Scientists building features for supervised models.
- Analytics Engineers preparing semantic layers for ML.
- ML Engineers productionizing feature pipelines.
Prerequisites
- Comfort with grouping and joining in Python/pandas or SQL.
- Basic time handling (timestamps, time zones, rolling windows).
- Understanding of train/validation splits and data leakage.
Learning path
- Review grouping, joins, and datetime operations.
- Implement lifetime and simple recent-window aggregates.
- Add rolling, diversity, and ratio features.
- Harden against leakage; validate with time-based CV.
- Optimize and document features for reuse.
Common mistakes and self-checks
- Leakage: using events on/after prediction time.
  - Self-check: pick a random row; verify all contributing events are strictly earlier.
- Cardinality blow-up: aggregating by overly granular keys (e.g., order_id).
  - Self-check: number of groups should match intended entity scale.
- Sparse rates: unstable ratios when denominators are small.
  - Fix: add smoothing or minimum activity thresholds.
- Double counting after joins.
  - Self-check: row counts before/after join; use one-to-one keys when expected.
- Window misalignment across folds.
  - Self-check: recompute aggregates within each fold/time-split.
- Nulls from inactive entities.
  - Fix: fillna with safe defaults (0 for counts/sums; global mean for averages, with caution).
Hands-on exercises
Complete the exercises below. Solutions are provided in collapsible sections. After finishing, use the checklist to self-evaluate.
Exercise 1 — Customer 90-day aggregates
You have a transactions dataset with columns: customer_id, timestamp, amount, merchant_id, and weekday (0–6). Using a cutoff date T, compute per-customer features from the 90 days strictly before T:
- txn_count_90d
- amount_sum_90d, amount_mean_90d, amount_std_90d
- distinct_merchants_90d
- pct_weekend_90d (share of transactions on Sat/Sun)
- share_top_cat_90d (optional if category is available): top category share of amount
Return one row per customer active in the last 90 days.
Exercise 2 — Per-transaction rolling features (fraud)
Given a stream of transactions with card_id, timestamp, amount, and merchant_id, compute for each transaction:
- card_txn_count_24h: number of prior transactions for the same card in the past 24 hours.
- card_amount_mean_7d: mean amount of prior transactions in the past 7 days.
- card_uniq_merchants_24h: number of distinct merchants in the past 24 hours (prior only).
Exclude the current transaction from all windows. Keep output aligned to the original transaction rows.
Exercise checklist
- [ ] All features use strictly past data relative to T or the current row's timestamp.
- [ ] Row counts match expectations (no accidental row duplication).
- [ ] Reasonable defaults after joins (e.g., 0 for counts when no history).
- [ ] Verified at least one manual example by hand to confirm correctness.
Practical projects
- E-commerce churn sandbox: Build 30d/90d user aggregates, train a simple model, and compare AUC with and without aggregation features.
- Fraud mini-pipeline: Compute 24h/7d card/device rolling features and evaluate precision@k on a labeled set.
- Product demand notebook: Create per-product weekly aggregates and forecast next-week sales with a baseline model.
Next steps
- Extend with cross-entity features (e.g., user-vs-category shares and rates).
- Add smoothing for sparse rates (e.g., Bayesian conversion rate).
- Set up time-based cross-validation and feature drift checks.
Mini challenge
Pick an entity in your data (user, product, or merchant). Create three lifetime features and three recent-window features (e.g., last 7/30 days). Verify no leakage by manually checking timestamps for two random entities. Write one sentence for each feature explaining why it might help your target.