Why this matters
Aggregations by entity turn raw events into stable, predictive signals. In real-world data science work you will:
- Score customers for churn or CLV by summarizing their past activity.
- Detect fraud by aggregating card, device, or IP behavior over short time windows.
- Forecast demand by aggregating product-level sales across days/weeks.
- Improve ranking/recommendations using per-user and per-item interaction stats.
Real tasks you might get
- "Build customer-level features from the transactions table for a churn model (no leakage)."
- "Compute device- and merchant-level rolling stats for fraud rules within 24h/7d."
- "Create per-seller conversion and cancellation rates to monitor marketplace quality."
Concept explained simply
Aggregation features by entity are statistics computed over groups (entities) such as customer_id, product_id, merchant_id, session_id, or device_id. You group rows by the entity and summarize with functions like count, sum, mean, std, min, max, nunique, ratios, and rolling-window versions over time.
Common entity keys: user/customer, item/product, merchant/seller, card/device/IP, session/order. Common windows: lifetime-to-date, last N days, last K events, or anchored windows (up to a prediction cutoff time).
Mental model
- Define the entity: the unit you will score (user, card, product).
- Define the horizon: what is “past” relative to prediction time (avoid leakage).
- Summarize behavior: use simple stats first; add diversity, rates, and ratios next.
- Join back: merge aggregated features to the modeling table keyed by entity (and time if needed).
Useful aggregation types
- Volume: count, sum
- Central tendency and spread: mean, median, std
- Diversity: nunique, entropy-like proxies (e.g., nunique / count)
- Rates and ratios: percent_of_total, share_by_category, conversion_rate
- Time-aware: rolling counts/sums, recency (days since last event), frequency (see the sketch after this list)
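A minimal pandas sketch of the diversity and time-aware types. The table name txns, the columns customer_id, category_id, timestamp, and the cutoff timestamp T are illustrative assumptions, not a fixed schema:
# Sketch (pandas-style): diversity, recency, and frequency per customer
import pandas as pd

past = txns[txns['timestamp'] < T]                 # strictly before the prediction time
g = past.groupby('customer_id')
agg = g.agg(
    txn_count=('timestamp', 'count'),
    cat_nunique=('category_id', 'nunique'),
    first_txn=('timestamp', 'min'),
    last_txn=('timestamp', 'max'),
)
agg['cat_diversity'] = agg['cat_nunique'] / agg['txn_count']      # entropy-like proxy
agg['recency_days'] = (T - agg['last_txn']).dt.days               # days since last event
active_days = (agg['last_txn'] - agg['first_txn']).dt.days.clip(lower=1)
agg['txns_per_day'] = agg['txn_count'] / active_days              # frequency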
Worked examples
Example 1: E-commerce customer churn
Goal: Predict churn on date T. Build customer-level features from transactions strictly before T.
# Pandas example: customer-level features from transactions strictly before T
import pandas as pd

past = txns[txns['timestamp'] < T]        # only events before the prediction date
by_user = past.groupby('customer_id')
features = by_user.agg(
    txn_count=('amount', 'count'),
    amount_sum=('amount', 'sum'),
    amount_mean=('amount', 'mean'),
    amount_std=('amount', 'std'),
)
# Diversity and behavior
cat_diversity = by_user['category_id'].nunique().rename('cat_nunique')
features = features.join(cat_diversity)
# Ratios (amount_sum / txn_count would just repeat amount_mean, so use a ratio that adds information)
features['cat_diversity_rate'] = features['cat_nunique'] / features['txn_count']
Why it helps: churners often have lower recent frequency, spend, and diversity.
Example 2: Fraud – per card rolling windows
Goal: For each transaction, compute features from the card's past behavior.
For each row at time t, compute statistics over the card's strictly prior transactions (for example, the count of prior transactions within 24h and the mean amount over 7d), always excluding the current row so only past data is used. Useful features (a code sketch follows this list):
- card_txn_count_24h: unusual bursts may indicate fraud.
- card_amount_mean_7d: sudden deviation can be suspicious.
- distinct_merchants_24h: too many unique merchants in a short time is risky.
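A minimal pandas sketch of these per-transaction features, assuming a txns table with card_id, timestamp, amount, and merchant_id (names are illustrative); the positional assignment at the end relies on sorting by card and time first:
# Sketch (pandas-style): features from strictly prior card activity
import pandas as pd

txns = txns.sort_values(['card_id', 'timestamp']).copy()
txns['merchant_code'] = txns['merchant_id'].astype('category').cat.codes  # numeric for rolling
g = txns.set_index('timestamp').groupby('card_id')

# closed='left' keeps only rows strictly before the current transaction
txns['card_txn_count_24h'] = (
    g['amount'].rolling('24h', closed='left').count().fillna(0).to_numpy()
)
txns['card_amount_mean_7d'] = (
    g['amount'].rolling('7d', closed='left').mean().to_numpy()    # NaN if no prior history
)
txns['card_uniq_merchants_24h'] = (
    g['merchant_code'].rolling('24h', closed='left')
     .apply(lambda s: s.nunique(), raw=False).fillna(0).to_numpy()
)
# Positional assignment works because txns is sorted by card_id then timestamp,
# which matches the order of the grouped rolling output.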
Example 3: Marketplace – per merchant quality
Goal: Monitor sellers and predict bad outcomes.
- merchant_conversion_rate_30d = purchases_30d / clicks_30d
- merchant_cancel_rate_30d = cancels_30d / orders_30d
- repeat_customer_rate_90d = repeat_buyers_90d / buyers_90d
These rates stabilize raw counts and support ranking, monitoring, and ML features.
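A minimal sketch of the first two rates, assuming a single events table with merchant_id, timestamp, and an event_type column containing 'click', 'purchase', 'order', and 'cancel' (all names and values are illustrative):
# Sketch (pandas-style): merchant-level 30-day rates as of cutoff T
import pandas as pd

last_30d = events[(events['timestamp'] >= T - pd.Timedelta(days=30)) & (events['timestamp'] < T)]
counts = last_30d.pivot_table(index='merchant_id', columns='event_type',
                              values='timestamp', aggfunc='count', fill_value=0)
rates = pd.DataFrame(index=counts.index)
rates['merchant_conversion_rate_30d'] = counts['purchase'] / counts['click'].clip(lower=1)
rates['merchant_cancel_rate_30d'] = counts['cancel'] / counts['order'].clip(lower=1)
# clip(lower=1) avoids division by zero for merchants with no clicks/orders in the window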
Step-by-step: Build aggregation features by entity
- Choose the prediction unit and time reference
  - Example: Predict churn per customer at date T; predict fraud per transaction at time t.
- Define the entity keys
  - Examples: customer_id, card_id, product_id, merchant_id.
- Select windows and filters
  - Lifetime-to-date, last 7/30/90 days, last K events.
  - Always use data strictly before the prediction time.
- Pick aggregations
  - Start simple: count, sum, mean, std, nunique.
  - Add ratios/rates for stability (e.g., conversion_rate).
- Compute aggregates
  - Group by entity (and time window if rolling) and summarize.
- Join to the modeling table
  - Merge on entity key (and time if needed). Validate row counts and nulls (see the sketch after this list).
- Validate leakage and stability
  - Spot-check timestamps and confirm no future info is used.
  - Check distribution drift across folds/time.
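A minimal sketch of the join and the basic validations, assuming a modeling table model_df with one row per customer_id and the features table built earlier (both names are illustrative):
# Sketch (pandas-style): join features to the modeling table and validate
n_before = len(model_df)
model_df = model_df.merge(
    features.reset_index(), on='customer_id', how='left', validate='one_to_one'
)
assert len(model_df) == n_before, 'join changed the row count'

# Entities with no history come back as NaN: fill counts/sums with 0, handle means separately
count_cols = ['txn_count', 'amount_sum', 'cat_nunique']
model_df[count_cols] = model_df[count_cols].fillna(0)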
Time window tips
- Short windows (24h/7d) capture recent spikes and bursts.
- Longer windows (30–180d) capture habit and seasonality.
- Use multiple windows to cover multiple time scales (see the sketch after this list).
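A minimal sketch of computing the same aggregates over several anchored windows before a cutoff T; txns, customer_id, amount, and timestamp are assumed names:
# Sketch (pandas-style): 7-day and 30-day anchored windows for the same entity
import pandas as pd

def window_feats(txns, T, days):
    w = txns[(txns['timestamp'] >= T - pd.Timedelta(days=days)) & (txns['timestamp'] < T)]
    return w.groupby('customer_id')['amount'].agg(
        **{f'txn_count_{days}d': 'count', f'amount_sum_{days}d': 'sum'}
    )

multi_window = window_feats(txns, T, 7).join(window_feats(txns, T, 30), how='outer').fillna(0)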
Ratios and rates that work well
- entity_category_share = amount_in_cat / total_amount
- conversion_rate = buys / visits (with smoothing if sparse; see the sketch after this list)
- repeat_rate = repeat_users / users
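When the denominator is small, raw rates jump around; pseudo-count smoothing pulls them toward a prior. A minimal sketch, assuming buys and visits are per-entity Series and that alpha/beta are illustrative prior pseudo-counts you would tune:
# Sketch: smoothed rate via pseudo-counts
alpha, beta = 1.0, 20.0                                       # prior "successes" and prior "trials"
conversion_rate_smoothed = (buys + alpha) / (visits + beta)   # low-traffic entities shrink toward alpha/beta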
Who this is for
- Data Scientists building features for supervised models.
- Analytics Engineers preparing semantic layers for ML.
- ML Engineers productionizing feature pipelines.
Prerequisites
- Comfort with grouping and joining in Python/pandas or SQL.
- Basic time handling (timestamps, time zones, rolling windows).
- Understanding of train/validation splits and data leakage.
Learning path
- Review grouping, joins, and datetime operations.
- Implement lifetime and simple recent-window aggregates.
- Add rolling, diversity, and ratio features.
- Harden against leakage; validate with time-based CV.
- Optimize and document features for reuse.
Common mistakes and self-checks
- Leakage: using events on/after prediction time.
  - Self-check: pick a random row; verify all contributing events are strictly earlier.
- Cardinality blow-up: aggregating by overly granular keys (e.g., order_id).
  - Self-check: number of groups should match intended entity scale.
- Sparse rates: unstable ratios when denominators are small.
  - Fix: add smoothing or minimum activity thresholds.
- Double counting after joins.
  - Self-check: row counts before/after join; use one-to-one keys when expected.
- Window misalignment across folds.
  - Self-check: recompute aggregates within each fold/time-split.
- Nulls from inactive entities.
  - Fix: fillna with safe defaults (0 for counts/sums; global mean for averages, with caution).
Hands-on exercises
Complete the exercises below. Solutions are provided in collapsible sections. After finishing, use the checklist to self-evaluate.
Exercise 1 — Customer 90-day aggregates
You have a transactions dataset with columns: customer_id, timestamp, amount, merchant_id, and weekday (0–6). Using a cutoff date T, compute per-customer features from the 90 days strictly before T:
- txn_count_90d
- amount_sum_90d, amount_mean_90d, amount_std_90d
- distinct_merchants_90d
- pct_weekend_90d (share of transactions on Sat/Sun)
- share_top_cat_90d (optional if category is available): top category share of amount
Return one row per customer active in the last 90 days.
Exercise 2 — Per-transaction rolling features (fraud)
Given a stream of transactions with card_id, timestamp, amount, and merchant_id, compute for each transaction:
- card_txn_count_24h: number of prior transactions for the same card in the past 24 hours.
- card_amount_mean_7d: mean amount of prior transactions in the past 7 days.
- card_uniq_merchants_24h: number of distinct merchants in the past 24 hours (prior only).
Exclude the current transaction from all windows. Keep output aligned to the original transaction rows.
Exercise checklist
- [ ] All features use strictly past data relative to T or the current row's timestamp.
- [ ] Row counts match expectations (no accidental row duplication).
- [ ] Reasonable defaults after joins (e.g., 0 for counts when no history).
- [ ] Verified at least one manual example by hand to confirm correctness.
Practical projects
- E-commerce churn sandbox: Build 30d/90d user aggregates, train a simple model, and compare AUC with and without aggregation features.
- Fraud mini-pipeline: Compute 24h/7d card/device rolling features and evaluate precision@k on a labeled set.
- Product demand notebook: Create per-product weekly aggregates and forecast next-week sales with a baseline model.
Next steps
- Extend with cross-entity features (e.g., user-vs-category shares and rates).
- Add smoothing for sparse rates (e.g., Bayesian conversion rate).
- Set up time-based cross-validation and feature drift checks.
Mini challenge
Pick an entity in your data (user, product, or merchant). Create three lifetime features and three recent-window features (e.g., last 7/30 days). Verify no leakage by manually checking timestamps for two random entities. Write one sentence for each feature explaining why it might help your target.