
Aggregation Features By Entity

Learn Aggregation Features By Entity for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Aggregations by entity turn raw events into stable, predictive signals. In real data science work you will:

  • Score customers for churn or CLV by summarizing their past activity.
  • Detect fraud by aggregating card, device, or IP behavior over short time windows.
  • Forecast demand by aggregating product-level sales across days/weeks.
  • Improve ranking/recommendations using per-user and per-item interaction stats.

Real tasks you might get
  • "Build customer-level features from the transactions table for a churn model (no leakage)."
  • "Compute device- and merchant-level rolling stats for fraud rules within 24h/7d."
  • "Create per-seller conversion and cancellation rates to monitor marketplace quality."

Concept explained simply

Aggregation features by entity are statistics computed over groups (entities) such as customer_id, product_id, merchant_id, session_id, or device_id. You group rows by the entity and summarize with functions like count, sum, mean, std, min, max, nunique, ratios, and rolling-window versions over time.

Common entity keys: user/customer, item/product, merchant/seller, card/device/IP, session/order. Common windows: lifetime-to-date, last N days, last K events, or anchored windows (up to a prediction cutoff time).

Mental model

  • Define the entity: the unit you will score (user, card, product).
  • Define the horizon: what is “past” relative to prediction time (avoid leakage).
  • Summarize behavior: use simple stats first; add diversity, rates, and ratios next.
  • Join back: merge aggregated features to the modeling table keyed by entity (and time if needed).

Useful aggregation types
  • Volume: count, sum
  • Central tendency and spread: mean, median, std
  • Diversity: nunique, entropy-like proxies (e.g., nunique / count)
  • Rates and ratios: percent_of_total, share_by_category, conversion_rate
  • Time-aware: rolling counts/sums, recency (days since last event), frequency
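A minimal pandas sketch showing several of these types in one pass. The events table, cutoff T, and column names here are illustrative assumptions, not fixed APIs:

# Sketch (pandas-style); events, T, and column names are assumptions
by_user = events[events['timestamp'] < T].groupby('user_id')
feats = by_user.agg(
    event_count=('event_id', 'count'),    # volume
    amount_std=('amount', 'std'),         # spread
    cat_nunique=('category', 'nunique'),  # diversity
    last_event=('timestamp', 'max'),
)
feats['cat_per_event'] = feats['cat_nunique'] / feats['event_count']  # diversity ratio
feats['recency_days'] = (T - feats['last_event']).dt.days             # time-aware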

Worked examples

Example 1: E-commerce customer churn

Goal: Predict churn on date T. Build customer-level features from transactions strictly before T.

# pandas sketch; assumes txns has customer_id, timestamp, amount, category_id
past = txns[txns['timestamp'] < T]  # strictly before the cutoff: no leakage
by_user = past.groupby('customer_id')
features = by_user.agg(
    txn_count=('amount', 'count'),
    amount_sum=('amount', 'sum'),
    amount_mean=('amount', 'mean'),
    amount_std=('amount', 'std'),
)
# Diversity and behavior
features = features.join(by_user['category_id'].nunique().rename('cat_nunique'))
# Ratios (note: amount_sum / txn_count would just reproduce amount_mean,
# so divide by something new, e.g. average spend per distinct category)
features['amount_per_cat'] = features['amount_sum'] / features['cat_nunique']

Why it helps: churners often have lower recent frequency, spend, and diversity.

Example 2: Fraud – per card rolling windows

Goal: For each transaction, compute features from the card's past behavior.

# For each row r at time t: count prior txns within 24h, mean amount over 7d
# Ensure "strictly past": exclude the current row (see the pandas sketch after this list)

  • card_txn_count_24h: unusual bursts may indicate fraud.
  • card_amount_mean_7d: a sudden deviation from the card's norm can be suspicious.
  • distinct_merchants_24h: many unique merchants in a short time is risky.
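A minimal pandas sketch of these prior-only windows. It assumes a txns table with card_id, timestamp, amount, and merchant_id, sorted once up front; closed='left' keeps each window strictly before the current row:

import pandas as pd

txns = txns.sort_values('timestamp').set_index('timestamp')
by_card = txns.groupby('card_id')

# closed='left' makes each window [t - 24h, t): strictly prior rows only
txns['card_txn_count_24h'] = by_card['amount'].transform(
    lambda s: s.rolling('24h', closed='left').count()
).fillna(0)  # no prior activity in the window -> 0

txns['card_amount_mean_7d'] = by_card['amount'].transform(
    lambda s: s.rolling('7D', closed='left').mean()
)  # stays NaN when the card has no 7-day history

# nunique has no built-in rolling version; .apply is slow but fine for a sketch
txns['merchant_code'] = txns['merchant_id'].factorize()[0]
txns['card_uniq_merchants_24h'] = by_card['merchant_code'].transform(
    lambda s: s.rolling('24h', closed='left').apply(lambda w: w.nunique(), raw=False)
).fillna(0)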

Example 3: Marketplace – per merchant quality

Goal: Monitor sellers and predict bad outcomes.

  • merchant_conversion_rate_30d = purchases_30d / clicks_30d
  • merchant_cancel_rate_30d = cancels_30d / orders_30d
  • repeat_customer_rate_90d = repeat_buyers_90d / buyers_90d

These rates stabilize raw counts and support ranking, monitoring, and ML features.
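As a sketch, here is the 30-day conversion rate with simple +1/+2 smoothing. It assumes an events table with merchant_id, timestamp, and 0/1 flags is_click and is_purchase, plus a cutoff T (all names illustrative):

import pandas as pd

win = events[(events['timestamp'] >= T - pd.Timedelta(days=30))
             & (events['timestamp'] < T)]
rates = win.groupby('merchant_id').agg(
    clicks_30d=('is_click', 'sum'),
    purchases_30d=('is_purchase', 'sum'),
)
# Laplace-style smoothing keeps the rate stable for low-traffic merchants
rates['conversion_rate_30d'] = (rates['purchases_30d'] + 1) / (rates['clicks_30d'] + 2)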

Step-by-step: Build aggregation features by entity

  1. Choose the prediction unit and time reference
    • Example: Predict churn per customer at date T; predict fraud per transaction at time t.
  2. Define the entity keys
    • Examples: customer_id, card_id, product_id, merchant_id.
  3. Select windows and filters
    • Lifetime-to-date, last 7/30/90 days, last K events.
    • Always use data strictly before the prediction time.
  4. Pick aggregations
    • Start simple: count, sum, mean, std, nunique.
    • Add ratios/rates for stability (e.g., conversion_rate).
  5. Compute aggregates
    • Group by entity (and time window if rolling) and summarize.
  6. Join to the modeling table
    • Merge on entity key (and time if needed). Validate row counts and nulls (see the sketch after this list).
  7. Validate leakage and stability
    • Spot-check timestamps and confirm no future info is used.
    • Check distribution drift across folds/time.
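Steps 6 and 7 as a short sketch, reusing the features table from Example 1 (model_df and the key column are illustrative):

n_before = len(model_df)
model_df = model_df.merge(features.reset_index(), on='customer_id',
                          how='left', validate='many_to_one')  # fail fast on duplicate keys
assert len(model_df) == n_before, 'join changed the row count'
model_df['txn_count'] = model_df['txn_count'].fillna(0)  # safe default: no history = 0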

Time window tips
  • Short windows (24h/7d) capture recent spikes and bursts.
  • Longer windows (30–180d) capture habit and seasonality.
  • Use multiple windows to cover multiple time scales.
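One way to cover several scales at once, reusing txns, T, and features from Example 1:

import pandas as pd

for days in (7, 30, 90):
    win = txns[(txns['timestamp'] >= T - pd.Timedelta(days=days))
               & (txns['timestamp'] < T)]
    col = f'txn_count_{days}d'
    features = features.join(win.groupby('customer_id').size().rename(col))
    features[col] = features[col].fillna(0)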

Ratios and rates that work well
  • entity_category_share = amount_in_cat / total_amount
  • conversion_rate = buys / visits (with smoothing if sparse)
  • repeat_rate = repeat_users / users

Who this is for

  • Data Scientists building features for supervised models.
  • Analytics Engineers preparing semantic layers for ML.
  • ML Engineers productionizing feature pipelines.

Prerequisites

  • Comfort with grouping and joining in Python/pandas or SQL.
  • Basic time handling (timestamps, time zones, rolling windows).
  • Understanding of train/validation splits and data leakage.

Learning path

  1. Review grouping, joins, and datetime operations.
  2. Implement lifetime and simple recent-window aggregates.
  3. Add rolling, diversity, and ratio features.
  4. Harden against leakage; validate with time-based CV.
  5. Optimize and document features for reuse.

Common mistakes and self-checks

  • Leakage: using events on/after prediction time.
    • Self-check: pick a random row; verify all contributing events are strictly earlier.
  • Cardinality blow-up: aggregating by overly granular keys (e.g., order_id).
    • Self-check: number of groups should match intended entity scale.
  • Sparse rates: unstable ratios when denominators are small.
    • Fix: add smoothing or minimum activity thresholds.
  • Double counting after joins.
    • Self-check: row counts before/after join; use one-to-one keys when expected.
  • Window misalignment across folds.
    • Self-check: recompute aggregates within each fold/time-split.
  • Nulls from inactive entities.
    • Fix: fillna with safe defaults (0 for counts/sums; global mean for averages, with caution).
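A hand-check sketch for the leakage and correctness bullets above, reusing the txns, T, and features names from Example 1:

cust = features.sample(1).index[0]  # pick one random customer
hist = txns[(txns['customer_id'] == cust) & (txns['timestamp'] < T)]
assert len(hist) == features.loc[cust, 'txn_count']  # recompute one feature (assumes no missing amounts)
assert (hist['timestamp'] < T).all()                 # strictly past events only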

Hands-on exercises

Complete the exercises below, then use the checklist to self-evaluate.

Exercise 1 — Customer 90-day aggregates

You have a transactions dataset with columns: customer_id, timestamp, amount, merchant_id, and weekday (0–6). Using a cutoff date T, compute per-customer features from the 90 days strictly before T:

  • txn_count_90d
  • amount_sum_90d, amount_mean_90d, amount_std_90d
  • distinct_merchants_90d
  • pct_weekend_90d (share of transactions on Sat/Sun)
  • share_top_cat_90d (optional if category is available): top category share of amount

Return one row per customer active in the last 90 days. Expected output: a customer-level table with the requested columns, all computed from data with timestamp < T.

Exercise 2 — Per-transaction rolling features (fraud)

Given a stream of transactions with card_id, timestamp, amount, and merchant_id, compute for each transaction:

  • card_txn_count_24h: number of prior transactions for the same card in the past 24 hours.
  • card_amount_mean_7d: mean amount of prior transactions in the past 7 days.
  • card_uniq_merchants_24h: number of distinct merchants in the past 24 hours (prior only).

Exclude the current transaction from all windows. Keep output aligned to the original transaction rows.

Exercise checklist

  • [ ] All features use strictly past data relative to T or the current row's timestamp.
  • [ ] Row counts match expectations (no accidental row duplication).
  • [ ] Reasonable defaults after joins (e.g., 0 for counts when no history).
  • [ ] Verified at least one manual example by hand to confirm correctness.

Practical projects

  • E-commerce churn sandbox: Build 30d/90d user aggregates, train a simple model, and compare AUC with and without aggregation features.
  • Fraud mini-pipeline: Compute 24h/7d card/device rolling features and evaluate precision@k on a labeled set.
  • Product demand notebook: Create per-product weekly aggregates and forecast next-week sales with a baseline model.

Next steps

  • Extend with cross-entity features (e.g., user-vs-category shares and rates).
  • Add smoothing for sparse rates (e.g., Bayesian conversion rate).
  • Set up time-based cross-validation and feature drift checks.


Mini challenge

Pick an entity in your data (user, product, or merchant). Create three lifetime features and three recent-window features (e.g., last 7/30 days). Verify no leakage by manually checking timestamps for two random entities. Write one sentence for each feature explaining why it might help your target.


Aggregation Features By Entity — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

