Why this matters
Clear, reusable feature definitions reduce duplicate work, prevent data leakage, and keep training and serving consistent. As a Machine Learning Engineer, you will:
- Design canonical features once and reuse them across multiple models and teams.
- Guarantee point-in-time correctness to avoid leakage.
- Align offline training and online inference with the same definitions.
- Version and evolve features without breaking downstream consumers.
Concept explained simply
A feature definition is a precise recipe for how to compute a feature. It includes its name, which entity it belongs to (like user_id or item_id), the time grain, source tables, transformations, freshness rules, and owners. Reuse means referencing the same, tested recipe across different models and pipelines rather than reinventing it.
Mental model
Think of each feature as a function:
(Entity, Event Time) → Value
If two models need the same function, they should call the same definition. If the function changes, you create a new version and keep the old one for backward compatibility.
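A minimal sketch of this mental model in Python (the names and the in-memory transactions list are illustrative, not a feature-store API):

```python
from datetime import datetime, timedelta

# Hypothetical sketch: a feature is a pure function of (entity, event_time).
# `transactions` stands in for whatever source the real definition points at.
def user_30d_txn_count(user_id: str, event_time: datetime,
                       transactions: list[dict]) -> int:
    """Count this user's transactions in the 30 days up to event_time."""
    window_start = event_time - timedelta(days=30)
    return sum(
        1 for t in transactions
        if t["user_id"] == user_id
        and window_start <= t["transaction_timestamp"] <= event_time
    )
```

Two models that need this value call the same function; a breaking change becomes a new function (a v2) rather than an in-place edit.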
Core elements of a good feature definition
- Name: Stable, descriptive, with a consistent prefix (entity) and suffix (time window, units). Example: user_30d_txn_count.
- Entity: The key the feature is computed for (user_id, card_id, item_id).
- Granularity: One row per (entity, event_time) pair, or one row per entity at each snapshot time.
- Time semantics: Window size, lag, and the event_time column used.
- Sources: Raw tables/streams and join keys.
- Transform: Deterministic steps (aggregations, encodings) with null handling.
- Freshness / TTL: How often it updates and when it expires.
- Backfill: Historical recomputation rules.
- Versioning: e.g., user_30d_txn_count:v1; create v2 for breaking changes.
- Ownership & description: Team, purpose, and contact.
- Quality checks: Simple tests (value ranges, monotonicity, non-null rate).
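The elements above can be captured in a small declarative record. This is a hedged sketch, not any real feature store's schema; every field name here is an assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDefinition:
    # Illustrative schema only; real registries (Feast, Tecton, etc.) differ.
    name: str                  # e.g. "user_30d_txn_count"
    version: int               # bump on breaking changes
    entity: str                # join key, e.g. "user_id"
    event_time_column: str     # column defining point-in-time semantics
    window: str                # e.g. "30d"; empty for time-invariant features
    sources: tuple[str, ...]   # raw tables/streams
    transform: str             # deterministic steps, incl. null handling
    freshness: str             # recompute cadence, e.g. "daily"
    ttl: str                   # expiry for online reads, e.g. "36h"
    owner: str                 # accountable team and contact
    description: str = ""
    checks: tuple[str, ...] = ()   # simple quality rules

user_30d_txn_count_v1 = FeatureDefinition(
    name="user_30d_txn_count",
    version=1,
    entity="user_id",
    event_time_column="transaction_timestamp",
    window="30d",
    sources=("transactions",),
    transform="COUNT(*) over trailing 30d; no transactions -> 0",
    freshness="daily",
    ttl="36h",
    owner="Growth Data",
    checks=("value >= 0", "non_null_rate == 1.0"),
)
```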
Naming pattern examples
- entity_window_metric_units: user_7d_avg_spend_usd
- entity_point_in_time_lag_feature: item_t-1d_price_usd
- entity_time_invariant_feature: merchant_category_code
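If you enforce the first pattern mechanically, a small check in CI can catch drift; the regex below is an assumption, so tune it to your own convention:

```python
import re

# Hypothetical convention check: entity_window_metric_units,
# e.g. user_7d_avg_spend_usd.
NAME_PATTERN = re.compile(r"^[a-z]+_\d+[dhm]_[a-z_]+_[a-z]+$")

assert NAME_PATTERN.match("user_7d_avg_spend_usd")
assert not NAME_PATTERN.match("avg_spend")  # ambiguous name fails the check
```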
Worked examples
Example 1: Churn features reused by marketing and support
- Feature: user_30d_txn_count:v1
- Entity: user_id
- Time: 30-day rolling window ending at event_time; source event_time = transaction_timestamp
- Sources: transactions(user_id, amount, transaction_timestamp)
- Transform: COUNT(*) of transactions within [event_time - 30d, event_time]
- Freshness/TTL: recompute daily; TTL 36h
- Nulls: if no transactions → 0
- Owner: Growth Data
Reuse: The same feature is used in a churn model and a marketing propensity model. Only one definition is maintained.
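A sketch of the transform in pandas, assuming label rows carry the prediction-time event_time (frame and column names are illustrative):

```python
import pandas as pd

# Toy frames: one row per transaction, one row per (user, prediction time).
tx = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "transaction_timestamp": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
})
labels = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "event_time": pd.to_datetime(["2024-01-25", "2024-01-25"]),
})

def user_30d_txn_count(labels: pd.DataFrame, tx: pd.DataFrame) -> pd.Series:
    """Point-in-time correct: only transactions in [event_time - 30d, event_time]."""
    joined = labels.merge(tx, on="user_id", how="left")
    in_window = (
        (joined["transaction_timestamp"] >= joined["event_time"] - pd.Timedelta(days=30))
        & (joined["transaction_timestamp"] <= joined["event_time"])
    )
    # Null handling from the definition: users with no transactions get 0.
    return in_window.groupby([joined["user_id"], joined["event_time"]]).sum().astype(int)
```

Because both the churn and propensity models read this one definition, neither re-implements the window logic.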
Example 2: Fraud features across card and device entities
- Feature: card_1h_txn_count:v1
- Entity: card_id
- Time: 1-hour rolling window
- Sources: auth_logs(card_id, auth_time)
- Transform: COUNT(*) in [event_time - 1h, event_time]
- Freshness: every 5 minutes; TTL 10m (for online fraud scoring)
- Quality check: value spike guardrail (p99 over last day within 2x of prior day)
- Owner: Risk ML
Reuse: The feature feeds the real-time fraud model and a nightly risk dashboard without redefining logic.
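The spike guardrail from this example could look roughly like the check below; the threshold and the failure behavior are assumptions:

```python
import numpy as np

def p99_spike_guardrail(today: np.ndarray, prior_day: np.ndarray,
                        max_ratio: float = 2.0) -> bool:
    """True if today's p99 stays within max_ratio x the prior day's p99."""
    return np.percentile(today, 99) <= max_ratio * np.percentile(prior_day, 99)

# Usage sketch: refuse to publish a suspect materialization.
if not p99_spike_guardrail(np.array([1, 2, 3]), np.array([1, 2, 2])):
    raise ValueError("card_1h_txn_count p99 spiked; blocking publish")
```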
Example 3: Ranking features from clicks and impressions
- Feature: item_30d_ctr:v2
- Entity: item_id
- Time: 30-day window using impression_time as event_time
- Sources: impressions(item_id, impression_time, clicked)
- Transform: SUM(clicked)/COUNT(*) over 30 days; add 1e-6 smoothing
- Freshness: hourly; TTL 3h
- Change from v1: switched to impression_time (fix leakage); v1 kept for legacy
- Owner: Recsys
Reuse: The CTR feature is used by both the recommendation ranker and an ads-quality model. Versioning allows migration without breaking old consumers.
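A sketch of the v2 transform, keeping v1 registered for legacy consumers. Frame and column names are assumptions, and "add 1e-6 smoothing" is read here as adding 1e-6 to numerator and denominator:

```python
import pandas as pd

def item_30d_ctr_v2(impressions: pd.DataFrame) -> pd.Series:
    """Assumes `impressions` is already windowed to 30 days by impression_time."""
    grouped = impressions.groupby("item_id")["clicked"]
    return (grouped.sum() + 1e-6) / (grouped.count() + 1e-6)

# Both versions stay addressable until v1's consumers migrate.
FEATURES = {
    "item_30d_ctr:v1": "legacy event_time semantics (leaky); do not adopt",
    "item_30d_ctr:v2": item_30d_ctr_v2,
}
```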
How to define features well (step-by-step)
- Choose the entity and event_time that reflect the prediction moment.
- Write the transform declaratively (window, filters, aggregations) and include null handling.
- Specify freshness, TTL, and backfill so both offline and online stores behave consistently.
- Attach ownership, description, and tests for accountability and trust.
- Version when breaking changes occur (new source, different time semantics, changed units).
Do and Don't examples
- Do: user_7d_avg_spend_usd clearly states window and units.
- Don't: avg_spend (ambiguous, not reusable across entities).
- Do: define event_time column and window.
- Don't: leave event_time implicit (risk of leakage).
Common mistakes and self-check
- Leakage: using future information. Self-check: Can I compute this at prediction time using only past data?
- Training-serving mismatch: different transforms offline vs. online. Self-check: Are both paths derived from the same definition? (See the sketch after this list.)
- Ambiguous names/units. Self-check: Would a new teammate infer window, units, and entity from the name?
- Missing owner/tests. Self-check: Is there a contact and at least one validation rule?
- Wrong granularity. Self-check: Is there exactly one value per (entity, event_time) as intended?
- Unbounded TTL. Self-check: Will stale values persist longer than allowed for decisions?
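One way to pass the training-serving self-check is to route both paths through a single shared transform; a minimal sketch with illustrative names:

```python
from datetime import datetime, timedelta

def txn_count_30d(timestamps: list[datetime], event_time: datetime) -> int:
    """The single shared transform: count events in [event_time - 30d, event_time]."""
    start = event_time - timedelta(days=30)
    return sum(1 for ts in timestamps if start <= ts <= event_time)

def offline_features(rows: list[dict]) -> list[int]:
    # Batch/training path: historical event_time per row.
    return [txn_count_30d(r["timestamps"], r["event_time"]) for r in rows]

def online_feature(request: dict) -> int:
    # Serving path: one entity, event_time = now.
    return txn_count_30d(request["timestamps"], datetime.utcnow())
```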
Exercises
Do these in a text editor or notebook. Keep definitions concise and precise.
Exercise 1 — Define reusable spend features
Goal: Create two features for user-level spend from a transactions table.
Data snapshot
transactions(user_id, amount_usd, transaction_timestamp)
Define:
- user_30d_total_spend_usd
- user_7d_txn_count
For each, include: entity, event_time, window, sources, transform, null handling, freshness, TTL, backfill, owner, version.
Checklist:
- Window and event_time clearly stated
- Transform unambiguous
- Freshness/TTL feasible for both offline and online
- Owner and version present
Exercise 2 — Plan feature reuse across models
You support three models: churn, LTV, and marketing propensity. Candidate features:
- user_30d_total_spend_usd
- user_7d_txn_count
- user_country_code
- user_first_purchase_age_days
Plan:
- Which features become a shared user_activity bundle?
- Which are static profile features?
- How would you version if user_country_code source changes?
Checklist:
- No duplication across models
- Clear bundle boundaries (activity vs. profile)
- Versioning rule articulated
Mini challenge
Pick any feature you defined before. Propose a v2 change that improves quality (e.g., different window, better smoothing). Explain why it is breaking or non-breaking and how you would roll it out while keeping existing consumers stable.
Who this is for
- Machine Learning Engineers and Data Scientists building production models.
- Data Engineers supporting feature pipelines.
Prerequisites
- Comfort with SQL aggregations and windowing.
- Basic understanding of training-serving skew and data leakage.
Learning path
- Before: Data sources and event_time fundamentals; point-in-time correctness.
- Now: Feature definitions and reuse (this lesson).
- Next: Materialization strategies, monitoring, and feature lifecycle management.
Practical projects
- Project 1: Define a user_activity bundle with 3 rolling-window features and backfill one year of history.
- Project 2: Create v2 of a feature with changed window; run an A/B evaluation while keeping v1 for legacy consumers.
- Project 3: Write simple validation tests (range checks, non-null rate) that run during materialization.
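As a starting point for Project 3, materialization-time checks can be as simple as the assertions below (thresholds are illustrative):

```python
import pandas as pd

def validate(values: pd.Series, lo: float, hi: float,
             min_non_null_rate: float = 0.99) -> None:
    """Fail the materialization job loudly instead of publishing bad values."""
    non_null_rate = values.notna().mean()
    assert non_null_rate >= min_non_null_rate, f"non-null rate {non_null_rate:.3f} too low"
    assert values.dropna().between(lo, hi).all(), "values outside expected range"

validate(pd.Series([0, 3, 12]), lo=0, hi=1_000)  # e.g. user_30d_txn_count
```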
Next steps
- Adopt a naming convention across teams.
- Add ownership and tests to your top 10 features.
- Plan versioning for any upcoming breaking changes.
Quick Test
Take the quick test to check your understanding.