
Feature Generation Steps

Learn Feature Generation Steps for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Why this matters

MLOps Engineers make features reliable, reproducible, and fast. Good features are the backbone of training pipelines and batch scoring. Your daily tasks include building point-in-time correct aggregates, handling late data, preventing leakage, and materializing features for both training and inference with identical logic.

  • Own feature definitions as code and version them.
  • Backfill features for historical training windows safely.
  • Detect and fix training–serving skew.
  • Monitor freshness, drift, and null spikes.

Mini task: Spot the risk

You compute "7-day purchases" using the latest table snapshot for both training and inference. Risk: data leakage in training, because the latest snapshot contains information that was not available at each historical training timestamp. Fix: compute the window strictly from data up to each row's training timestamp.

Concept explained simply

A feature is a computed column that summarizes raw data into a machine-usable signal. In batch pipelines, features are reproducible transformations with a time boundary.

Mental model: a kitchen with clear recipes. Each feature has a recipe card (contract), sketched in code after this list:

  • Name and business meaning
  • Entities/keys (e.g., user_id, card_id)
  • Time semantics (event_time, windows, timezone)
  • Definition (SQL/pseudocode), null handling, units
  • Validation checks and owner
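
For instance, the recipe card can live as a plain data structure in code. A minimal sketch; every field name here is illustrative rather than a real feature-store schema:

# A minimal feature contract as a plain Python dict.
user_7d_orders_count_spec = {
    "name": "user_7d_orders_count",
    "description": "Orders placed by the user in the trailing 7 days",
    "entities": ["user_id"],
    "event_time": "order_ts",  # column that drives point-in-time logic
    "window": "7 days",
    "timezone": "UTC",
    "definition": "COUNT(order_id) WHERE order_ts in (ref_ts - 7d, ref_ts]",
    "null_handling": "users with no orders => 0",
    "units": "count",
    "checks": ["value >= 0", "null_rate == 0"],
    "owner": "mlops-team",
    "version": 1,
}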

Core steps for feature generation

1) Define entities and timestamps

Choose primary keys and the event_time used for point-in-time correctness.

2) Collect and clean raw data

Deduplicate, standardize types/timezones, fill or flag missing fields.
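
A minimal cleaning step in pandas might look like this (a sketch; the orders table and its columns follow Example 1 below):

import pandas as pd

def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Parse timestamps and normalize everything to UTC.
    df["order_ts"] = pd.to_datetime(df["order_ts"], utc=True)
    # Standardize numeric types; unparseable values become NaN.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # Dedup rule is a design choice: keep the latest record per order_id.
    df = df.sort_values("order_ts").drop_duplicates("order_id", keep="last")
    # Flag rather than silently fill; imputation happens in a later step.
    df["amount_missing"] = df["amount"].isna()
    return df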

3) Join safely

Join on keys with time constraints; avoid looking ahead of the reference time.

4) Aggregate/encode

Rolling windows, counts, sums, ratios, one-hot/embeddings, text/numeric transforms.

5) Enforce time boundaries

All computations must use event_time <= reference_time, where reference_time is the training point's timestamp or the batch scoring time.

6) Impute and normalize

Apply stable imputation and scaling. Store parameters derived only from training data.
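
One way to keep this stable is to fit parameters on the training window only, persist them, and reuse the same stored values at inference. A minimal sketch (the file name and sample values are illustrative):

import json
import statistics

def fit_impute_scale_params(train_values):
    # Derive imputation and scaling parameters from training data only.
    observed = [v for v in train_values if v is not None]
    return {
        "impute_value": statistics.median(observed),
        "mean": statistics.fmean(observed),
        "std": statistics.stdev(observed),
    }

def apply_impute_scale(value, params):
    # Apply the same stored parameters at training and inference time.
    v = params["impute_value"] if value is None else value
    return (v - params["mean"]) / params["std"]

params = fit_impute_scale_params([10.0, None, 12.5, 9.0, 30.0])
with open("amount_scaler_v1.json", "w") as f:
    json.dump(params, f)  # version this file with the feature definition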

7) Validate and document

Run schema and value checks; version the definition and keep a changelog.

8) Materialize

Write features to a registry/feature store with keys, timestamps, and backfill as needed.

9) Monitor

Track freshness, null rates, distribution drift, and compute latency.

Worked examples

Example 1: E-commerce — user_7d_orders_count (point-in-time)
Tables:
  orders(user_id, order_id, amount, order_ts)
  training_points(user_id, reference_ts)

Goal:
  For each training point, count orders in (reference_ts - 7d, reference_ts].

SQL (conceptual):
SELECT t.user_id,
       t.reference_ts,
       COUNT(o.order_id) AS user_7d_orders_count
FROM training_points t
LEFT JOIN orders o
  ON o.user_id = t.user_id
 AND o.order_ts > t.reference_ts - INTERVAL '7 day'
 AND o.order_ts <= t.reference_ts
GROUP BY t.user_id, t.reference_ts;

Notes: No future orders included; handles users with zero orders via LEFT JOIN.

Example 2: Fraud — card_24h_amount_sum (hourly materialization)
Tables:
  transactions(card_id, amount, event_ts)

Approach:
  1) Bucket by hour: hour_ts = date_trunc('hour', event_ts)
  2) For each card_id and hour, compute rolling 24h sum at the end of the hour.

SQL (conceptual):
WITH hourly AS (
  SELECT card_id,
         date_trunc('hour', event_ts) AS hour_ts,
         SUM(amount) AS amt_hour
  FROM transactions
  GROUP BY 1,2
)
SELECT h1.card_id,
       h1.hour_ts AS reference_ts,
       (
         SELECT SUM(h2.amt_hour)
         FROM hourly h2
         WHERE h2.card_id = h1.card_id
           AND h2.hour_ts > h1.hour_ts - INTERVAL '24 hours'
           AND h2.hour_ts <= h1.hour_ts
       ) AS card_24h_amount_sum
FROM hourly h1;

Notes: Materialize per hour to a feature table keyed by (card_id, reference_ts).

Example 3: Text — title features (length, ratio, simple embedding)
Input:
  items(item_id, title_text, created_ts)

Outputs (per item_id, created_ts):
  - title_len_chars
  - title_alpha_ratio = letters / total_chars
  - title_embed_16 (placeholder: 16-dim avg of character code buckets)

Python (conceptual):
import re
import unicodedata

def simple_avg_bucket_embedding(text, dim=16):
    # Cheap deterministic placeholder: average one-hot of ord(char) mod dim.
    vec = [0.0] * dim
    if not text:
        return vec
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return [v / len(text) for v in vec]

for row in items:
    t = unicodedata.normalize("NFKC", row.title_text or "")
    n_chars = len(t)
    n_alpha = len(re.findall(r"[A-Za-z]", t))
    alpha_ratio = n_alpha / n_chars if n_chars else 0.0
    emit(item_id=row.item_id, ts=row.created_ts,
         title_len_chars=n_chars,
         title_alpha_ratio=alpha_ratio,
         title_embed_16=simple_avg_bucket_embedding(t))

Notes: Ensure deterministic normalization; store exact pre-processing rules.

Design decisions that matter

  • Granularity: per-event, per-entity-per-time bucket (hour/day), or snapshot. Finer granularity increases storage but reduces leakage risk.
  • Window size: shorter windows capture recency, longer windows improve stability. Consider multiple windows (7d, 30d); see the sketch after this list.
  • Imputation: constant vs model-based. Keep it stable between training and inference.
  • Materialization cadence: balance latency (freshness) vs cost.
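
For example, several window sizes can be rolled up from one hourly table. A pandas sketch; it assumes one row per user per hour with an orders count and a datetime-typed hour_ts, and the column names are assumptions:

import pandas as pd

def multi_window_counts(hourly: pd.DataFrame) -> pd.DataFrame:
    # Sort by time so each user's series is monotonic, then index by
    # hour_ts so time-based rolling windows work.
    out = hourly.sort_values("hour_ts").set_index("hour_ts")
    grouped = out.groupby("user_id")["orders"]
    # Offset windows like "7D" cover (t - 7 days, t], matching the
    # (reference_ts - window, reference_ts] convention used above.
    out["orders_7d"] = grouped.transform(lambda s: s.rolling("7D").sum())
    out["orders_30d"] = grouped.transform(lambda s: s.rolling("30D").sum())
    return out.reset_index()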

Time alignment and leakage

Rule: every feature must only use data whose event_time <= reference_time of the row being scored/trained. A point-in-time sketch follows the list below.

  • Late-arriving data: either reprocess with watermarking or accept eventual consistency.
  • Backfills: always recompute using historical snapshots or event logs, not today’s corrected state.
  • Labels: compute labels with a lookahead window that starts strictly after reference_time.
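
The same rule in dataframe code, mirroring Example 1's SQL. A pandas sketch, fine for small frames but not optimized for scale:

import pandas as pd

def user_7d_orders_count(training_points: pd.DataFrame,
                         orders: pd.DataFrame) -> pd.DataFrame:
    joined = training_points.merge(orders, on="user_id", how="left")
    in_window = (
        (joined["order_ts"] > joined["reference_ts"] - pd.Timedelta(days=7))
        & (joined["order_ts"] <= joined["reference_ts"])  # never look ahead
    )
    # Blank out-of-window orders instead of dropping rows, so users with
    # zero qualifying orders still appear with a count of 0.
    joined.loc[~in_window, "order_id"] = None
    return (joined.groupby(["user_id", "reference_ts"])["order_id"]
                  .nunique()
                  .reset_index(name="user_7d_orders_count"))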

Self-check: Is this definition safe?

A feature defined as “orders in the last 7 days relative to today()” is unsafe for training. It must reference the per-row reference_time, not wall-clock now.

Feature quality checks

  • Schema: types, ranges, allowed null rates.
  • Statistical: distribution drift vs baseline, min/max sanity, cardinality caps.
  • Freshness: max(ts_now - latest_materialized_ts) per feature.
  • Skew: training vs inference distribution differences.
  • Join coverage: rate of missing joins by entity and time.
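
A few of these checks in code. A sketch; the thresholds are illustrative policy, not universal rules:

import pandas as pd

def validate_features(df: pd.DataFrame) -> list[str]:
    failures = []
    # Schema: the required column must exist.
    if "user_7d_orders_count" not in df.columns:
        return ["missing column user_7d_orders_count"]
    # Value sanity: counts can never be negative.
    if (df["user_7d_orders_count"] < 0).any():
        failures.append("negative counts found")
    # Null rate: allow at most 1% nulls (an example policy).
    null_rate = df["user_7d_orders_count"].isna().mean()
    if null_rate > 0.01:
        failures.append(f"null rate {null_rate:.2%} exceeds 1%")
    return failures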

Checklist — run before shipping

  • Point-in-time filter verified
  • Null handling documented and tested
  • Windows and timezones clearly stated
  • Backfill reproducible
  • Unit tests with fixed fixtures
  • Monitoring alerts configured

Implementation patterns (batch + serving)

  • Single source-of-truth transformation: share the same logic between training backfills and batch inference.
  • Windowed aggregates: precompute per entity/time-bucket for speed, then roll up at scoring.
  • Idempotent writes: deterministic keys (entity, reference_ts, version) allow safe retries; a sketch follows below.
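
The idempotent-write pattern as code. A sketch using sqlite3 as a stand-in for whatever store you use; table and column names are illustrative:

import sqlite3

def upsert_features(conn: sqlite3.Connection, rows):
    # Rerunning a job overwrites rows for the same key, never duplicates.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS features (
            user_id TEXT NOT NULL,
            reference_ts TEXT NOT NULL,
            version INTEGER NOT NULL,
            user_7d_orders_count INTEGER,
            PRIMARY KEY (user_id, reference_ts, version)
        )
    """)
    conn.executemany(
        "INSERT OR REPLACE INTO features VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()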

Monitoring features in production

  • Freshness SLOs (e.g., 95% of hourly aggregates ready by T+20m).
  • Null/zero spikes detection and alerting.
  • Distribution drift: compare daily histograms to a moving baseline.
  • Latency and cost per job run; detect regressions after definition changes.
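
Two of these checks as standalone functions. A sketch; the SLO and tolerance values are illustrative:

from datetime import datetime, timedelta, timezone

def freshness_alert(latest_materialized_ts: datetime,
                    slo: timedelta = timedelta(minutes=20)) -> bool:
    # True if the feature table is staler than its freshness SLO.
    lag = datetime.now(timezone.utc) - latest_materialized_ts
    return lag > slo

def null_spike_alert(null_rate_today: float,
                     null_rate_baseline: float,
                     tolerance: float = 0.02) -> bool:
    # True if today's null rate jumped past the baseline by > tolerance.
    return (null_rate_today - null_rate_baseline) > tolerance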

Exercises

Do these to lock in the concepts. Anyone can do them; if you log in, your progress will be saved.

  1. Exercise 1 — Point-in-time SQL feature
    Design a query for a 14-day rolling unique_items_count per user at a given reference_ts. See full instructions below in Exercises section.
  2. Exercise 2 — Feature spec + validation
    Write a feature definition (YAML-like) and create validation rules for nulls, ranges, and freshness.

Exercise 1 details

Match the instructions in the Exercises panel below (ID: ex1). Implement the SQL and verify the expected output.

Exercise 2 details

Match the instructions in the Exercises panel below (ID: ex2). Produce a spec and example validation checks.

Common mistakes and how to self-check

  • Using now() in training features. Fix: reference per-row timestamp only.
  • Joining on latest snapshot without time filter. Fix: add event_time <= reference_time in join condition.
  • Forgetting timezone normalization. Fix: convert all timestamps to UTC and document it.
  • Leaky imputation (using global stats from full history). Fix: compute stats on training window only and version them.
  • Unbounded cardinality features (e.g., IDs one-hot). Fix: hash, bucket, or top-K with an OOV bucket (sketched below).
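
Both fixes for unbounded cardinality, sketched in code (the bucket count and vocabulary are illustrative):

import hashlib

def hash_bucket(value: str, n_buckets: int = 1024) -> int:
    # Stable hash bucket for high-cardinality IDs (the "hashing trick").
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def top_k_or_oov(value: str, vocab: set[str]) -> str:
    # Map anything outside the top-K vocabulary to a single OOV bucket.
    return value if value in vocab else "__OOV__"

# vocab would be the top-K values computed on the training window only.
vocab = {"US", "DE", "FR"}
print(top_k_or_oov("US", vocab), top_k_or_oov("KZ", vocab))  # US __OOV__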

Self-check prompt

Can I reproduce the same feature value for a given (entity, reference_ts) today and next month? If not, identify which step is nondeterministic.

Practical projects

  • Retail cohort features: 7/30/90-day spend and visit counts per user with a backfilled training set.
  • Fraud hourly rollups: 1h/6h/24h card activity features with lateness watermark and monitoring.
  • Content features: title/body text lengths, ratios, and simple embeddings materialized daily.

Who this is for

  • MLOps Engineers implementing training and batch inference pipelines.
  • Data Engineers collaborating on feature stores and ETL.
  • ML Engineers needing reproducible feature definitions.

Prerequisites

  • Comfort with SQL windowing and joins
  • Basic Python or similar for transforms
  • Familiarity with batch scheduling and data partitioning

Learning path

  1. Point-in-time correctness basics
  2. Windowed aggregation patterns
  3. Imputation and scaling consistency
  4. Backfilling and versioning
  5. Validation and monitoring

Next steps

  • Complete the two exercises and run the Quick Test below.
  • Add monitoring checks to a feature you already built.
  • Plan a backfill and a safe rollout of a new feature version.

Mini challenge

You must compute user_30d_avg_order_value for batch scoring at 02:00 UTC daily. Yesterday, late orders arrived at 03:00. What do you do?

  • Answer hint: materialize at 02:00 with a lateness watermark (e.g., trust only events with event_ts <= 00:00 UTC), then run a correction backfill for the affected window once the late data has landed, and version the output. A sketch of the cutoff logic follows.
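
A minimal sketch of the cutoff logic (the run time and the 2-hour watermark are illustrative and should match observed lateness):

from datetime import datetime, timedelta, timezone

# The 02:00 UTC run trusts only events older than the watermark.
run_ts = datetime(2026, 1, 4, 2, 0, tzinfo=timezone.utc)
watermark = timedelta(hours=2)
cutoff = run_ts - watermark  # include only events with event_ts <= cutoff

def in_scope(event_ts: datetime) -> bool:
    return event_ts <= cutoff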

Practice Exercises

2 exercises to complete

Instructions

You have tables:

  • events(user_id, item_id, event_ts)
  • training_points(user_id, reference_ts)

Goal: For each row in training_points, compute unique_items_count_14d = count of distinct item_id where event_ts is in (reference_ts - 14 days, reference_ts].

Write a single SQL-like query (conceptual is fine) that is point-in-time correct and returns (user_id, reference_ts, unique_items_count_14d).

Assume timestamps are UTC and properly typed.

Expected Output
A query that joins events to training_points on user_id and filters events where event_ts > reference_ts - 14 days AND event_ts <= reference_ts, grouping by user_id, reference_ts, with COUNT(DISTINCT item_id).

Feature Generation Steps — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
