
Feature Generation Steps

Learn Feature Generation Steps for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Why this matters

MLOps Engineers make features reliable, reproducible, and fast. Good features are the backbone of training pipelines and batch scoring. Your daily tasks include building point-in-time correct aggregates, handling late data, preventing leakage, and materializing features for both training and inference with identical logic.

  • Own feature definitions as code and version them.
  • Backfill features for historical training windows safely.
  • Detect and fix training–serving skew.
  • Monitor freshness, drift, and null spikes.

Mini task: Spot the risk

You compute "7-day purchases" using the latest table snapshot for both training and inference. Risk: data leakage in training, because the latest snapshot contains information that was not available at each historical training timestamp. Fix: compute the window strictly from data up to each row's training timestamp.

Concept explained simply

A feature is a computed column that summarizes raw data into a machine-usable signal. In batch pipelines, features are reproducible transformations with a time boundary.

Mental model: a kitchen with clear recipes. Each feature has a recipe card (contract), sketched in code after this list:

  • Name and business meaning
  • Entities/keys (e.g., user_id, card_id)
  • Time semantics (event_time, windows, timezone)
  • Definition (SQL/pseudocode), null handling, units
  • Validation checks and owner
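
For instance, the recipe card can live as a plain data structure in code. A minimal sketch; every field name here is illustrative rather than a real feature-store schema:

# A minimal feature contract as a plain Python dict.
user_7d_orders_count_spec = {
    "name": "user_7d_orders_count",
    "description": "Orders placed by the user in the trailing 7 days",
    "entities": ["user_id"],
    "event_time": "order_ts",  # column that drives point-in-time logic
    "window": "7 days",
    "timezone": "UTC",
    "definition": "COUNT(order_id) WHERE order_ts in (ref_ts - 7d, ref_ts]",
    "null_handling": "users with no orders => 0",
    "units": "count",
    "checks": ["value >= 0", "null_rate == 0"],
    "owner": "mlops-team",
    "version": 1,
}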

Core steps for feature generation

1) Define entities and timestamps

Choose primary keys and the event_time used for point-in-time correctness.

2) Collect and clean raw data

Deduplicate, standardize types/timezones, fill or flag missing fields.
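
A minimal cleaning step in pandas might look like this (a sketch; the orders table and its columns follow Example 1 below):

import pandas as pd

def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Parse timestamps and normalize everything to UTC.
    df["order_ts"] = pd.to_datetime(df["order_ts"], utc=True)
    # Standardize numeric types; unparseable values become NaN.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # Dedup rule is a design choice: keep the latest record per order_id.
    df = df.sort_values("order_ts").drop_duplicates("order_id", keep="last")
    # Flag rather than silently fill; imputation happens in a later step.
    df["amount_missing"] = df["amount"].isna()
    return df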

3) Join safely

Join on keys with time constraints; avoid looking ahead of the reference time.

4) Aggregate/encode

Rolling windows, counts, sums, ratios, one-hot/embeddings, text/numeric transforms.

5) Enforce time boundaries

All computations must use event_time <= reference_time, where reference_time is the training point's timestamp or the batch scoring time.

6) Impute and normalize

Apply stable imputation and scaling. Store parameters derived only from training data.
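
One way to keep this stable is to fit parameters on the training window only, persist them, and reuse the same stored values at inference. A minimal sketch (the file name and sample values are illustrative):

import json
import statistics

def fit_impute_scale_params(train_values):
    # Derive imputation and scaling parameters from training data only.
    observed = [v for v in train_values if v is not None]
    return {
        "impute_value": statistics.median(observed),
        "mean": statistics.fmean(observed),
        "std": statistics.stdev(observed),
    }

def apply_impute_scale(value, params):
    # Apply the same stored parameters at training and inference time.
    v = params["impute_value"] if value is None else value
    return (v - params["mean"]) / params["std"]

params = fit_impute_scale_params([10.0, None, 12.5, 9.0, 30.0])
with open("amount_scaler_v1.json", "w") as f:
    json.dump(params, f)  # version this file with the feature definition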

7) Validate and document

Run schema and value checks; version the definition and keep a changelog.

8) Materialize

Write features to a registry/feature store with keys, timestamps, and backfill as needed.

9) Monitor

Track freshness, null rates, distribution drift, and compute latency.

Worked examples

Example 1: E-commerce — user_7d_orders_count (point-in-time)
Tables:
  orders(user_id, order_id, amount, order_ts)
  training_points(user_id, reference_ts)

Goal:
  For each training point, count orders in (reference_ts - 7d, reference_ts].

SQL (conceptual):
SELECT t.user_id,
       t.reference_ts,
       COUNT(o.order_id) AS user_7d_orders_count
FROM training_points t
LEFT JOIN orders o
  ON o.user_id = t.user_id
 AND o.order_ts > t.reference_ts - INTERVAL '7 day'
 AND o.order_ts <= t.reference_ts
GROUP BY t.user_id, t.reference_ts;

Notes: No future orders included; handles users with zero orders via LEFT JOIN.

Example 2: Fraud — card_24h_amount_sum (hourly materialization)
Tables:
  transactions(card_id, amount, event_ts)

Approach:
  1) Bucket by hour: hour_ts = date_trunc('hour', event_ts)
  2) For each card_id and hour, compute rolling 24h sum at the end of the hour.

SQL (conceptual):
WITH hourly AS (
  SELECT card_id,
         date_trunc('hour', event_ts) AS hour_ts,
         SUM(amount) AS amt_hour
  FROM transactions
  GROUP BY 1,2
)
SELECT h1.card_id,
       h1.hour_ts AS reference_ts,
       (
         SELECT SUM(h2.amt_hour)
         FROM hourly h2
         WHERE h2.card_id = h1.card_id
           AND h2.hour_ts > h1.hour_ts - INTERVAL '24 hours'
           AND h2.hour_ts <= h1.hour_ts
       ) AS card_24h_amount_sum
FROM hourly h1;

Notes: Materialize per hour to a feature table keyed by (card_id, reference_ts).

Example 3: Text — title features (length, ratio, simple embedding)
Input:
  items(item_id, title_text, created_ts)

Outputs (per item_id, created_ts):
  - title_len_chars
  - title_alpha_ratio = letters / total_chars
  - title_embed_16 (placeholder: 16-dim avg of character code buckets)

Python (conceptual):
import re
import unicodedata

def simple_avg_bucket_embedding(text, dim=16):
    # Cheap deterministic placeholder: average one-hot of ord(char) mod dim.
    vec = [0.0] * dim
    if not text:
        return vec
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return [v / len(text) for v in vec]

for row in items:
    t = unicodedata.normalize("NFKC", row.title_text or "")
    n_chars = len(t)
    n_alpha = len(re.findall(r"[A-Za-z]", t))
    alpha_ratio = n_alpha / n_chars if n_chars else 0.0
    emit(item_id=row.item_id, ts=row.created_ts,
         title_len_chars=n_chars,
         title_alpha_ratio=alpha_ratio,
         title_embed_16=simple_avg_bucket_embedding(t))

Notes: Ensure deterministic normalization; store exact pre-processing rules.

Design decisions that matter

  • Granularity: per-event, per-entity-per-time bucket (hour/day), or snapshot. Finer granularity increases storage but reduces leakage risk.
  • Window size: shorter windows capture recency, longer windows improve stability. Consider multiple windows (7d, 30d); see the sketch after this list.
  • Imputation: constant vs model-based. Keep it stable between training and inference.
  • Materialization cadence: balance latency (freshness) vs cost.
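
For example, several window sizes can be rolled up from one hourly table. A pandas sketch; it assumes one row per user per hour with an orders count and a datetime-typed hour_ts, and the column names are assumptions:

import pandas as pd

def multi_window_counts(hourly: pd.DataFrame) -> pd.DataFrame:
    # Sort by time so each user's series is monotonic, then index by
    # hour_ts so time-based rolling windows work.
    out = hourly.sort_values("hour_ts").set_index("hour_ts")
    grouped = out.groupby("user_id")["orders"]
    # Offset windows like "7D" cover (t - 7 days, t], matching the
    # (reference_ts - window, reference_ts] convention used above.
    out["orders_7d"] = grouped.transform(lambda s: s.rolling("7D").sum())
    out["orders_30d"] = grouped.transform(lambda s: s.rolling("30D").sum())
    return out.reset_index()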

Time alignment and leakage

Rule: every feature must only use data whose event_time <= reference_time of the row being scored/trained. A point-in-time sketch follows the list below.

  • Late-arriving data: either reprocess with watermarking or accept eventual consistency.
  • Backfills: always recompute using historical snapshots or event logs, not today’s corrected state.
  • Labels: compute labels with a lookahead window that starts strictly after reference_time.
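
The same rule in dataframe code, mirroring Example 1's SQL. A pandas sketch, fine for small frames but not optimized for scale:

import pandas as pd

def user_7d_orders_count(training_points: pd.DataFrame,
                         orders: pd.DataFrame) -> pd.DataFrame:
    joined = training_points.merge(orders, on="user_id", how="left")
    in_window = (
        (joined["order_ts"] > joined["reference_ts"] - pd.Timedelta(days=7))
        & (joined["order_ts"] <= joined["reference_ts"])  # never look ahead
    )
    # Blank out-of-window orders instead of dropping rows, so users with
    # zero qualifying orders still appear with a count of 0.
    joined.loc[~in_window, "order_id"] = None
    return (joined.groupby(["user_id", "reference_ts"])["order_id"]
                  .nunique()
                  .reset_index(name="user_7d_orders_count"))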

Self-check: Is this definition safe?

A feature defined as “orders in the last 7 days relative to today()” is unsafe for training. It must reference the per-row reference_time, not wall-clock now.

Feature quality checks

  • Schema: types, ranges, allowed null rates.
  • Statistical: distribution drift vs baseline, min/max sanity, cardinality caps.
  • Freshness: max(ts_now - latest_materialized_ts) per feature.
  • Skew: training vs inference distribution differences.
  • Join coverage: rate of missing joins by entity and time.
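
A few of these checks in code. A sketch; the thresholds are illustrative policy, not universal rules:

import pandas as pd

def validate_features(df: pd.DataFrame) -> list[str]:
    failures = []
    # Schema: the required column must exist.
    if "user_7d_orders_count" not in df.columns:
        return ["missing column user_7d_orders_count"]
    # Value sanity: counts can never be negative.
    if (df["user_7d_orders_count"] < 0).any():
        failures.append("negative counts found")
    # Null rate: allow at most 1% nulls (an example policy).
    null_rate = df["user_7d_orders_count"].isna().mean()
    if null_rate > 0.01:
        failures.append(f"null rate {null_rate:.2%} exceeds 1%")
    return failures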

Checklist — run before shipping

  • Point-in-time filter verified
  • Null handling documented and tested
  • Windows and timezones clearly stated
  • Backfill reproducible
  • Unit tests with fixed fixtures
  • Monitoring alerts configured

Implementation patterns (batch + serving)

  • Single source-of-truth transformation: share the same logic between training backfills and batch inference.
  • Windowed aggregates: precompute per entity/time-bucket for speed, then roll up at scoring.
  • Idempotent writes: deterministic keys (entity, reference_ts, version) allow safe retries; a sketch follows below.
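
The idempotent-write pattern as code. A sketch using sqlite3 as a stand-in for whatever store you use; table and column names are illustrative:

import sqlite3

def upsert_features(conn: sqlite3.Connection, rows):
    # Rerunning a job overwrites rows for the same key, never duplicates.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS features (
            user_id TEXT NOT NULL,
            reference_ts TEXT NOT NULL,
            version INTEGER NOT NULL,
            user_7d_orders_count INTEGER,
            PRIMARY KEY (user_id, reference_ts, version)
        )
    """)
    conn.executemany(
        "INSERT OR REPLACE INTO features VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()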

Monitoring features in production

  • Freshness SLOs (e.g., 95% of hourly aggregates ready by T+20m).
  • Null/zero spikes detection and alerting.
  • Distribution drift: compare daily histograms to a moving baseline.
  • Latency and cost per job run; detect regressions after definition changes.
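
Two of these checks as standalone functions. A sketch; the SLO and tolerance values are illustrative:

from datetime import datetime, timedelta, timezone

def freshness_alert(latest_materialized_ts: datetime,
                    slo: timedelta = timedelta(minutes=20)) -> bool:
    # True if the feature table is staler than its freshness SLO.
    lag = datetime.now(timezone.utc) - latest_materialized_ts
    return lag > slo

def null_spike_alert(null_rate_today: float,
                     null_rate_baseline: float,
                     tolerance: float = 0.02) -> bool:
    # True if today's null rate jumped past the baseline by > tolerance.
    return (null_rate_today - null_rate_baseline) > tolerance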

Exercises

Do these to lock in the concepts. Anyone can do them; if you log in, your progress will be saved.

  1. Exercise 1 — Point-in-time SQL feature
    Design a query for a 14-day rolling unique_items_count per user at a given reference_ts. See full instructions below in Exercises section.
  2. Exercise 2 — Feature spec + validation
    Write a feature definition (YAML-like) and create validation rules for nulls, ranges, and freshness.

Exercise 1 details

Match the instructions in the Exercises panel below (ID: ex1). Implement the SQL and verify the expected output.

Exercise 2 details

Match the instructions in the Exercises panel below (ID: ex2). Produce a spec and example validation checks.

Common mistakes and how to self-check

  • Using now() in training features. Fix: reference per-row timestamp only.
  • Joining on latest snapshot without time filter. Fix: add event_time <= reference_time in join condition.
  • Forgetting timezone normalization. Fix: convert all timestamps to UTC and document it.
  • Leaky imputation (using global stats from full history). Fix: compute stats on training window only and version them.
  • Unbounded cardinality features (e.g., IDs one-hot). Fix: hash, bucket, or top-K with an OOV bucket (sketched below).
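
Both fixes for unbounded cardinality, sketched in code (the bucket count and vocabulary are illustrative):

import hashlib

def hash_bucket(value: str, n_buckets: int = 1024) -> int:
    # Stable hash bucket for high-cardinality IDs (the "hashing trick").
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def top_k_or_oov(value: str, vocab: set[str]) -> str:
    # Map anything outside the top-K vocabulary to a single OOV bucket.
    return value if value in vocab else "__OOV__"

# vocab would be the top-K values computed on the training window only.
vocab = {"US", "DE", "FR"}
print(top_k_or_oov("US", vocab), top_k_or_oov("KZ", vocab))  # US __OOV__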

Self-check prompt

Can I reproduce the same feature value for a given (entity, reference_ts) today and next month? If not, identify which step is nondeterministic.

Practical projects

  • Retail cohort features: 7/30/90-day spend and visit counts per user with a backfilled training set.
  • Fraud hourly rollups: 1h/6h/24h card activity features with lateness watermark and monitoring.
  • Content features: title/body text lengths, ratios, and simple embeddings materialized daily.

Who this is for

  • MLOps Engineers implementing training and batch inference pipelines.
  • Data Engineers collaborating on feature stores and ETL.
  • ML Engineers needing reproducible feature definitions.

Prerequisites

  • Comfort with SQL windowing and joins
  • Basic Python or similar for transforms
  • Familiarity with batch scheduling and data partitioning

Learning path

  1. Point-in-time correctness basics
  2. Windowed aggregation patterns
  3. Imputation and scaling consistency
  4. Backfilling and versioning
  5. Validation and monitoring

Next steps

  • Complete the two exercises and run the Quick Test below.
  • Add monitoring checks to a feature you already built.
  • Plan a backfill and a safe rollout of a new feature version.

Mini challenge

You must compute user_30d_avg_order_value for batch scoring at 02:00 UTC daily. Yesterday, late orders arrived at 03:00. What do you do?

  • Answer hint: materialize at 02:00 with a lateness watermark (e.g., trust only events with event_ts <= 00:00 UTC), then run a correction backfill for the affected window once the late data has landed, and version the output. A sketch of the cutoff logic follows.
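
A minimal sketch of the cutoff logic (the run time and the 2-hour watermark are illustrative and should match observed lateness):

from datetime import datetime, timedelta, timezone

# The 02:00 UTC run trusts only events older than the watermark.
run_ts = datetime(2026, 1, 4, 2, 0, tzinfo=timezone.utc)
watermark = timedelta(hours=2)
cutoff = run_ts - watermark  # include only events with event_ts <= cutoff

def in_scope(event_ts: datetime) -> bool:
    return event_ts <= cutoff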

Practice Exercises

2 exercises to complete

Instructions

You have tables:

  • events(user_id, item_id, event_ts)
  • training_points(user_id, reference_ts)

Goal: For each row in training_points, compute unique_items_count_14d = count of distinct item_id where event_ts is in (reference_ts - 14 days, reference_ts].

Write a single SQL-like query (conceptual is fine) that is point-in-time correct and returns (user_id, reference_ts, unique_items_count_14d).

Assume timestamps are UTC and properly typed.

Expected Output
A query that joins events to training_points on user_id and filters events where event_ts > reference_ts - 14 days AND event_ts <= reference_ts, grouping by user_id, reference_ts, with COUNT(DISTINCT item_id).

Feature Generation Steps — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
